What is mage-bench.com?

mage-bench.com is listed under the Site Builders category, with its headquarters location unknown. It is a rare LLM evaluation project, worth following for AI researchers.

Is mage-bench.com good? Is it worth it?

mage-bench.com scores 7.0/10 on TG4G, a solid rating. Its headquarters location is listed as Unknown. See the in-depth review below for pros, cons and China accessibility.

Is mage-bench.com usable in China?

mage-bench.com offers good direct-connect performance in mainland China and works in most regions without a proxy. The provider's headquarters location is unknown, and it primarily serves overseas markets.

How do I sign up for mage-bench.com?

Visit the mage-bench.com official site to complete sign-up. Registration typically requires an email (Gmail/Outlook recommended) and a payment method. Most overseas services accept credit card / PayPal / crypto. See the "Visit Official Site" button on this page for the direct link.

🧱 Site Builders 📍 HQ: Unknown

mage-bench.com

Name: mage-bench.com
Brand: mage-bench.com
Rating: 7.0 (1 review)

Overall Rating

★★★⯨☆ 7.0/10

China Access

★★★ China direct-connect friendly

Data source

ai_crawl · Last updated 2026-06-12

⚡ Score breakdown

5-dim weighted · /10

Performance (25%): 7.0

Value (20%): 7.0

China access (20%): 10.0

Reputation (20%): 6.0

Support (15%): 6.5

Dimension scores are derived from publicly available data and weighted into the composite. For reference only.
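As a rough check, the weighted composite can be recomputed from the listed weights and scores. The sketch below assumes a plain weighted average; the page does not state the exact formula, and the function and variable names are illustrative only.

```python
# Weights and scores copied from the score breakdown on this page.
# Assumption: the composite is a plain weighted average of the five dimensions.
DIMENSIONS = {
    "Performance": (0.25, 7.0),
    "Value": (0.20, 7.0),
    "China access": (0.20, 10.0),
    "Reputation": (0.20, 6.0),
    "Support": (0.15, 6.5),
}


def weighted_composite(dimensions):
    """Return the weighted average of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in dimensions.values())
    weighted_sum = sum(weight * score for weight, score in dimensions.values())
    return weighted_sum / total_weight


composite = weighted_composite(DIMENSIONS)
print(f"{composite:.3f}")
```

Under this assumption the result is about 7.3, not the displayed 7.0/10. The page does not explain the gap, so the published figure may involve rounding or adjustments that are not documented here.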

Editorial Highlights

A rare LLM evaluation project, worth following for AI researchers.

In-Depth Review · TG4G Review · 2026-06-07 · For reference only

What It Is

mage-bench is a large language model benchmarking and observability project that pits LLMs against each other in Magic: The Gathering using the complete XMage rules engine. It is not a simplified card simulator; instead, it exposes real game states to the models, lets them choose legal actions, and has the engine resolve the consequences according to human game rules.

Core Capabilities & Evaluation Dimensions

The project focuses on testing complex strategic reasoning: hidden information, stack interaction, combat calculation, priority, chained side effects, and multi-turn planning. The website shows Season 2 has featured 214 games, 36 models, and 5 formats, providing rankings by format and overall Elo. Its observability features include replays, logs, derived statistics, and blunder analysis for completed matches, which estimates whether a model made strategically poor choices rather than just looking at wins and losses.
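For readers unfamiliar with Elo, the standard two-player update can be sketched as follows. This is the textbook formula, not necessarily what mage-bench uses: the review does not describe the site's rating method, and the K-factor of 32 and the starting rating of 1500 are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor; the real value used by the site is unknown.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Two-player Elo is zero-sum: B loses exactly what A gains.
    return rating_a + delta, rating_b - delta


# Two models start level; model A wins one game.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

With equal starting ratings the expected score is 0.5, so the winner gains half of K and the loser drops by the same amount. Formats with more than two players would need a different scheme.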

Models & Results

Models appearing on the leaderboard include Claude Opus, GPT, Gemini, DeepSeek, Qwen, Llama, GLM, and Grok. The Season 1 champion was Gemini 3 Pro, while the current top of the Season 2 leaderboard includes Claude Opus 4.6, GPT-5.2, GPT-5.3 Codex, Gemini 3 Pro, and DeepSeek V3.2. The author explicitly states that even frontier models currently "play very poorly," so the project is better suited to cross-model comparison than to finding the ultimate Magic: The Gathering bot.

Pricing, Integration & Privacy

The scraped text provides no information on pricing, free trials, payment methods, or commercial services. Technically, the project is a fork of XMage equipped with a harness that allows LLM agents to control decks via structured tools; however, it does not mention any public API, SDK, or process for externally submitting models. The text also lacks explanations regarding data privacy, log retention, or security policies.
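To make the "structured tools" idea concrete, here is a minimal sketch of how a harness might offer legal actions to an agent and guard against illegal picks. Everything in it is hypothetical: the Action type, the choose_action function and the fallback rule are assumptions for illustration and are not taken from the mage-bench or XMage code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical legal action offered to the agent by the rules engine."""
    action_id: int
    description: str


def choose_action(legal_actions, pick):
    """Ask the agent for an action id and validate it.

    pick is any callable that receives the legal actions and returns an id;
    in a real harness it would wrap an LLM tool call. If the returned id is
    not legal, fall back to the first legal action (an assumed policy).
    """
    legal_ids = {action.action_id for action in legal_actions}
    chosen = pick(legal_actions)
    if chosen not in legal_ids:
        chosen = legal_actions[0].action_id
    return chosen


legal = [Action(0, "pass priority"), Action(1, "cast a spell")]
valid_choice = choose_action(legal, lambda actions: 1)
invalid_choice = choose_action(legal, lambda actions: 99)
print(valid_choice, invalid_choice)
```

The point of the design is that the model only ever selects among options the engine has already declared legal, and the engine, not the model, resolves what happens next.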

Pros, Cons & Who It's For

Pros include a complex, dynamic, and highly competitive evaluation environment that shows how models actually make decisions under long-term planning and rule constraints; Elo ratings, replays, and logs also support research and analysis. Cons are the high barrier to entry, the narrow use case, and the fact that the author currently considers blunder analysis unreliable, along with a lack of Chinese-language, pricing, and support information. It is suitable for LLM researchers, agent developers, and model evaluation teams, but not for general productivity users.

Access from China

The text provides no information on access from mainland China, ICP filing, payments, or mirrors, so China accessibility can only be marked as unknown. For alternatives, you can refer to more general model evaluation systems such as Chatbot Arena, HELM, SWE-bench, and AgentBench.

⚠ This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the mage-bench.com official site.

About this entry

mage-bench.com is listed under Site Builders, with its headquarters location unknown. TG4G tracks its product information, an overall rating of 7.0/10, and a China-accessibility rating of "China direct-connect friendly". Click "Visit Official Site" to reach mage-bench.com directly.