Dimension scores are derived from public data and listing fields, then weighted into the composite score. For reference only.
mage-bench is a large language model benchmarking and observability project that pits LLMs against each other in Magic: The Gathering using the complete XMage rules engine. It is not a simplified card simulator: it exposes real game states to the models, lets them choose legal actions, and has the engine resolve the consequences under the same rules human players follow.
The project focuses on testing complex strategic reasoning: hidden information, stack interaction, combat calculation, priority, chained side effects, and multi-turn planning. According to the website, Season 2 has featured 214 games, 36 models, and 5 formats, with rankings by format and by overall Elo. Its observability features cover completed matches and include replays, logs, derived statistics, and blunder analysis; the last of these estimates whether a model made strategically poor choices instead of looking only at wins and losses.
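For readers unfamiliar with Elo, the sketch below shows the standard two-player Elo update. The source text does not say how mage-bench computes its ratings, so this is an illustration of the general technique only: the function names, the K-factor of 32, and the 1500 starting rating are assumptions, not details taken from the project.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (A, B) ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k = 32 is an assumed K-factor, not a value published by mage-bench.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical models start level; the first wins one game.
    print(update_elo(1500.0, 1500.0, 1.0))
```

Because the update is zero-sum, the winner gains exactly what the loser gives up, and an upset against a higher-rated opponent moves both ratings more than an expected result does.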
Models appearing on the leaderboard include Claude Opus, GPT, Gemini, DeepSeek, Qwen, Llama, GLM, and Grok. The Season 1 champion was Gemini 3 Pro, while the current top of the Season 2 leaderboard includes Claude Opus 4.6, GPT-5.2, GPT-5.3 Codex, Gemini 3 Pro, and DeepSeek V3.2. The author explicitly states that even frontier models currently "play very poorly," so the benchmark is better suited to cross-model comparison than to finding the ultimate Magic: The Gathering bot.
The scraped text provides no information on pricing, free trials, payment methods, or commercial services. Technically, the project is a fork of XMage equipped with a harness that allows LLM agents to control decks via structured tools; however, the text does not mention any public API, SDK, or process for externally submitting models. It also says nothing about data privacy, log retention, or security policies.
Pros include a complex, dynamic, and highly competitive evaluation environment that reveals how models actually make decisions under long-term planning and rule constraints; Elo ratings, replays, and logs also support research and analysis. Cons are the high barrier to entry, the narrow use case, and the fact that the author currently considers the blunder analysis unreliable. There is also no information on Chinese-language support, pricing, or customer service. It is suitable for LLM researchers, agent developers, and model evaluation teams, but not for general productivity users.
The text provides no information on access from mainland China, ICP filing, payments, or mirrors, so china_access can only be marked as unknown. Alternatives include more general model evaluation systems such as Chatbot Arena, HELM, SWE-bench, and AgentBench.
⚠ This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the official mage-bench.com site.
TG4G lists mage-bench.com under the "Unknown Site Builders" category with an overall rating of 7.0/10. The listing labels it "China direct-connect friendly," but the source text contains no information on access from mainland China, so that label is unverified.