Dimension scores are derived from public data and fields, then weighted into the composite score. For reference only.
MLEM Arena (Model Language Evaluation Matrix) is a blind A/B evaluation platform for small language models with ≤20B parameters. After a user enters a real prompt, two preconfigured models answer side by side. The platform hides the model identities before voting, and users can choose Model A, Model B, Tie, or Both Bad. Season 1 has been completed, covering 42 models and thousands of battles, and future runs are expected after new model releases.
The platform focuses on comparing small models such as Qwen, Gemma, Mistral, Llama, Phi, and DeepSeek, producing rankings across categories including coding, creative writing, math, translation, and general tasks. Each model starts with an Elo score of 1000, and scores are updated after voting using the standard Elo rule with K=24. Ties and “Both Bad” are treated as 0.5/0.5. Its main value is reducing prior bias from brand names and parameter counts through blind testing, making the results closer to users’ subjective experience.
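The rating update described above can be sketched in a few lines. This is a minimal illustration, not the site's actual code: it assumes the conventional Elo expected-score formula (base 10, scale 400), which the review implies by "standard Elo rule" but does not spell out. The function names and the outcome labels ("A", "B", "tie", "both_bad") are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumption: base 10 and scale 400, the usual Elo constants.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome: str, k: float = 24.0):
    """Return the new (rating_a, rating_b) after one battle.

    outcome labels are hypothetical: "A" (A wins), "B" (B wins),
    "tie", or "both_bad". Ties and "both_bad" both count as 0.5/0.5,
    as the review describes.
    """
    score_for_a = {"A": 1.0, "B": 0.0, "tie": 0.5, "both_bad": 0.5}[outcome]
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_for_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_for_a) - exp_b)
    return new_a, new_b


# Two fresh models both start at 1000; A wins the first battle.
a, b = update_elo(1000.0, 1000.0, "A")
print(a, b)  # 1012.0 988.0
```

Because a tie and "Both Bad" are scored identically, a "Both Bad" vote between equally rated models leaves both ratings unchanged, and between unequal models it moves the higher-rated one down slightly, exactly as a tie would.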
The main text does not mention any paid plans, so it can currently be understood as a free community project. The site code has been open-sourced on GitHub, and users can submit Issues or PRs to add missing models. On privacy, the platform says it only uses a session cookie to ensure more varied model matching in battles and does not track other information. Battle data is stored locally in SQLite, with plans to make it available for download once enough data has been collected.
Its strengths are a clear positioning, transparent rules, coverage across multiple task categories, and disclosure of the hardware used: Ryzen 7 7800X3D, RTX 4070 Ti SUPER, and 32GB RAM. This helps users understand the local inference context behind the results. The limitations are also clear: the platform currently runs on a personal home computer, so availability may fluctuate with traffic; Season 1 has ended, and the timing of the next round is uncertain; the model pool depends on the Ollama list and preconfiguration; and community voting results are useful for reference but cannot replace rigorously standardized benchmarks.
MLEM Arena is suitable for AI developers, researchers, and open-source model enthusiasts looking for subjective quality references when choosing small models, especially for comparing locally runnable models on coding, translation, and general tasks. The main text does not provide information about access from mainland China, payment, or network reachability, so China accessibility (china_access) is assessed as unknown. For more mature alternatives, consider lmarena.ai, Hugging Face Open LLM Leaderboard, and OpenCompass.
⚠ This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the official mlemarena.top site.
mlemarena.top is listed as an Unknown AI Apps provider. TG4G tracks its product information, an overall rating of 7.0/10, and a China-accessibility rating of "China direct-connect friendly".