Dimension scores are derived from public data and fields and weighted into the composite score. For reference only.
ClawMark is a benchmark for AI coworker agents launched by Evolvent AI/ClawMark Team. Its goal is not to provide a ready-to-use office agent product, but to systematically evaluate models’ long-horizon, multi-tool, multimodal collaboration capabilities in real-world workflows. It includes 100 tasks across 13 professional domains. Each task simulates 1–3 workdays, and the scenarios include research, content operations, HR, ecommerce, news, product management, and insurance.
Its core design is the “omni setting”: an agent must switch between environments such as the file system, email, a Notion mock, a Google Sheets mock, and Calendar, while processing raw evidence including screenshots, photos, PDFs, CSVs, audio, and video. Tasks also include implicit state changes over time, such as new emails arriving, database updates, and calendar adjustments, which test whether a model proactively refreshes external state and keeps making decisions as conditions change. Scoring is based entirely on 10–25 deterministic Python checkers, with no LLM-as-judge involved, so results are reproducible. The benchmark can also output per-checker pass/fail results, messages.jsonl, and the final workspace, making it easier to identify why a run failed.
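To make the rule-based scoring idea concrete, here is a minimal sketch of what deterministic checkers over a final workspace might look like. Everything in it is hypothetical: the checker names, the `report.csv` file, the expected total of 300, and the "fraction of checkers passed" aggregation are illustrative assumptions, not taken from ClawMark's repository. The point is only that each check is a plain Python predicate, so the same workspace always produces the same per-checker pass/fail result.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical checker type: inspects the final workspace, returns pass/fail.
Checker = Callable[[Path], bool]


def check_report_exists(workspace: Path) -> bool:
    """Pass if the agent left a report.csv in the workspace (assumed file name)."""
    return (workspace / "report.csv").is_file()


def check_total_matches(workspace: Path) -> bool:
    """Pass if the 'amount' column sums to the expected value (assumed: 300)."""
    path = workspace / "report.csv"
    if not path.is_file():
        return False
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    try:
        return sum(int(row["amount"]) for row in rows) == 300
    except (KeyError, ValueError):
        # Missing column or non-numeric cell counts as a failed check.
        return False


# Hypothetical registry of checkers for one task.
CHECKERS: Dict[str, Checker] = {
    "report_exists": check_report_exists,
    "total_matches": check_total_matches,
}


def run_checkers(workspace: Path, checkers: Dict[str, Checker]) -> Dict[str, bool]:
    """Run every checker and return per-checker pass/fail results."""
    return {name: fn(workspace) for name, fn in checkers.items()}


def score(results: Dict[str, bool]) -> float:
    """Assumed aggregation: fraction of checkers that passed."""
    return sum(results.values()) / len(results) if results else 0.0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "report.csv").write_text("item,amount\na,100\nb,200\n")
        results = run_checkers(ws, CHECKERS)
        print(results, score(results))
```

Because no model is involved in judging, a failed run can be traced to the specific named check that returned false, which is the debugging benefit the per-checker output is meant to provide.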
The page shows benchmark rankings, token usage, and estimated costs for GPT-5.4, Claude 4.6 Sonnet, Qwen 3.6 Plus, Gemini 3.1 Pro Preview, and MiniMax M2.7. ClawMark itself does not disclose any pricing. The project can be cloned from GitHub and run locally with Docker, but a full evaluation requires bringing your own model API keys and paying for model usage. The costs listed on the page range from about $53 for MiniMax M2.7 to about $946 for Claude 4.6 Sonnet.
Its strengths are that the tasks closely resemble real office work, cover multimodal inputs and cross-tool collaboration, and use rule-based scoring that is more stable than subjective review. It also provides a clear project structure, Quick Start, environment configuration, and smoke tests, making it suitable for reproducible experiments. The drawbacks are that it is not an out-of-the-box tool for ordinary users: it requires configuring Docker, uv, model APIs, and Notion and Google Sheets credentials. Full runs are not cheap, and the page does not explain commercial support, privacy compliance, or SLA.
ClawMark is better suited to AI Agent researchers, model evaluation teams, enterprise AI application labs, and framework developers. It can be used for model comparisons, validation of complex office automation, and regression testing. The page does not provide clear information about access from China. Because it depends on GitHub, some overseas model APIs, Notion/Google services, and similar resources, actual usability may be affected by network and payment conditions. Alternative or comparative benchmarks to watch include GAIA, WebArena, OSWorld, AgentBench, SWE-bench, and τ-bench.
⚠ This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the official claw-mark.com site.
claw-mark.com is a United States AI Apps provider. TG4G tracks its product information, an overall rating of 7.0/10, and a China-accessibility rating of "China direct-connect friendly".