Dimension scores are derived from public data and listed fields, then weighted into the composite rating. They are for reference only.
PX-bench is a product-experience evaluation benchmark for AI coding agents from Chordio. Rather than focusing on isolated metrics such as whether the code runs or whether the interface looks good, it evaluates whether an agent can add features to an existing product codebase while handling structure, state, copy, conventions, and accessibility like an experienced product designer. The evaluation simulates a real development environment using held-out multi-screen host applications, ordinary product briefs, sealed containers, and follow-up audits.
Its evaluation dimensions are divided into 8 categories: Intent fidelity, Product fit, Visual craft, Convention adherence, Pathway completeness, Content & language, Resilience, and Accessibility. Product fit and Convention adherence are especially valuable: the agent must understand the existing app’s local rules around drawers, naming, design tokens, data persistence, and more, rather than generating a superficially plausible UI on a blank canvas. Scoring is carried out through a scoring agent, script-based checks such as axe-core/structural-diff, automated navigation screenshots, and independent auditors. Public scenarios are based on Inspect AI and can be rerun.
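As an illustration of how a per-category report over these eight categories might be rolled up into a single number, here is a minimal sketch. The source does not say that PX-bench computes a composite or what weights it would use, so the `composite` function, the 0 to 1 score scale, and the equal default weights are all assumptions made for the example.

```python
from typing import Dict, Optional

# The eight PX-bench categories named in the review.
CATEGORIES = (
    "Intent fidelity",
    "Product fit",
    "Visual craft",
    "Convention adherence",
    "Pathway completeness",
    "Content & language",
    "Resilience",
    "Accessibility",
)


def composite(scores: Dict[str, float],
              weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of per-category scores (hypothetical aggregation).

    Scores are assumed to lie in [0, 1]. Weights default to equal,
    which is an assumption and not a documented PX-bench rule.
    """
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    if weights is None:
        weights = {c: 1.0 for c in CATEGORIES}
    total = sum(weights[c] for c in CATEGORIES)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[c] * weights[c] for c in CATEGORIES) / total
```

A team that cares most about Product fit and Convention adherence could pass a `weights` dictionary that gives those two categories a larger share, while keeping the per-category scores visible alongside the composite.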
The page does not disclose pricing, plans, payment methods, or SLA details. It provides only a contact entry point and an email address (obscured in the source) for private PX-bench evals. Whether PX-bench itself is open source is not stated, though its harness uses the open-source Inspect AI. The host applications and official scored scenarios are held out and rotated, which helps prevent models from memorizing the benchmark, but also means external parties cannot fully reproduce the private evaluations.
Its strengths are a rigorous methodology and close alignment with real product iteration. It can surface empty states, error states, long-content handling, mobile behavior, and accessibility issues that are hard to expose in happy-path demos. Reports can be broken down by category, making them useful for agent R&D iteration. The downside is that the current information reads more like an evaluation-methodology paper: it lacks API and SDK documentation, integration guides, CI integration, sample reports, and pricing details, so the barrier to a trial for ordinary teams is unclear.
PX-bench is better suited to AI coding agent vendors, model evaluation teams, and R&D organizations with mature frontend product-experience standards. It can be used to compare model versions and verify whether cost reduction or speed gains come at the expense of user experience. There is no evidence in the main text about access from China, so it should be considered unknown for now; payment methods are also not disclosed. If direct access is unavailable, teams can build an internal alternative benchmark using Inspect AI, Playwright, axe-core, and manual design reviews.
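To make the model-version comparison concrete, the following sketch flags categories where a candidate version scores worse than a baseline. It is not part of PX-bench: the `find_regressions` helper, the 0 to 1 score scale, and the `tolerance` threshold are hypothetical, and the scores would come from whatever internal or external evaluation a team runs.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Return categories whose score dropped by more than `tolerance`.

    Both inputs map a category name to a score, assumed to be in [0, 1].
    The result maps each regressed category to its (negative) delta.
    Categories absent from the candidate report are skipped.
    """
    regressions = {}
    for category, base_score in baseline.items():
        if category not in candidate:
            continue
        delta = candidate[category] - base_score
        if delta < -tolerance:
            regressions[category] = round(delta, 4)
    return regressions


# Example: a cheaper model version that loses ground on accessibility.
baseline = {"Accessibility": 0.8, "Resilience": 0.7}
candidate = {"Accessibility": 0.7, "Resilience": 0.71}
print(find_regressions(baseline, candidate))
```

Checking each category separately, instead of a single composite, keeps a drop in one area such as Accessibility from being hidden by gains elsewhere.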
⚠ This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the official chordio.com site.
chordio.com is listed as an AI Apps provider. TG4G tracks its product information, an overall rating of 7.0/10, and a China-accessibility score of Workable.