WcodeW is a "web → code → web" closed-loop benchmark and viewer designed to measure how faithfully an LLM agent can recreate real web pages from static specifications, without access to the actual URL. It first archives the real page with Playwright, then asks the agent to generate a single-file, self-contained index.html, and finally compares the generated result against the archived snapshot item by item.
Its evaluation covers four main dimensions: visual, DOM, interaction, and agentic judge. Visual SSIM carries a 50% weight, DOM similarity 30%, interaction execution and post-state matching 5%, and LLM-based semantic judgment 15%. The viewer also provides a per-pixel diff percentage, making it easier to quickly identify differences visible to the naked eye. The frontend supports four modes: iframe, screenshot, diff, and code. Users can browse different bundles, viewports, and scroll steps via a slider, matrix view, and gallery.
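To make the weighting concrete, the composite can be read as a weighted sum of the four dimensions. The sketch below is only an illustration under assumptions: it presumes each dimension is already normalized to the range 0 to 1, and the function name `composite_score` and the dictionary keys are hypothetical, not taken from the project.

```python
# Weights as stated in the review; the key names are illustrative assumptions.
WEIGHTS = {
    "visual": 0.50,       # visual SSIM
    "dom": 0.30,          # DOM similarity
    "interaction": 0.05,  # interaction execution and post-state matching
    "judge": 0.15,        # LLM-based semantic judgment
}


def composite_score(scores: dict) -> float:
    """Return the weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be within [0, 1]")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, purely to show the arithmetic:
# 0.50*0.8 + 0.30*0.6 + 0.05*1.0 + 0.15*0.5 = 0.705
example = {"visual": 0.8, "dom": 0.6, "interaction": 1.0, "judge": 0.5}
print(round(composite_score(example), 3))
```

Because interaction carries only 5%, a page that looks right but fails every interaction check can still score up to 0.95 under this reading, which is worth keeping in mind when comparing agents.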
The project uses Playwright to capture the DOM, accessibility tree, network responses, and screenshots. The viewer is built with plain HTML, ES modules, and CSS, with no build dependency. It can run offline or be deployed to GitHub Pages. The data layer provides wclone-export.csv and multiple JSON indexes, making it suitable for import into spreadsheets or pandas for further analysis. The source code is released on GitHub under the MIT license and has the basics needed for self-hosting.
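As a rough idea of what the follow-up analysis could look like, the sketch below averages scores per agent using only the Python standard library. The column names (`bundle`, `agent`, `score`) and the sample rows are placeholders invented for illustration; the real schema of wclone-export.csv is not described in the source and may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows standing in for wclone-export.csv. The columns and values
# are assumptions made for this example, not the project's actual schema.
SAMPLE_CSV = """bundle,agent,score
news-home,agent-a,0.72
news-home,agent-b,0.64
shop-cart,agent-a,0.58
shop-cart,agent-b,0.70
"""


def mean_score_by(rows, key):
    """Group rows by the given column and return the mean of 'score' per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row["score"]))
    return {name: mean(values) for name, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_agent = mean_score_by(rows, "agent")
for agent, avg in sorted(by_agent.items()):
    print(f"{agent}: {avg:.2f}")
```

With a real export, the in-memory string would be replaced by `open("wclone-export.csv", newline="")`, and the same grouping could be done in pandas with `read_csv` followed by `groupby`, once the actual column names are known.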
No commercial pricing or paid plans are mentioned in the main text; as an MIT-licensed open-source project, it is free to use. The documentation explains scoring weights, view modes, limitations, keyboard shortcuts, and data export fairly clearly. However, adding a new bundle or agent run requires placing HTML files in the correct directories and running scripts, and bundle creation details are only referenced via the annotator playbook. Overall, it still feels more like a research and engineering tool.
Its strengths include a transparent evaluation workflow, strong visual comparison, analysis-friendly metric exports, and simple deployment. The limitations are also clear: it evaluates static visual replication, not a production-ready functional replacement; JavaScript is disabled in sandboxed iframes, so complex dynamic effects cannot be covered; and pixel diffs are sensitive to tiny shifts. It is best suited for LLM Agent researchers, AI coding tool teams, and evaluators of web page generation models.
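The sensitivity of pixel diffs to tiny shifts is easy to demonstrate. The sketch below is a toy illustration, not the viewer's actual diff code: images are modeled as nested lists of pixel values, and the helper names `pixel_diff_percent` and `shift_right` are hypothetical.

```python
def pixel_diff_percent(a, b):
    """Percentage of pixel positions whose values differ between two equal-sized grids."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must not be empty")
    changed = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 100.0 * changed / total


def shift_right(img, fill=0):
    """Shift every row one pixel to the right, padding the left edge with `fill`."""
    return [[fill] + row[:-1] for row in img]


# An 8x8 grid of one-pixel-wide vertical stripes: 0, 1, 0, 1, ...
stripes = [[x % 2 for x in range(8)] for _ in range(8)]
shifted = shift_right(stripes)

print(pixel_diff_percent(stripes, stripes))  # identical images: 0.0
print(pixel_diff_percent(stripes, shifted))  # one-pixel shift: 87.5
```

A layout that is visually almost identical but offset by a single pixel changes 7 of every 8 pixels in this fine-striped example, which is why a raw diff percentage should be read alongside SSIM rather than on its own.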
The main text does not provide information about mainland China network access, mirrors, payment, or service support, so site accessibility can only be marked as unknown. If GitHub access is unstable, self-hosting the static files may be more reliable. Alternative directions include WebArena, VisualWebArena, and BrowserGym; for visual regression scenarios, Playwright screenshot diff, Percy, or Chromatic can be used.
This review is compiled from public sources and does not constitute a purchase recommendation. Verify all facts on the official site-bench.com site.
site-bench.com is a United States Dev Tools provider. TG4G tracks its product information, an overall rating of 7.0/10, and a China-accessibility rating of "China direct-connect friendly".