Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
Abstract
Vision2Web is a comprehensive benchmark for visual website development that evaluates coding agents across static UI generation, interactive frontend reproduction, and full-stack development at varying complexity levels.
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
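The abstract describes a verification paradigm that combines a GUI agent verifier (functional checks) with a VLM-based judge (visual assessment). The paper's actual implementation is not shown here; the sketch below is purely illustrative of how two such complementary signals could be combined into one score, with all names, weights, and scoring conventions being assumptions rather than the benchmark's real design.

```python
# Illustrative sketch (NOT the paper's implementation) of combining a
# functional GUI-agent verifier with a VLM-based visual judge.
# All function names, weights, and data shapes are hypothetical.

from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    passed: bool  # outcome of one scripted GUI interaction check

def gui_agent_verify(cases: list[TestCase]) -> float:
    """Fraction of interaction test cases that pass (0.0 if none)."""
    if not cases:
        return 0.0
    return sum(c.passed for c in cases) / len(cases)

def vlm_judge(similarity: float) -> float:
    """Clamp a VLM-produced visual-similarity score to [0, 1]."""
    return max(0.0, min(1.0, similarity))

def combined_score(cases: list[TestCase], similarity: float,
                   w_func: float = 0.5, w_vis: float = 0.5) -> float:
    """Weighted blend of functional and visual verification signals."""
    return w_func * gui_agent_verify(cases) + w_vis * vlm_judge(similarity)

cases = [TestCase("login", True), TestCase("search", True), TestCase("cart", False)]
print(round(combined_score(cases, 0.9), 3))  # 2/3 * 0.5 + 0.9 * 0.5 = 0.783
```

The equal weighting here is arbitrary; a real harness would likely weight functional test outcomes and visual fidelity per task level.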
Community
An interesting breakdown of this paper is available on arXivLens: https://arxivlens.com/PaperView/Details/vision2web-a-hierarchical-benchmark-for-visual-website-development-with-agent-verification-8720-222c5368
It covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants (2026)
- BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents (2026)
- LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces (2026)
- WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing (2026)
- ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation (2026)
- PaperVoyager: Building Interactive Web with Visual Language Models (2026)
- VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:

hf papers read 2603.26648

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 0