RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
Abstract
Vision-Language Models face significant challenges in generating complex multi-panel charts from real-world data, as demonstrated by a new large-scale benchmark that reveals performance gaps between proprietary and open-weight models.
Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.
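To make the abstract's multi-turn refinement setting concrete, here is a minimal, hypothetical sketch of what such an evaluation loop might look like in Python. The helpers `query_vlm` and `render_and_score` are assumed stand-ins for a VLM API call and a rendered-chart similarity metric, and `MAX_TURNS` is an assumed refinement budget; the paper's actual harness may differ.

```python
# Hypothetical sketch of a multi-turn chart-to-code evaluation loop.
# `query_vlm` and `render_and_score` are assumed callables, not part of
# the RealChart2Code release; they stand in for a VLM API and a
# rendered-chart similarity metric.
import subprocess
import tempfile
from pathlib import Path

MAX_TURNS = 3  # assumed refinement budget per benchmark instance


def run_candidate(code: str) -> tuple[bool, str]:
    """Execute generated plotting code in a subprocess; return (ok, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                ["python", str(script)],
                capture_output=True, text=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False, "execution timed out"
        return proc.returncode == 0, proc.stderr


def evaluate_instance(reference_image, raw_data_path, query_vlm, render_and_score):
    """Iteratively prompt a VLM to reproduce a reference chart from raw data."""
    prompt = (
        "Write matplotlib code that reproduces the attached chart "
        f"from the raw data in {raw_data_path}."
    )
    history = []
    for _ in range(MAX_TURNS):
        code = query_vlm(prompt, image=reference_image, history=history)
        ok, stderr = run_candidate(code)
        if ok:
            # Score the rendered output against the reference chart.
            return render_and_score(code, reference_image)
        # Feed execution errors back to the model for refinement.
        history.append((prompt, code))
        prompt = f"The previous code failed with:\n{stderr}\nPlease fix it."
    return 0.0  # no executable candidate within the turn budget
```

The design point this sketch illustrates is that feeding the interpreter's error output back into the conversation is what separates the multi-turn refinement setting from single-shot generation.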
Community
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation (2026)
- QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs (2026)
- UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities (2026)
- Comprehensive Evaluation of Large Language Models on Software Engineering Tasks: A Multi-Task Benchmark (2026)
- Towards Green AI: Decoding the Energy of LLM Inference in Software Development (2026)
- RAL-Bench: Benchmarking for Application-Level Functional Correctness and Non-Functional Quality Attributes (2026)
- Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering (2026)