Effectively reading scientific plots, i.e., chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, still fall behind, with typical success rates of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often limited by the charts' inadequate similarity to real ones, which can compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improve chart understanding capabilities. Specifically, we design a five-step data synthesis pipeline: we separate data and function creation for single-plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the Effective Chart Dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets.
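To make the modular design concrete, here is a minimal sketch of how the first two steps might look with a matplotlib-based renderer; the function names (`make_data`, `make_plot`) and the style dictionary are illustrative assumptions, not the actual ECD pipeline code.

```python
# Minimal, illustrative sketch of the modular single-plot step described above:
# data creation is kept separate from the plotting function, so either can be
# varied independently; a later subplot is conditioned on an earlier one by
# reusing (and perturbing) its data. Names such as make_data/make_plot are
# hypothetical and not the paper's actual pipeline code.
import numpy as np
import matplotlib.pyplot as plt


def make_data(topic, n=50, seed=0):
    """Synthesize the underlying data for one plot (here: a toy random walk)."""
    rng = np.random.default_rng(seed)
    x = np.arange(n)
    y = np.cumsum(rng.normal(size=n))
    return {"x": x, "y": y, "label": topic}


def make_plot(data, ax, style):
    """Render one subplot from pre-generated data; visual style is injected separately."""
    ax.plot(data["x"], data["y"],
            color=style.get("color", "tab:blue"),
            linestyle=style.get("linestyle", "-"))
    ax.set_title(data["label"])
    ax.set_xlabel("step")
    ax.set_ylabel("value")


fig, axes = plt.subplots(1, 2, figsize=(8, 3))
first = make_data("astronomy", seed=1)
make_plot(first, axes[0], {"color": "tab:orange"})
# Condition the second subplot on the first by transforming its data.
second = {**first, "y": first["y"] * 1.5, "label": "astronomy (scaled)"}
make_plot(second, axes[1], {"linestyle": "--"})
fig.savefig("ecd_example.png", dpi=150)
```

Keeping data and rendering decoupled in this way makes it straightforward to diversify visual details (step 3) by sweeping the style dictionary, and to filter low-quality figures (step 4) before QA generation (step 5).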
Evaluation of multimodal LLM performance on scientific charts, using both descriptive and reasoning QA (success rate, %).
| Model | Reasoning | Descriptive | Average |
|---|---|---|---|
| o4-mini | 57.03 | 77.45 | 67.24 |
| o3 | 56.13 | 74.51 | 65.32 |
| Gemini-2.5-Pro | 44.36 | 76.88 | 60.62 |
| o1 | 40.52 | 74.18 | 57.35 |
| Claude-4-Sonnet | 44.20 | 69.36 | 56.78 |
| Claude-3.7-Sonnet | 43.38 | 69.61 | 56.50 |
| Claude-3.5-Sonnet | 41.99 | 68.14 | 55.07 |
| Qwen2.5-VL-72B | 38.81 | 68.46 | 53.64 |
| GPT-4o | 35.62 | 70.18 | 52.90 |
| GPT-4o-mini | 24.26 | 57.27 | 40.77 |
| Qwen2.5-VL-32B | 24.92 | 53.92 | 39.42 |
| Qwen2.5-VL-7B | 19.04 | 57.35 | 38.19 |
| Random (GPT-4o) | 4.58 | 1.63 | 3.10 |
Open-source MLLM performance before and after fine-tuning on ECD.

| Model | Reasoning (Before / After) | Descriptive (Before / After) | Average (Before / After) |
|---|---|---|---|
| LLaVA-Next-Llama3-8B | 4.74 / 16.50 | 17.16 / 46.65 | 10.95 / 31.58 |
| MiniCPM-V2.6 | 15.15 / 18.14 | 39.95 / 52.21 | 27.53 / 35.17 |
| Phi-3-Vision | 21.65 / 29.49 | 41.18 / 59.31 | 31.41 / 44.40 |
| Qwen2.5-VL-7B | 19.04 / 35.38 | 57.35 / 66.34 | 38.19 / 50.86 |
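In both tables, the Average column is the arithmetic mean of the Reasoning and Descriptive scores; a small sanity check (the `average` helper is purely illustrative):

```python
# Average = (Reasoning + Descriptive) / 2, values taken from the tables above.
def average(reasoning, descriptive):
    return round((reasoning + descriptive) / 2, 2)

print(average(57.03, 77.45))  # 67.24 -> o4-mini
print(average(35.38, 66.34))  # 50.86 -> Qwen2.5-VL-7B after ECD fine-tuning
```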
@inproceedings{yang2025effective,
title={Effective Training Data Synthesis for Improving MLLM Chart Understanding},
author={Yang, Yuwei and Zhang, Zeyu and Hou, Yunzhong and Li, Zhuowan and Liu, Gaowen and Payani, Ali and Ting, Yuan-Sen and Zheng, Liang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}