Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. We argue that current MLLMs rely largely on visual recognition rather than visual reasoning to interpret charts, and that visual estimation of numerical values is one of the most fundamental capabilities in chart understanding that requires complex visual reasoning. To verify this, we introduce ChartVRBench, a benchmark meticulously designed to isolate and evaluate visual reasoning ability in chart understanding. Furthermore, we propose ChartVR-3B/7B, trained with a novel Visual Reasoning Reinforcement Finetuning (VR-RFT) strategy, to strengthen genuine chart visual reasoning abilities. Extensive experiments show that ChartVR achieves superior performance on ChartVRBench, outperforming even powerful proprietary models. Moreover, the visual reasoning skills cultivated by VR-RFT generalize strongly, yielding significant performance gains across a diverse suite of public chart understanding benchmarks. The code and dataset will be publicly released upon publication.