Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generation capabilities. However, existing benchmarks employ a limited set of charts that deviate from real-world scenarios, making it difficult to accurately assess the chart comprehension of MLLMs. To overcome this constraint, we propose ChartBench, a comprehensive benchmark specifically designed to evaluate MLLMs' chart comprehension and data reliability through complex visual reasoning. ChartBench covers a wide spectrum of 42 chart categories, 2.1K charts, and 16.8K question-answer pairs. Unlike previous benchmarks, ChartBench avoids charts with direct data point annotations or metadata prompts; instead, it compels MLLMs to derive values as humans do, by leveraging inherent chart elements such as color, legends, and coordinate systems. Additionally, we propose an enhanced evaluation metric, Acc+, which enables the evaluation of MLLMs without labor-intensive manual effort or costly GPT-based evaluation. Our extensive experiments involve 12 widely used open-source MLLMs and 2 proprietary ones, revealing the limitations of MLLMs in interpreting charts and providing valuable insights to encourage closer scrutiny of this aspect.