Charts are a popular and effective form of data visualization. Chart question answering (CQA) is a task used for assessing chart comprehension, which is fundamentally different from understanding natural images. CQA requires analyzing the relationships between the textual and the visual components of a chart, in order to answer general questions or infer numerical values. Most existing CQA datasets and models are based on simplifying assumptions that often enable surpassing human performance. In this work, we further explore the reasons behind this outcome and propose a new model that jointly learns classification and regression. Our language-vision setup with co-attention transformers captures the complex interactions between the question and the textual elements, which commonly exist in real-world charts. We validate these conclusions with extensive experiments and breakdowns on the realistic PlotQA dataset, outperforming previous approaches by a large margin, while showing competitive performance on FigureQA. Our model's advantage is especially pronounced on questions with out-of-vocabulary answers, many of which require regression. We hope that this work will stimulate further research towards solving the challenging and highly practical task of chart comprehension.
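To make the joint classification-regression idea concrete, here is a minimal sketch, assuming pre-extracted question and chart-element features; the module names, layer sizes, and pooling scheme are illustrative assumptions, not the architecture described in this work.

```python
import torch
import torch.nn as nn

class JointChartQAHead(nn.Module):
    """Hypothetical sketch: fuse question and chart-element features with
    cross-attention, then predict both a classification answer (fixed
    vocabulary) and a regression value for numeric, out-of-vocabulary
    answers. All dimensions and names here are illustrative assumptions."""

    def __init__(self, dim=512, num_heads=8, vocab_size=1000):
        super().__init__()
        # Cross-attention: question tokens attend to chart-element tokens.
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(dim, vocab_size)   # answers from a fixed vocabulary
        self.reg_head = nn.Linear(dim, 1)            # numeric answers via regression

    def forward(self, question_feats, chart_feats):
        # question_feats: (B, Lq, dim); chart_feats: (B, Lc, dim)
        fused, _ = self.co_attn(question_feats, chart_feats, chart_feats)
        pooled = fused.mean(dim=1)                   # simple pooling over question tokens
        return self.cls_head(pooled), self.reg_head(pooled).squeeze(-1)

# Toy usage: classification logits handle in-vocabulary answers, while the
# regression output covers numeric values absent from the answer vocabulary.
model = JointChartQAHead()
q = torch.randn(2, 16, 512)   # dummy question features
c = torch.randn(2, 64, 512)   # dummy chart-element features
logits, value = model(q, c)
```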