Chart question answering (CQA) is a task for assessing chart comprehension, which is fundamentally different from understanding natural images. CQA requires analyzing the relationships between the textual and visual components of a chart in order to answer general questions or infer numerical values. Most existing CQA datasets and models rely on simplifying assumptions that often enable surpassing human performance. In this work, we address this outcome and propose a new model that jointly learns classification and regression. Our language-vision setup uses co-attention transformers to capture the complex, real-world interactions between the question and the textual elements of the chart. We validate our design with extensive experiments on the realistic PlotQA dataset, outperforming previous approaches by a large margin, while showing competitive performance on FigureQA. Our model is particularly well suited for realistic questions with out-of-vocabulary answers that require regression.
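To make the joint classification-regression idea concrete, the PyTorch sketch below shows one possible way to combine cross-attention between question and chart-text features with two output heads: a classifier for in-vocabulary answers and a regressor for numeric, out-of-vocabulary values. It is a minimal illustration under our own assumptions; the module names, dimensions, and pooling choices are not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' code): a joint classification-regression head
# over question/chart-text representations fused by cross-attention.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class CoAttentionCQAHead(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=1000):
        super().__init__()
        # Co-attention: question tokens attend to chart text elements, and vice versa.
        self.q_to_chart = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.chart_to_q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Classification head for fixed, in-vocabulary answers.
        self.classifier = nn.Linear(2 * d_model, vocab_size)
        # Regression head for numeric, out-of-vocabulary answers.
        self.regressor = nn.Linear(2 * d_model, 1)

    def forward(self, question_feats, chart_text_feats):
        # question_feats:   (batch, n_question_tokens, d_model)
        # chart_text_feats: (batch, n_chart_elements, d_model)
        q_att, _ = self.q_to_chart(question_feats, chart_text_feats, chart_text_feats)
        c_att, _ = self.chart_to_q(chart_text_feats, question_feats, question_feats)
        # Pool each attended stream and concatenate into a joint representation.
        fused = torch.cat([q_att.mean(dim=1), c_att.mean(dim=1)], dim=-1)
        return self.classifier(fused), self.regressor(fused).squeeze(-1)


# Usage: answer_logits would drive a cross-entropy loss and value_pred a regression
# loss (e.g. smooth L1), trained jointly.
model = CoAttentionCQAHead()
answer_logits, value_pred = model(torch.randn(2, 12, 512), torch.randn(2, 30, 512))
```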