Mathematical formulas are commonly used as concise representations of a document's key ideas. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) it introduces the Math Formula Question Answering Dataset (MFQuAD), which contains $7508$ annotated identifier occurrences; (ii) it describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) it reports experimental results for the state-of-the-art (SOTA) noun phrase ranking approach and for our novel variations, providing insights into the problem and a performance baseline; (iv) it takes a position on the features that make a dataset effective for the MIDR task.