Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as plagiarism detection and literature recommender systems, in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula notation to refer to prior knowledge. Our long-term goal is to generalize citation-based IR methods and apply the generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulas could be cited and define a Formula Concept Retrieval task with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). FCD aims to define and explore a 'Formula Concept' that names a bundle of equivalent representations of a formula, whereas FCR matches a given formula to a previously assigned unique mathematical concept identifier. We present machine learning-based approaches to address the FCD and FCR tasks and evaluate them on a standardized test collection (the NTCIR arXiv dataset). Our FCD approach yields a precision of 68% for retrieving equivalent representations of frequent formulas and a recall of 72% for extracting the formula name from the surrounding text. FCD and FCR enable the citation of formulas within mathematical documents and facilitate semantic search and question answering, as well as document similarity assessments for plagiarism detection or recommender systems.