Mathematical symbols and their descriptions appear in various forms across document section boundaries without explicit markup. In this paper, we present Symlink, a new large-scale dataset that focuses on extracting symbols and their descriptions from scientific documents. Symlink annotates scientific papers from five domains (computer science, biology, physics, mathematics, and economics). Our experiments on Symlink demonstrate that the symbol-description linking task is challenging for existing models and call for further research in this area. We will publicly release Symlink to facilitate future research.