Given the increase in publications, searching for relevant papers becomes tedious. In particular, search across disciplines or schools of thought is not supported. This is mainly due to retrieval with keyword queries: technical terms differ between sciences and change over time. Relevant articles might be better identified by their mathematical problem descriptions. Just looking at the equations in a paper already hints at whether the paper is relevant. Hence, we propose a new approach for the retrieval of mathematical expressions based on machine learning. We design an unsupervised representation learning task that combines embedding learning with self-supervised learning. Using graph convolutional neural networks, we embed mathematical expressions into low-dimensional vector spaces that allow efficient nearest-neighbor queries. To train our models, we collect a large dataset of over 29 million mathematical expressions from over 900,000 publications on arXiv.org. The math is converted into an XML format, which we view as graph data. Our empirical evaluations, involving a new dataset of manually annotated search queries, show the benefits of using embedding models for mathematical retrieval. This work was originally published at KDD 2020.
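To illustrate the retrieval step the abstract describes: once every expression is embedded into a low-dimensional vector space, search reduces to a nearest-neighbor query over those vectors. The sketch below uses random vectors as stand-ins for the trained GCN embeddings (the corpus size, dimensionality, and function names are illustrative, not the paper's actual setup):

```python
import numpy as np

# Stand-in embeddings: in the paper these would come from a graph
# convolutional network applied to XML-derived expression graphs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))  # 1000 expressions, 64-dim vectors
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize

def nearest(query, k=5):
    """Return indices of the k expressions most similar to `query`
    under cosine similarity (dot product of unit vectors)."""
    q = query / np.linalg.norm(query)
    sims = corpus @ q                 # cosine similarity to every expression
    return np.argsort(-sims)[:k]     # indices of the top-k matches

hits = nearest(corpus[42])
# An expression is always its own nearest neighbor.
assert hits[0] == 42
```

In practice, exact brute-force search like this is replaced by an approximate nearest-neighbor index for large corpora, but the interface is the same: embed the query expression, then rank the corpus by vector similarity.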