Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

In this paper, we investigate mathematical content representations suitable for automated classification of, and similarity search in, STEM documents using two standard machine learning algorithms: Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers, with the Mathematics Subject Classification (MSC) as the reference classification and the standard precision, recall, and F1 metrics. The results give insight into how different math representations may influence the performance of classification and similarity search tasks in STEM repositories. Unsurprisingly, machine learning methods are able to capture distributional semantics from textual tokens. A proper selection of weighted tokens representing math may slightly improve the quality of the results. A structured math representation that imitates successful text-processing techniques is shown to yield better results than flat TeX tokens.
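As a minimal sketch of the evaluation metrics named in the abstract, the following Python computes precision, recall, and F1 for a single MSC class. The function name `precision_recall_f1` and the toy label lists are illustrative assumptions, not the authors' code or data; the top-level MSC codes are used only as example labels.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_class):
    """Return (precision, recall, F1) for one class, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: the class was predicted and is correct.
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    # False positives: the class was predicted but the true label differs.
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    # False negatives: the true label is the class but something else was predicted.
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical toy data: top-level MSC codes for six papers
# (05 = Combinatorics, 11 = Number theory, 60 = Probability theory).
true_msc = ["05", "05", "11", "60", "11", "05"]
pred_msc = ["05", "11", "11", "60", "05", "05"]

p, r, f1 = precision_recall_f1(true_msc, pred_msc, "05")
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

For a multi-class setting such as MSC, per-class scores like these are typically averaged across classes; the abstract does not say which averaging scheme the paper uses, so none is assumed here.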
