In this paper, we investigate representations of mathematical content suitable for automated classification of, and similarity search in, STEM documents using two standard machine learning algorithms: Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers, with the Mathematics Subject Classification (MSC) as the reference classification and with the standard precision, recall, and F1-measure metrics. The results give insight into how different math representations may influence the performance of classification and similarity search tasks in STEM repositories. Unsurprisingly, machine learning methods are able to capture distributional semantics from textual tokens. A proper selection of weighted tokens representing math may improve the quality of the results slightly. A structured math representation that applies successful text-processing techniques to math is shown to yield better results than flat TeX tokens.
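To make the evaluation metrics concrete, the following is a minimal sketch of how per-class precision, recall, and F1 could be computed against a reference classification. The function name `precision_recall_f1` and the toy labels (two-digit top-level MSC codes) are illustrative assumptions, not the evaluation code or data used in the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Per-class precision, recall, and F1 for one label (hypothetical helper)."""
    pairs = list(zip(gold, predicted))
    # True positives: the label was predicted and is also the reference label.
    tp = sum(1 for g, p in pairs if g == label and p == label)
    # False positives: the label was predicted but the reference differs.
    fp = sum(1 for g, p in pairs if g != label and p == label)
    # False negatives: the reference is the label but something else was predicted.
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (assumed for illustration): reference and predicted top-level MSC codes.
gold = ["11", "11", "35", "35", "60"]
predicted = ["11", "35", "35", "35", "11"]

for msc in ("11", "35", "60"):
    p, r, f = precision_recall_f1(gold, predicted, msc)
    print(f"MSC {msc}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy example, class 35 has perfect recall but lower precision because one class-11 paper is assigned to it, which is the kind of trade-off the F1 measure summarizes.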