Comparing document semantics is one of the hardest tasks in both Natural Language Processing and Information Retrieval. To date, tools for this task remain rare, and most relevant methods are devised from a statistical or vector space model perspective; almost none take a topological perspective. In this paper, we offer a different approach: a novel algorithm based on topological persistence for comparing the semantic similarity of two documents. Our experiments are conducted on a document dataset with human judges' results. A collection of state-of-the-art methods is selected for comparison. The experimental results show that our algorithm produces results highly consistent with human judgments, and that it outperforms most state-of-the-art methods while tying with NLTK.