Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, on specialized neural language models, or on a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient while maintaining retrieval performance. One property common to graph-based methods is that they immediately convert the token space into a graph, which is then processed further. In this paper, we explore a novel unsupervised approach that merges parts of a document in sequential form prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers the frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprising 14 million documents in less than a minute.
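The ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sub-phrase merging step is omitted, the graph simply links adjacent tokens, and the personalization vector (token frequency multiplied by token length) is an assumption chosen to mirror the abstract's wording. The names `token_graph` and `personalized_pagerank` are hypothetical.

```python
from collections import Counter, defaultdict


def token_graph(tokens):
    """Build a directed graph whose edge weights count adjacent token pairs."""
    edges = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from immediate repetitions
            edges[left][right] += 1
    return edges


def personalized_pagerank(tokens, damping=0.85, iterations=50):
    """Rank tokens by power iteration with a non-uniform restart vector."""
    edges = token_graph(tokens)
    counts = Counter(tokens)
    nodes = sorted(counts)

    # Assumed personalization: frequency times token length, normalized to 1.
    raw = {node: counts[node] * len(node) for node in nodes}
    total = sum(raw.values())
    prior = {node: raw[node] / total for node in nodes}

    rank = dict(prior)
    for _ in range(iterations):
        # Restart mass goes to each node in proportion to its prior.
        updated = {node: (1.0 - damping) * prior[node] for node in nodes}
        for node in nodes:
            out = edges.get(node)
            out_weight = sum(out.values()) if out else 0
            if out_weight == 0:
                # Dangling node: spread its rank according to the prior.
                for target in nodes:
                    updated[target] += damping * rank[node] * prior[target]
            else:
                for target, weight in out.items():
                    updated[target] += damping * rank[node] * weight / out_weight
        rank = updated
    return rank


if __name__ == "__main__":
    text = "graph based keyword extraction builds a token graph and ranks graph nodes"
    scores = personalized_pagerank(text.split())
    for token, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{token}\t{score:.3f}")
```

Because the restart vector sums to one and dangling mass is redistributed through the same vector, the scores remain a probability distribution after every iteration. This makes scores comparable across documents of different lengths.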