Keyword extraction is a fundamental task in natural language processing that maps documents to a concise set of representative single-word and multi-word phrases. Keywords are extracted from text documents primarily with supervised or unsupervised approaches. In this paper, we present an unsupervised technique that combines a theme-weighted personalized PageRank algorithm with neural phrase embeddings to extract and rank keywords. We also introduce an efficient way of processing text documents and training phrase embeddings using existing techniques. We share an evaluation dataset, derived from an existing dataset, that is used for choosing the underlying embedding model. Ranked keyword extraction is evaluated on two benchmark datasets, one comprising short abstracts (Inspec) and the other long scientific papers (SemEval 2010), and the technique is shown to produce better results than state-of-the-art systems.
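To make the ranking idea concrete, the following is a minimal sketch of personalized PageRank over a graph of candidate phrases, where the restart distribution is biased toward phrases that are similar to a document "theme" vector. It is an illustration under stated assumptions, not the implementation described in the paper: the function names, the toy two-dimensional embeddings, the choice of cosine similarity as the theme weight, and the edge weights are all hypothetical, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_phrases(embeddings, theme, edges, damping=0.85, iterations=50):
    """Rank candidate phrases with personalized PageRank (power iteration).

    embeddings: dict phrase -> vector (hypothetical phrase embeddings)
    theme:      vector standing in for the document theme (assumption)
    edges:      dict (phrase_a, phrase_b) -> positive weight, undirected
    """
    phrases = sorted(embeddings)

    # Theme weights: similarity to the theme vector, clipped at zero and
    # normalized so they form a restart (personalization) distribution.
    raw = {p: max(cosine(embeddings[p], theme), 0.0) for p in phrases}
    total = sum(raw.values())
    if total == 0:
        personalization = {p: 1.0 / len(phrases) for p in phrases}
    else:
        personalization = {p: raw[p] / total for p in phrases}

    # Build a symmetric weighted adjacency structure.
    neighbors = {p: {} for p in phrases}
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight
    out_weight = {p: sum(neighbors[p].values()) for p in phrases}

    scores = dict(personalization)
    for _ in range(iterations):
        # Score held by isolated phrases is redistributed via the restart
        # distribution so that the scores keep summing to one.
        dangling = sum(scores[p] for p in phrases if out_weight[p] == 0)
        updated = {}
        for p in phrases:
            incoming = sum(
                scores[q] * w / out_weight[q] for q, w in neighbors[p].items()
            )
            updated[p] = (1 - damping) * personalization[p] + damping * (
                incoming + dangling * personalization[p]
            )
        scores = updated

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy inputs, invented purely for illustration.
toy_embeddings = {
    "neural network": [1.0, 0.0],
    "deep learning": [0.9, 0.1],
    "coffee": [0.0, 1.0],
}
toy_theme = [1.0, 0.0]
toy_edges = {
    ("neural network", "deep learning"): 1.0,
    ("deep learning", "coffee"): 0.2,
}

ranking = rank_phrases(toy_embeddings, toy_theme, toy_edges)
for phrase, score in ranking:
    print(f"{phrase}: {score:.3f}")
```

In this sketch the theme weights only influence where the random walk restarts, while the edge weights control how score flows between phrases. A real system would obtain the embeddings from a trained phrase-embedding model and would need its own policy for building the candidate graph.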