Identifying the most significant concepts in unstructured data is critically important in many practical applications. Although a large number of methods have been proposed to extract the main topics of texts, only a few studies have probed the impact of text length on the performance of keyword extraction (KE) methods. In this study, we adopted a network-based approach to evaluate whether keywords extracted from paper abstracts are compatible with keywords extracted from full papers. We employed community detection methods to identify groups of related papers in citation networks. These paper clusters were then used to extract keywords from abstracts. Our results indicate that while the various community detection methods employed in our KE approach yielded similar levels of accuracy, a correlation analysis revealed that these methods produced distinct keyword lists for each abstract. However, all of the approaches considered reached low accuracy. Surprisingly, text clustering approaches outperformed all citation-based methods. The findings suggest that using different sources of information to extract keywords can lead to significant differences in performance. This effect can play an important role in applications that rely on the identification of relevant concepts.
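The pipeline summarized in the abstract (cluster papers in a citation network, then use each cluster to pick keywords from abstracts) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which community detection methods or scoring rule were used. The sketch assumes an undirected view of the citation network, a simple deterministic label propagation as the community detection step, and a hypothetical score that favors words concentrated in a paper's own community. The names `detect_communities`, `extract_keywords`, and `STOPWORDS` are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def detect_communities(nodes, edges, max_iter=100):
    """Deterministic label propagation on an undirected graph.

    Each node repeatedly adopts the most frequent label among its
    neighbours; ties are broken by taking the largest label. This is a
    stand-in for whatever community detection method is actually used.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {n: n for n in nodes}  # every paper starts in its own community
    for _ in range(max_iter):
        changed = False
        for n in sorted(nodes):
            if not neighbours[n]:
                continue  # isolated papers keep their own label
            counts = Counter(labels[m] for m in neighbours[n])
            best = max(counts.values())
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:
            break
    return labels


def extract_keywords(abstracts, labels, top_k=3):
    """Rank the words of each abstract using its citation community.

    Hypothetical score: (count in the abstract) * (share of the word's
    corpus occurrences that fall inside the paper's community).
    """
    tokens = {paper: tokenize(text) for paper, text in abstracts.items()}
    corpus_counts = Counter()
    community_counts = defaultdict(Counter)
    for paper, words in tokens.items():
        corpus_counts.update(words)
        community_counts[labels[paper]].update(words)

    keywords = {}
    for paper, words in tokens.items():
        tf = Counter(words)
        local = community_counts[labels[paper]]
        scores = {w: tf[w] * local[w] / corpus_counts[w] for w in tf}
        # Sort by descending score, then alphabetically for reproducibility.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[paper] = ranked[:top_k]
    return keywords
```

Calling `detect_communities` on a list of paper IDs and citation pairs, then passing the resulting labels to `extract_keywords` together with a dict of abstracts, returns a ranked keyword list per paper. Swapping in a different community detection method only changes the `labels` mapping, which is one way to compare the keyword lists that different methods produce for the same abstract.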