As the volume of available textual data grows, algorithms capable of automatically analyzing, categorizing, and summarizing these data have become a necessity. In this research we present a novel algorithm for keyword identification, i.e., the extraction of single- or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model overcomes deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction: it offers competitive and robust performance on a variety of datasets while requiring only a fraction of the manually labeled data needed by the best-performing systems. This study also offers a thorough error analysis that gives insight into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on overall performance.