Transformer-based architectures in natural language processing impose input-size limits that become problematic when long documents must be processed. This paper addresses this issue for keyphrase extraction by chunking long documents while keeping a global context as a query that defines the topic for which relevant keyphrases should be extracted. The developed system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase. We experimented with various context sizes on two popular datasets, Inspec and SemEval, and on a large novel dataset. The presented results show that, on long documents, a shorter context paired with a query outperforms a longer context without one.
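The chunking scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the global context (e.g. the document's title or lead paragraph) is tokenized into query tokens that are prepended to every chunk, so each model input stays within a fixed length budget; the function name and parameters are hypothetical.

```python
def chunk_with_query(query_tokens, doc_tokens, max_len=512):
    """Split a long token sequence into chunks, prepending the query
    (global context) to each chunk so every input fits in max_len.

    Hypothetical helper illustrating the chunk-plus-query idea; the
    actual system additionally handles BERT special tokens, etc.
    """
    # Tokens left for document content after reserving room for the query.
    budget = max_len - len(query_tokens)
    if budget <= 0:
        raise ValueError("query leaves no room for document tokens")

    chunks = []
    for start in range(0, len(doc_tokens), budget):
        # Each model input = global-context query + one document slice.
        chunks.append(query_tokens + doc_tokens[start:start + budget])
    return chunks
```

Each chunk is then scored independently by the adapted BERT model, with the shared query supplying the topical context that a chunk alone would lack.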