In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in both discriminative and generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which yields large gains in performance (up to 9.26 points in F1) over SOTA when the LM pre-trained with KBIR is fine-tuned for keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, which reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, and achieve performance comparable to SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.
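To make the KeyBART pre-training target concrete, the following is a minimal sketch, assuming the CatSeq format simply concatenates a document's keyphrases with a separator token; the separator choice and the helper name `build_keybart_example` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of pairing an input document with a KeyBART-style
# CatSeq target (keyphrases joined by an assumed separator token).
from typing import List, Tuple

SEP = ";"  # assumed keyphrase separator in the CatSeq output sequence


def build_keybart_example(document: str, keyphrases: List[str]) -> Tuple[str, str]:
    """Return (encoder input, decoder target) for one pre-training example."""
    target = f" {SEP} ".join(keyphrases)
    return document, target


doc = "Transformer language models learn contextual representations of text ..."
phrases = ["transformer language models", "contextual representations"]
src, tgt = build_keybart_example(doc, phrases)
# tgt == "transformer language models ; contextual representations"
```

Unlike standard BART denoising, where the decoder reconstructs the original input, here the decoder is trained to emit only the keyphrase sequence, which is the distinction the abstract describes.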