In this work, we explore how to train task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in both discriminative and generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which yields large gains in performance (up to 8.16 points in F1) over SOTA when the LM pre-trained with KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, which reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, achieving performance comparable to SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.
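To make the discriminative use case concrete, the following is a minimal sketch (not taken from the paper) of how a pre-trained encoder of this kind could be fine-tuned for keyphrase extraction framed as B/I/O token classification with the HuggingFace transformers library; the checkpoint identifier "bloomberg/KBIR" and the three-label scheme are assumptions made only for illustration.

    # Minimal sketch, assuming the HuggingFace transformers library and that a
    # KBIR-style encoder checkpoint (name below is an assumption) is available.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    model_name = "bloomberg/KBIR"  # assumed checkpoint identifier
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Three labels for keyphrase extraction as sequence labeling: B, I, O.
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

    text = "We explore pre-training objectives for learning rich keyphrase representations."
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    with torch.no_grad():
        logits = model(**inputs).logits                   # shape: (1, seq_len, 3)
    predicted_labels = logits.argmax(dim=-1).squeeze(0)   # per-token B/I/O label ids

    # The token-classification head added above is randomly initialized, so the
    # model must first be fine-tuned on labeled keyphrase extraction data
    # (e.g., with the transformers Trainer) before these predictions are useful.

For the generative setting, an analogous sketch would load a seq2seq checkpoint (e.g., a KeyBART-style model) with AutoModelForSeq2SeqLM and decode a single output sequence of keyphrases concatenated in the CatSeq format, rather than predicting per-token labels.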