State-of-the-art keyphrase generation methods generally depend on large annotated datasets, limiting their performance in domains with limited annotated data. To overcome this challenge, we design a data-oriented approach that first identifies salient information using unsupervised corpus-level statistics, and then learns a task-specific intermediate representation based on a pre-trained language model. We introduce salient span recovery and salient span prediction as denoising training objectives that condense the intra-article and inter-article knowledge essential for keyphrase generation. Through experiments on multiple keyphrase generation benchmarks, we show the effectiveness of the proposed approach for facilitating low-resource and zero-shot keyphrase generation. We further observe that the method especially benefits the generation of absent keyphrases, approaching the performance of models trained with large training sets.
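To make the shape of the training signal concrete, below is a minimal, self-contained sketch of how a salient-span denoising pair could be constructed. It assumes TF-IDF as the unsupervised corpus-level statistic, unigram spans, and a generic `<mask>` token; these are illustrative choices, not the paper's actual pipeline, which operates on pre-trained language model inputs rather than toy strings.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for a collection of scientific articles.
corpus = [
    "keyphrase generation with pretrained language models",
    "denoising objectives for low resource keyphrase generation",
    "graph neural networks for text classification",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Corpus-level statistic: TF-IDF per token (an illustrative stand-in for
# the paper's unsupervised corpus-level statistics).
doc_tokens = [tokenize(d) for d in corpus]
df = Counter(t for toks in doc_tokens for t in set(toks))
n_docs = len(corpus)

def tfidf(token, toks):
    tf = toks.count(token) / len(toks)
    idf = math.log(n_docs / df[token])
    return tf * idf

def salient_spans(toks, k=2):
    """Pick the k highest-scoring tokens as (unigram) salient spans."""
    scored = sorted(set(toks), key=lambda t: tfidf(t, toks), reverse=True)
    return set(scored[:k])

def make_denoising_pair(toks, mask_token="<mask>"):
    """Salient span recovery in miniature: mask the salient spans in the
    input and use them as the reconstruction target, in the spirit of a
    BART/T5-style denoising objective."""
    salient = salient_spans(toks)
    masked = [mask_token if t in salient else t for t in toks]
    targets = [t for t in toks if t in salient]
    return " ".join(masked), " ".join(targets)

for toks in doc_tokens:
    src, tgt = make_denoising_pair(toks)
    print(f"input : {src}\ntarget: {tgt}\n")
```

A salient span prediction variant would instead keep the input intact and train the model to generate the salient spans directly; the sketch above only illustrates how corpus-level salience can drive the choice of which spans to corrupt or predict.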