Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and those that do are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for both present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
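To make the present/absent distinction concrete, here is a minimal illustrative sketch, not the authors' evaluation code. It assumes the common convention that a keyphrase is "present" when its tokens occur as a contiguous sequence in the document and "absent" otherwise. The function names (`tokenize`, `contains_sequence`, `split_present_absent`) and the sample text are hypothetical, and published evaluations usually stem words before matching, which this sketch omits.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs contiguously in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def split_present_absent(document, keyphrases):
    """Partition keyphrases into (present, absent) relative to the document.

    Assumption: exact token matching without stemming.
    """
    doc_tokens = tokenize(document)
    present, absent = [], []
    for keyphrase in keyphrases:
        if contains_sequence(doc_tokens, tokenize(keyphrase)):
            present.append(keyphrase)
        else:
            absent.append(keyphrase)
    return present, absent


# Hypothetical example text and reference keyphrases.
document = (
    "We introduce a large-scale dataset for keyphrase generation "
    "built from PubMed abstracts."
)
keyphrases = ["keyphrase generation", "PubMed", "biomedical text mining"]

present, absent = split_present_absent(document, keyphrases)
print("present:", present)
print("absent:", absent)
```

Under these assumptions, "keyphrase generation" and "PubMed" count as present because they appear in the text, while "biomedical text mining" counts as absent and has to be produced by the model without being copied from the input.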