A Large-Scale Dataset for Biomedical Keyphrase Generation

Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and those that do are too small to train generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for both present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
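The abstract distinguishes present keyphrases from absent ones. The following is a minimal sketch of that distinction, assuming the common convention that a keyphrase is "present" when its token sequence occurs contiguously in the document text and "absent" otherwise. The function names and the toy example are hypothetical and not taken from kp-biomed. The matching here is case-insensitive but skips stemming, which evaluations in this area often apply.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplification; no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(haystack, needle):
    """Return True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def split_present_absent(document, keyphrases):
    """Partition keyphrases by whether their tokens occur verbatim in the document."""
    doc_tokens = tokenize(document)
    present, absent = [], []
    for phrase in keyphrases:
        target = present if contains_sequence(doc_tokens, tokenize(phrase)) else absent
        target.append(phrase)
    return present, absent


# Hypothetical toy example, not taken from kp-biomed.
abstract = "We introduce a large-scale dataset for biomedical keyphrase generation."
gold = ["keyphrase generation", "Large-Scale dataset", "natural language processing"]
present, absent = split_present_absent(abstract, gold)
print(present)
print(absent)
```

In this toy case the first two phrases count as present because their tokens appear in order in the text, while the third must be generated without any surface match, which is what makes absent keyphrases the harder case.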


