NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web, such as WikiBio, are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio, a new evaluation set for WikiBio, composed of structured attribute lists describing fictional individuals, each mapped to a natural language biography. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
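To make the data format and the seed-then-edit workflow concrete, the sketch below is a hypothetical Python rendering, not the authors' actual pipeline: the attribute names, the `attribute_list_to_prompt` helper, and the prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SynthBioExample:
    """One entry: a structured attribute list describing a fictional
    person, paired with the natural language biography written for it."""
    attributes: dict[str, str]   # e.g. {"name": ..., "nationality": ...}
    seed_biography: str = ""     # draft produced by the language model
    final_biography: str = ""    # human-edited version kept in the dataset


def attribute_list_to_prompt(attributes: dict[str, str]) -> str:
    """Serialize the attribute list into a prompt asking a language model
    for a draft biography that a human rater will later edit
    (hypothetical wording)."""
    lines = [f"{key}: {value}" for key, value in attributes.items()]
    return ("Write a short biography for the fictional person below.\n"
            + "\n".join(lines))


# Usage: build a prompt, obtain a seed generation from a large language
# model, then hand the draft to a rater whose edit becomes the final text.
example = SynthBioExample(
    attributes={"name": "Ana Petrova",
                "nationality": "Bulgarian",
                "occupation": "astronomer"}
)
prompt = attribute_list_to_prompt(example.attributes)
example.seed_biography = "Ana Petrova is a Bulgarian astronomer ..."        # model output (placeholder)
example.final_biography = "Ana Petrova (born 1952) is a Bulgarian astronomer ..."  # rater's edit (placeholder)
```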