NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web, such as WikiBio, are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio, a new evaluation set for WikiBio, composed of structured attribute lists describing fictional individuals, each mapped to a natural language biography. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
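To make the data format and the seed-then-edit workflow concrete, the sketch below is a hypothetical Python rendering, not the authors' actual pipeline: the attribute names, the `attribute_list_to_prompt` helper, and the prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SynthBioExample:
    """One entry: a structured attribute list describing a fictional
    person, paired with the natural language biography written for it."""
    attributes: dict[str, str]   # e.g. {"name": ..., "nationality": ...}
    seed_biography: str = ""     # draft produced by the language model
    final_biography: str = ""    # human-edited version kept in the dataset


def attribute_list_to_prompt(attributes: dict[str, str]) -> str:
    """Serialize the attribute list into a prompt asking a language model
    for a draft biography that a human rater will later edit
    (hypothetical wording)."""
    lines = [f"{key}: {value}" for key, value in attributes.items()]
    return ("Write a short biography for the fictional person below.\n"
            + "\n".join(lines))


# Usage: build a prompt, obtain a seed generation from a large language
# model, then hand the draft to a rater whose edit becomes the final text.
example = SynthBioExample(
    attributes={"name": "Ana Petrova",
                "nationality": "Bulgarian",
                "occupation": "astronomer"}
)
prompt = attribute_list_to_prompt(example.attributes)
example.seed_biography = "Ana Petrova is a Bulgarian astronomer ..."        # model output (placeholder)
example.final_biography = "Ana Petrova (born 1952) is a Bulgarian astronomer ..."  # rater's edit (placeholder)
```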