Datasets: A Community Library for Natural Language Processing

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, Thomas Wolf

From arXiv, EMNLP Demo 2021

The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.
