Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical.