The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the largest multilingual language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of research publications spanning topics from ethics to law, data governance, modeling choices, and distributed training. This paper focuses on the collaborative research aspects of BigScience and takes a step back to look at the challenges of large-scale participatory research, with respect to participant diversity and the tasks required to successfully carry out such a project. Our main goal is to share the lessons we learned from this experience: what we did well and what we could have done better. We show how the impact of such a social approach to scientific research goes well beyond the technical artifacts that were the basis of its inception.