ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces. We describe our implementation and the possible use cases of our tool.