We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. For each of these downstream tasks, the corresponding dataset is either the most multilingual available or among the most multilingual. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.