Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks - Zhuanzhi Papers

Multimodal · Datasets · Diversity · Benchmarks · State-of-the-art

October 26, 2022

Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack

From arXiv: 14 pages, 1 figure, 3 tables; accepted to and presented at EMNLP 2022

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
