The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. is conducted on smaller, homogeneous datasets. To this end, we present The MiniPile Challenge, in which one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using $k$-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train a BERT and a T5 model, yielding a performance drop of only $1.9\%$/$2.5\%$ on the GLUE and SNI benchmarks compared to the original checkpoints, which were pre-trained on $2.6\times$/$745\times$ the amount of data. MiniPile is available at https://huggingface.co/datasets/JeanKaddour/minipile.
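As a rough illustration of the three-step curation pipeline, the sketch below embeds a small slice of the deduplicated Pile, clusters the embeddings with $k$-means, and drops documents belonging to excluded clusters. It is a minimal sketch under stated assumptions: the embedding model, slice size, cluster count, and excluded cluster IDs are illustrative placeholders, not the paper's exact settings.

```python
# Minimal sketch of the three-step filtering pipeline, assuming a
# sentence-transformers embedding model and scikit-learn's k-means.
# Model name, slice size, cluster count, and excluded cluster IDs are
# illustrative placeholders only.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# (1) Infer embeddings for (a small slice of) the deduplicated Pile.
docs = load_dataset("EleutherAI/the_pile_deduplicated", split="train[:10000]")
encoder = SentenceTransformer("intfloat/e5-large")  # assumed embedding model
embeddings = encoder.encode(
    ["passage: " + d["text"][:2048] for d in docs],  # truncate long documents
    batch_size=64,
    normalize_embeddings=True,
)

# (2) Cluster the embedding space with k-means.
kmeans = KMeans(n_clusters=220, random_state=0).fit(embeddings)

# (3) Filter out low-quality clusters. In practice, cluster quality is judged
# by inspecting documents nearest to each centroid; the IDs here are dummies.
excluded_clusters = {3, 17, 42}
keep_idx = [i for i, label in enumerate(kmeans.labels_)
            if label not in excluded_clusters]
minipile_like = docs.select(keep_idx)
```

In this setup, the per-cluster inspection step (deciding which cluster IDs go into the excluded set) is manual; only the embedding, clustering, and final filtering are automated.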