您的收缩器如何? 多语语言模式的单语表现如何? (How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models) - 专知论文

会员服务 ·

0

Performer · 语言模型化 · 词元分析器 · MoDELS · Performance ·

2021 年 6 月 1 日

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

翻译：您的收缩器如何? 多语语言模式的单语表现如何?

Phillip Rust,Jonas Pfeiffer,Ivan Vulić,Sebastian Ruder,Iryna Gurevych

from arxiv, ACL 2021

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.

翻译：在这项工作中,我们系统地、全面地比较经过训练的多语文模式与单一语文模式的单一语文模式,比较其单一语文任务的业绩。我们研究一套九种类型多样的语文,在五种不同的单一语文下游任务中,研究一套容易获得的经过训练的单一语文模式。我们首先通过公平和有控制的比较,如果该语文的多语文和相应的单一语文代表之间存在差距,我们首先力求建立这种比较,然后调查任何业绩差异的原因。为了分解混杂因素,我们用单一语文和经过多种语文训练的代用品,对同一数据进行新的单语文模式的培训。我们发现,虽然训练前的数据规模是一个重要因素,但指定的单一语文代代用品在下游业绩中也起着同样重要的作用。我们的结果显示,在多语文模式词汇中充分体现的语文的表现比单一语文对应人员少得多。我们进一步发现,用专门的单一语文代用品代替原多语文代用品使几乎所有任务和语文都提高了多语文模式的下游业绩。

0

相关内容

Performer

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

【斯坦福大学AI】BERT, ELMo， & GPT-2:上下文化的单词表示是怎样的?

专知会员服务

35+阅读 · 2020年3月28日

【上海交大-字节跳动】在神经机器翻译中充分利用BERT，Making the Most of BERT in NMT

【上海交大-字节跳动】在神经机器翻译中充分利用BERT，Making the Most of BERT in NMT

专知会员服务

24+阅读 · 2020年3月28日

【微软亚洲研究院】CodeBERT:用于编程和自然语言的预训练模型，CodeBERT: A Pre-Trained Model for Programming and Natural Languages

【微软亚洲研究院】CodeBERT:用于编程和自然语言的预训练模型，CodeBERT: A Pre-Trained Model for Programming and Natural Languages

专知会员服务

32+阅读 · 2020年2月21日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【NLP模型的跨语言/跨领域迁移】《Transferring NLP models across languages and domains》

【NLP模型的跨语言/跨领域迁移】《Transferring NLP models across languages and domains》

专知会员服务

43+阅读 · 2019年11月25日

基于图的word2vec负采样( GNEG:Graph-Based Negative Sampling for word2vec)

基于图的word2vec负采样( GNEG:Graph-Based Negative Sampling for word2vec)

专知会员服务

40+阅读 · 2019年11月23日

【CCL 2019】如何微调BERT进行文本分类？（How to Fine-Tune BERT for Text Classification?）

【CCL 2019】如何微调BERT进行文本分类？（How to Fine-Tune BERT for Text Classification?）

专知会员服务

84+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

想在PyTorch里训练BERT，请试试Facebook跨语言模型XLM

想在PyTorch里训练BERT，请试试Facebook跨语言模型XLM

量子位

3+阅读 · 2019年6月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Call for Participation: Shared Tasks in NLPCC 2019

Call for Participation: Shared Tasks in NLPCC 2019

中国计算机学会

5+阅读 · 2019年3月22日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

R文本分类之RTextTools

R文本分类之RTextTools

R语言中文社区

4+阅读 · 2018年1月17日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond

Arxiv

15+阅读 · 2020年5月13日

Unsupervised Domain Clusters in Pretrained Language Models

Arxiv

11+阅读 · 2020年4月5日

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Arxiv

5+阅读 · 2019年9月26日

Language Models as Knowledge Bases?

Arxiv

6+阅读 · 2019年9月4日

How to Fine-Tune BERT for Text Classification?

How to Fine-Tune BERT for Text Classification?

Arxiv

13+阅读 · 2019年5月14日

Multi-Task Deep Neural Networks for Natural Language Understanding

Multi-Task Deep Neural Networks for Natural Language Understanding

Arxiv

3+阅读 · 2019年1月31日

How Do Source-side Monolingual Word Embeddings Impact Neural Machine Translation?

Arxiv

5+阅读 · 2018年6月5日

Unsupervised Machine Translation Using Monolingual Corpora Only

Arxiv

5+阅读 · 2018年4月13日

Word Translation Without Parallel Data

Arxiv

7+阅读 · 2018年1月30日

Knowledge-based Word Sense Disambiguation using Topic Models

Arxiv

5+阅读 · 2018年1月5日

VIP会员

文章信息

相关主题

语言模型化

词元分析器

相关VIP内容

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

【斯坦福大学AI】BERT, ELMo， & GPT-2:上下文化的单词表示是怎样的?

专知会员服务

35+阅读 · 2020年3月28日

【上海交大-字节跳动】在神经机器翻译中充分利用BERT，Making the Most of BERT in NMT

【上海交大-字节跳动】在神经机器翻译中充分利用BERT，Making the Most of BERT in NMT

专知会员服务

24+阅读 · 2020年3月28日

【微软亚洲研究院】CodeBERT:用于编程和自然语言的预训练模型，CodeBERT: A Pre-Trained Model for Programming and Natural Languages

【微软亚洲研究院】CodeBERT:用于编程和自然语言的预训练模型，CodeBERT: A Pre-Trained Model for Programming and Natural Languages

专知会员服务

32+阅读 · 2020年2月21日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【NLP模型的跨语言/跨领域迁移】《Transferring NLP models across languages and domains》

【NLP模型的跨语言/跨领域迁移】《Transferring NLP models across languages and domains》

专知会员服务

43+阅读 · 2019年11月25日

基于图的word2vec负采样( GNEG:Graph-Based Negative Sampling for word2vec)

基于图的word2vec负采样( GNEG:Graph-Based Negative Sampling for word2vec)

专知会员服务

40+阅读 · 2019年11月23日

【CCL 2019】如何微调BERT进行文本分类？（How to Fine-Tune BERT for Text Classification?）

【CCL 2019】如何微调BERT进行文本分类？（How to Fine-Tune BERT for Text Classification?）

专知会员服务

84+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

《人工智能绝不能完全自主》

《人工智能的法律与伦理：军事自主机器独特挑战的深度剖析》316页

从数据到主导：AI与兵棋推演构筑决策优势

《特洛伊木马货柜：武器化集装箱的战略威胁》最新报告

相关资讯

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

想在PyTorch里训练BERT，请试试Facebook跨语言模型XLM

想在PyTorch里训练BERT，请试试Facebook跨语言模型XLM

量子位

3+阅读 · 2019年6月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Call for Participation: Shared Tasks in NLPCC 2019

Call for Participation: Shared Tasks in NLPCC 2019

中国计算机学会

5+阅读 · 2019年3月22日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

R文本分类之RTextTools

R文本分类之RTextTools

R语言中文社区

4+阅读 · 2018年1月17日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

相关论文

Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond

Arxiv

15+阅读 · 2020年5月13日

Unsupervised Domain Clusters in Pretrained Language Models

Arxiv

11+阅读 · 2020年4月5日

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Arxiv

5+阅读 · 2019年9月26日

Language Models as Knowledge Bases?

Arxiv

6+阅读 · 2019年9月4日

How to Fine-Tune BERT for Text Classification?

How to Fine-Tune BERT for Text Classification?

Arxiv

13+阅读 · 2019年5月14日

Multi-Task Deep Neural Networks for Natural Language Understanding

Multi-Task Deep Neural Networks for Natural Language Understanding

Arxiv

3+阅读 · 2019年1月31日

How Do Source-side Monolingual Word Embeddings Impact Neural Machine Translation?

Arxiv

5+阅读 · 2018年6月5日

Unsupervised Machine Translation Using Monolingual Corpora Only

Arxiv

5+阅读 · 2018年4月13日

Word Translation Without Parallel Data

Arxiv

7+阅读 · 2018年1月30日

Knowledge-based Word Sense Disambiguation using Topic Models

Arxiv

5+阅读 · 2018年1月5日

微信扫码咨询专知VIP会员