BERT模型中培训数据集和字典大小问题:波罗的海语言案例 (Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages) - 专知论文

会员服务 ·

0

MoDELS · BERT · CASE · 掩码语言模型化 · entity ·

2021 年 12 月 20 日

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

翻译：BERT模型中培训数据集和字典大小问题:波罗的海语言案例

Matej Ulčar,Marko Robnik-Šikonja

from arxiv, 12 pages. To be published in proceedings of the AIST 2021 conference

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations.

翻译：研究显示,单语模式比多语种模式产生的结果要好,但培训数据集必须足够大。我们为立陶宛、拉脱维亚和英语培训了一个类似三种语言的LitLat BERT模式,为爱沙尼亚培训了一个单一语言的Est-ROBERTA模式。我们评估了它们在四项下游任务方面的表现:名称实体识别、依赖分析、部分语音标记和词类比。为了分析注重单一语言的重要性和大型培训集的重要性,我们将创建的模式与现有的爱沙尼亚、拉脱维亚和立陶宛的单语和多语言的BERT模式进行比较。结果显示,新建的LitLat BERT和Est-ROBERTA模式改善了大多数情况下所有测试任务的现有模式的结果。

0

相关内容

MoDELS

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【Google-Thang】最新《语言预训练语生成进展》67页ppt，Language Pretraining

【Google-Thang】最新《语言预训练语生成进展》67页ppt，Language Pretraining

专知会员服务

24+阅读 · 2020年9月15日

【Google】平滑对抗训练，Smooth Adversarial Training

【Google】平滑对抗训练，Smooth Adversarial Training

专知会员服务

49+阅读 · 2020年7月4日

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

专知会员服务

51+阅读 · 2020年5月3日

【医学图像处理中的因果性】52页ppt，Causality Matters in Medical Imaging

【医学图像处理中的因果性】52页ppt，Causality Matters in Medical Imaging

专知会员服务

60+阅读 · 2020年3月14日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【清华大学】知识增强的常识性故事生成预训练模型，A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

【清华大学】知识增强的常识性故事生成预训练模型，A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

专知会员服务

52+阅读 · 2020年1月20日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

【O'Reilly AI Conference 2019】应用NLP的医疗科技：功能工程与模型诊断（NLP for healthcare: Feature engineering and model diagnostics），美国医疗保健公司Episource，Manas Ranjan Kar

【O'Reilly AI Conference 2019】应用NLP的医疗科技：功能工程与模型诊断（NLP for healthcare: Feature engineering and model diagnostics），美国医疗保健公司Episource，Manas Ranjan Kar

专知会员服务

8+阅读 · 2019年11月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

【Github】BERT-NER-Pytorch：三种不同模式的BERT中文NER实验

【Github】BERT-NER-Pytorch：三种不同模式的BERT中文NER实验

AINLP

14+阅读 · 2020年1月6日

使用BERT做文本摘要

使用BERT做文本摘要

专知

23+阅读 · 2019年12月7日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

最强NLP预训练模型库PyTorch-Transformers正式开源！支持6个预训练框架，27个预训练模型

最强NLP预训练模型库PyTorch-Transformers正式开源！支持6个预训练框架，27个预训练模型

AI前线

12+阅读 · 2019年7月22日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

Short-answer scoring with ensembles of pretrained language models

Arxiv

0+阅读 · 2022年2月23日

Masked prediction tasks: a parameter identifiability view

Arxiv

0+阅读 · 2022年2月18日

Pretrained Transformers Improve Out-of-Distribution Robustness

Arxiv

5+阅读 · 2020年4月13日

Unsupervised Domain Clusters in Pretrained Language Models

Arxiv

11+阅读 · 2020年4月5日

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Arxiv

3+阅读 · 2020年3月24日

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Arxiv

14+阅读 · 2019年6月19日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Multi-task learning to improve natural language understanding

Arxiv

4+阅读 · 2018年12月17日

Joint Training for Neural Machine Translation Models with Monolingual Data

Arxiv

4+阅读 · 2018年3月1日

Investigations on Knowledge Base Embedding for Relation Prediction and Extraction

Arxiv

8+阅读 · 2018年2月6日

VIP会员

文章信息

相关主题

掩码语言模型化

相关VIP内容

【Google-Thang】最新《语言预训练语生成进展》67页ppt，Language Pretraining

【Google-Thang】最新《语言预训练语生成进展》67页ppt，Language Pretraining

专知会员服务

24+阅读 · 2020年9月15日

【Google】平滑对抗训练，Smooth Adversarial Training

【Google】平滑对抗训练，Smooth Adversarial Training

专知会员服务

49+阅读 · 2020年7月4日

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

专知会员服务

51+阅读 · 2020年5月3日

【医学图像处理中的因果性】52页ppt，Causality Matters in Medical Imaging

【医学图像处理中的因果性】52页ppt，Causality Matters in Medical Imaging

专知会员服务

60+阅读 · 2020年3月14日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【清华大学】知识增强的常识性故事生成预训练模型，A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

【清华大学】知识增强的常识性故事生成预训练模型，A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

专知会员服务

52+阅读 · 2020年1月20日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

【O'Reilly AI Conference 2019】应用NLP的医疗科技：功能工程与模型诊断（NLP for healthcare: Feature engineering and model diagnostics），美国医疗保健公司Episource，Manas Ranjan Kar

【O'Reilly AI Conference 2019】应用NLP的医疗科技：功能工程与模型诊断（NLP for healthcare: Feature engineering and model diagnostics），美国医疗保健公司Episource，Manas Ranjan Kar

专知会员服务

8+阅读 · 2019年11月6日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

热门VIP内容

开通专知VIP会员享更多权益服务

《物联网（IoT）中的无人机通信高效控制》135页

《在GNSS信号降级环境中利用共识实现无人机集群稳健协调》

中程单向攻击无人机的战略意义：俄乌战争启示

《面向无人机集群的避障动态传感器覆盖算法》最新38页

相关资讯

【Github】BERT-NER-Pytorch：三种不同模式的BERT中文NER实验

【Github】BERT-NER-Pytorch：三种不同模式的BERT中文NER实验

AINLP

14+阅读 · 2020年1月6日

使用BERT做文本摘要

使用BERT做文本摘要

专知

23+阅读 · 2019年12月7日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

最强NLP预训练模型库PyTorch-Transformers正式开源！支持6个预训练框架，27个预训练模型

最强NLP预训练模型库PyTorch-Transformers正式开源！支持6个预训练框架，27个预训练模型

AI前线

12+阅读 · 2019年7月22日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

相关论文

Short-answer scoring with ensembles of pretrained language models

Arxiv

0+阅读 · 2022年2月23日

Masked prediction tasks: a parameter identifiability view

Arxiv

0+阅读 · 2022年2月18日

Pretrained Transformers Improve Out-of-Distribution Robustness

Arxiv

5+阅读 · 2020年4月13日

Unsupervised Domain Clusters in Pretrained Language Models

Arxiv

11+阅读 · 2020年4月5日

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Arxiv

3+阅读 · 2020年3月24日

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Arxiv

14+阅读 · 2019年6月19日

Pre-Training with Whole Word Masking for Chinese BERT

Arxiv

11+阅读 · 2019年6月19日

Multi-task learning to improve natural language understanding

Arxiv

4+阅读 · 2018年12月17日

Joint Training for Neural Machine Translation Models with Monolingual Data

Arxiv

4+阅读 · 2018年3月1日

Investigations on Knowledge Base Embedding for Relation Prediction and Extraction

Arxiv

8+阅读 · 2018年2月6日

微信扫码咨询专知VIP会员