以字符为基础的语言模式能否改善低资源和噪音语言情景下游任务绩效? (Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?) - 专知论文

会员服务 ·

0

语言模型化 · Performer · MoDELS · Extensibility · 下游任务 ·

2021 年 10 月 26 日

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

翻译：以字符为基础的语言模式能否改善低资源和噪音语言情景下游任务绩效?

Arij Riabi,Benoît Sagot,Djamé Seddah

from arxiv, Camera ready version. Accepted to WNUT 2021

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank of this language leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results a on much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability set-tings.

翻译：最近,国家语言语言方案取得了令人印象深刻的改进,主要基于背景神经语言模式的成功,这些改进大多表现在最多几十种高资源语言上。为非标准化和低资源语言建立语言模式以及更广泛的非标准化语言和低资源语言国家语言方案系统仍然是一项艰巨的任务。在这项工作中,我们侧重于使用拉丁文字(称为NArabizi,主要见于社交媒体和信息通信)扩展书写北非口语方言阿拉伯语。在这种低资源情景中,通过显示高度变异性的数据,我们比较了部分语音标记和依赖的基于字符的语言模式的下游性能与单语和多语言模式的下游性能。我们表明,仅就NArabizi的99k句句子进行了基于字符的训练,并在这一语言的小树库中作了微调调整,其表现接近于对大型多语种和单一语言模式进行预先培训的同一架构所获得的效果。我们确认这些结果是在大量密集法语用户生成内容的数据集上,我们认为,这种基于字符语言的模型可以成为低资源和高语言变异性国家语言的财富资产。

0

相关内容

语言模型化

语言模型化

预训练语言模型fine-tuning近期进展概述

预训练语言模型fine-tuning近期进展概述

专知会员服务

36+阅读 · 2021年4月9日

【EMNLP2020-CMU&字节跳动】基于预训练语言模型的句子嵌入研究

【EMNLP2020-CMU&字节跳动】基于预训练语言模型的句子嵌入研究

专知会员服务

22+阅读 · 2020年11月14日

【CMU-TACL2020】低资源跨语言实体链接，Low-resource Crosslingual EntityLinking

专知会员服务

16+阅读 · 2020年3月29日

【Amazon】使用预先训练的Transformer模型进行数据增强，Data Augmentation using Pre-trained Transformer Models

【Amazon】使用预先训练的Transformer模型进行数据增强，Data Augmentation using Pre-trained Transformer Models

专知会员服务

49+阅读 · 2020年3月7日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

52+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

45+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

31+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

56+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

167+阅读 · 2019年10月11日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

64+阅读 · 2019年10月9日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

23+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

25+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

17+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

41+阅读 · 2019年1月3日

Facebook PyText 在 Github 上开源了

Facebook PyText 在 Github 上开源了

AINLP

7+阅读 · 2018年12月14日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey

Arxiv

30+阅读 · 2021年11月1日

Pre-trained Language Model based Ranking in Baidu Search

Arxiv

9+阅读 · 2021年6月25日

Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

Arxiv

3+阅读 · 2021年6月11日

PTR: Prompt Tuning with Rules for Text Classification

Arxiv

7+阅读 · 2021年5月24日

A Baseline for Few-Shot Image Classification

Arxiv

7+阅读 · 2020年3月1日

Zero-Resource Cross-Lingual Named Entity Recognition

Arxiv

5+阅读 · 2019年11月22日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Arxiv

3+阅读 · 2019年9月12日

Language Models as Knowledge Bases?

Arxiv

6+阅读 · 2019年9月4日

Investigating Meta-Learning Algorithms for Low-Resource Natural Language Understanding Tasks

Arxiv

5+阅读 · 2019年8月27日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

预训练语言模型fine-tuning近期进展概述

预训练语言模型fine-tuning近期进展概述

专知会员服务

36+阅读 · 2021年4月9日

【EMNLP2020-CMU&字节跳动】基于预训练语言模型的句子嵌入研究

【EMNLP2020-CMU&字节跳动】基于预训练语言模型的句子嵌入研究

专知会员服务

22+阅读 · 2020年11月14日

【CMU-TACL2020】低资源跨语言实体链接，Low-resource Crosslingual EntityLinking

专知会员服务

16+阅读 · 2020年3月29日

【Amazon】使用预先训练的Transformer模型进行数据增强，Data Augmentation using Pre-trained Transformer Models

【Amazon】使用预先训练的Transformer模型进行数据增强，Data Augmentation using Pre-trained Transformer Models

专知会员服务

49+阅读 · 2020年3月7日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

52+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

45+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

31+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

56+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

167+阅读 · 2019年10月11日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

64+阅读 · 2019年10月9日

热门VIP内容

相关资讯

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

23+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

25+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

17+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

41+阅读 · 2019年1月3日

Facebook PyText 在 Github 上开源了

Facebook PyText 在 Github 上开源了

AINLP

7+阅读 · 2018年12月14日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

相关论文

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey

Arxiv

30+阅读 · 2021年11月1日

Pre-trained Language Model based Ranking in Baidu Search

Arxiv

9+阅读 · 2021年6月25日

Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

Arxiv

3+阅读 · 2021年6月11日

PTR: Prompt Tuning with Rules for Text Classification

Arxiv

7+阅读 · 2021年5月24日

A Baseline for Few-Shot Image Classification

Arxiv

7+阅读 · 2020年3月1日

Zero-Resource Cross-Lingual Named Entity Recognition

Arxiv

5+阅读 · 2019年11月22日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Arxiv

3+阅读 · 2019年9月12日

Language Models as Knowledge Bases?

Arxiv

6+阅读 · 2019年9月4日

Investigating Meta-Learning Algorithms for Low-Resource Natural Language Understanding Tasks

Arxiv

5+阅读 · 2019年8月27日

微信扫码咨询专知VIP会员