Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization of language models, as it provides multiple benefits. However, this process is based solely on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. At the other extreme, pure character-level models, though robust to misspellings, often produce unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models such as BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement for the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We further integrate it with BERT through pre-training while keeping the BERT transformer parameters fixed, which makes the method practical. Finally, we show that incorporating our module into mBERT significantly improves performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
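To make the drop-in idea concrete, below is a minimal PyTorch sketch of a character-based subword embedding module: it maps the character sequence of each subword to a single vector of the size a frozen BERT transformer expects. The class name `Char2SubwordEmbedding`, the Transformer-encoder with mean pooling, and all dimensions are illustrative assumptions for exposition, not the paper's exact char2subword architecture.

```python
# A minimal sketch of a character-to-subword embedding module (hypothetical names
# and sizes; not the paper's exact char2subword architecture).
import torch
import torch.nn as nn


class Char2SubwordEmbedding(nn.Module):
    """Maps the character sequence of each subword to one embedding vector,
    serving as a drop-in replacement for a subword embedding lookup table."""

    def __init__(self, char_vocab_size: int, char_dim: int = 64,
                 hidden_dim: int = 768, num_layers: int = 2, max_chars: int = 20):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_chars, char_dim)
        layer = nn.TransformerEncoderLayer(d_model=char_dim, nhead=4,
                                           dim_feedforward=4 * char_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Project the pooled character representation to the embedding size the
        # frozen BERT transformer expects (e.g., 768 for BERT-base).
        self.proj = nn.Linear(char_dim, hidden_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_subwords, max_chars), 0 = padding; each subword is
        # assumed to contain at least one non-padding character.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embed(char_ids) + self.pos_embed(positions).unsqueeze(0)
        pad_mask = char_ids.eq(0)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padded character positions, then project.
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        pooled = x.masked_fill(pad_mask.unsqueeze(-1), 0.0).sum(dim=1) / lengths
        return self.proj(pooled)  # (num_subwords, hidden_dim)
```

In such a setup, the vectors produced by this module would be fed to the transformer layers in place of lookups from the subword embedding table, while those transformer parameters stay frozen during pre-training.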