Pretrained language models (PLMs) have achieved remarkable improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and completely ignore word information. Although Whole Word Masking can alleviate this, word-level semantics are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new state-of-the-art (SOTA) performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code and model have been released here~\footnote{https://github.com/xnliang98/MigBERT}.