Pre-trained language models such as BERT have exhibited remarkable performance on many natural language understanding (NLU) tasks. The tokens in these models are usually fine-grained: for languages like English they are words or sub-words, and for languages like Chinese they are characters. In English, for example, there are multi-word expressions that form natural lexical units, so coarse-grained tokenization also appears reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for the learning of pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), built on both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder to process the sequence of words and another encoder to process the sequence of phrases, shares parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD, and RACE. The results show that AMBERT outperforms BERT in all cases, and the improvements are particularly significant for Chinese. We also develop a method to improve the inference efficiency of AMBERT, which still performs better than BERT at the same computational cost.
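To make the two-encoder, shared-parameter design concrete, the following is a minimal PyTorch sketch of the idea described above: a single Transformer encoder whose weights are shared across a fine-grained (word) view and a coarse-grained (phrase) view of the input. Class and parameter names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AMBERTSketch(nn.Module):
    """Sketch of AMBERT's multi-grained encoding (hypothetical names)."""

    def __init__(self, fine_vocab_size, coarse_vocab_size,
                 d_model=768, nhead=12, num_layers=12):
        super().__init__()
        # Separate embedding tables: the two granularities use different vocabularies.
        self.fine_embed = nn.Embedding(fine_vocab_size, d_model)
        self.coarse_embed = nn.Embedding(coarse_vocab_size, d_model)
        # One encoder stack; its parameters are shared between the two views,
        # mirroring the "shared parameters between the two encoders" in the abstract.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, fine_ids, coarse_ids):
        # Contextualized representations of fine-grained tokens (e.g., words).
        fine_repr = self.shared_encoder(self.fine_embed(fine_ids))
        # Contextualized representations of coarse-grained tokens (e.g., phrases),
        # produced by the same encoder weights.
        coarse_repr = self.shared_encoder(self.coarse_embed(coarse_ids))
        return fine_repr, coarse_repr
```

Sharing the encoder weights keeps the parameter count close to that of a single BERT while still yielding two sequences of contextualized representations, one per granularity.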