Most Chinese pre-trained models adopt characters as the basic units for downstream tasks. However, these models ignore the information carried by words and thus lose some important semantics. In this paper, we propose a new method to exploit word structure and integrate lexical semantics into the character representations of pre-trained models. Specifically, we project a word's embedding onto the embeddings of its internal characters according to similarity weights. To strengthen word boundary information, we mix the representations of the internal characters within each word. We then apply a word-to-character alignment attention mechanism that emphasizes important characters by masking unimportant ones. Moreover, to reduce the error propagation caused by word segmentation, we present an ensemble approach that combines the segmentation results of different tokenizers. Experimental results show that our approach achieves superior performance over the baseline pre-trained models BERT, BERT-wwm, and ERNIE on different Chinese NLP tasks: sentiment classification, sentence pair matching, natural language inference, and machine reading comprehension. Further analysis demonstrates the effectiveness of each component of our model.
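To make the word-to-character fusion concrete, the sketch below illustrates one plausible reading of the similarity-weighted projection and the subsequent mixing step. It is not the authors' implementation: the dot-product similarity, the additive projection, and the mean-based mixing are all assumptions introduced here for illustration only.

```python
# Minimal sketch (not the paper's code) of similarity-weighted word-to-character
# fusion: a word embedding is distributed over its internal characters, and the
# characters inside the word are then mixed to strengthen boundary information.
import torch
import torch.nn.functional as F


def fuse_word_into_chars(char_embs: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
    """char_embs: (n_chars, d) embeddings of the characters inside one word.
    word_emb:  (d,) embedding of the whole word (e.g. from an external word table).
    Returns fused character representations of shape (n_chars, d)."""
    # Similarity weight of the word with respect to each internal character
    # (dot product + softmax is an assumption, not the paper's exact formula).
    sim = char_embs @ word_emb                                # (n_chars,)
    weights = F.softmax(sim, dim=0)                           # normalized similarity weights
    # Project the word embedding onto each character in proportion to its weight.
    enriched = char_embs + weights.unsqueeze(-1) * word_emb   # (n_chars, d)
    # Mix the internal characters' representations within the word
    # (here via the word-level mean; the mixing function is assumed).
    mixed = enriched + enriched.mean(dim=0, keepdim=True)
    return mixed


# Toy usage: a two-character word with 768-dimensional embeddings.
chars = torch.randn(2, 768)
word = torch.randn(768)
print(fuse_word_into_chars(chars, word).shape)  # torch.Size([2, 768])
```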