Chinese pre-trained language models usually process text as a sequence of characters, while ignoring coarser granularities, e.g., words. In this work, we propose a novel pre-training paradigm for Chinese -- Lattice-BERT, which explicitly incorporates word representations along with characters, and thus can model a sentence in a multi-granularity manner. Specifically, we construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers. We design a lattice position attention mechanism to exploit the lattice structures in self-attention layers. We further propose a masked segment prediction task to push the model to learn from the rich but redundant information inherent in lattices, while avoiding learning unexpected tricks. Experiments on 11 Chinese natural language understanding tasks show that our model brings an average improvement of 1.5% under the 12-layer setting, achieving a new state of the art among base-size models on the CLUE benchmarks. Further analysis shows that Lattice-BERT can harness the lattice structures, and the improvement comes from the exploitation of redundant information and multi-granularity representations. Our code will be available at https://github.com/alibaba/pretrained-language-models/LatticeBERT.
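To make the lattice construction concrete, below is a minimal sketch (not the authors' implementation) of how characters and lexicon words in a sentence can be collected into overlapping units with span positions; the `build_lattice` function and the small lexicon are hypothetical, and the resulting (start, end) spans are the kind of information a lattice position attention mechanism could consume.

```python
# Minimal sketch of character-word lattice construction for a Chinese sentence.
# The lexicon here is a toy, hypothetical word list; in practice it would come
# from a large word vocabulary.

def build_lattice(sentence, lexicon, max_word_len=4):
    """Return lattice units as (text, start, end) spans over the sentence.

    Every character is a unit; every lexicon word found in the sentence is
    added as an additional unit, so spans may overlap -- this overlap is the
    rich but redundant information the abstract refers to.
    """
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]  # character units
    for start in range(len(sentence)):
        for end in range(start + 2, min(start + max_word_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word in lexicon:
                units.append((word, start, end))  # word unit spanning [start, end)
    return units


if __name__ == "__main__":
    lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical lexicon
    # Characters plus overlapping words form the lattice for the sentence.
    for text, start, end in build_lattice("研究生命的起源", lexicon):
        print(f"{text}\t[{start}, {end})")
```

In this sketch, both the word 研究生 and the words 研究 / 生命 survive as lattice units over the same characters, so the model sees multiple plausible segmentations at once rather than committing to a single word sequence.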