Transformer-based pre-trained models have advanced rapidly in recent years, becoming one of the most important backbones in natural language processing. Recent work suggests that the attention mechanism inside the Transformer may not be necessary; both convolutional neural networks and multi-layer perceptron based models have been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation decoupled from the other tokens. The original model performs well in domain-specific text classification under supervised training; however, its potential for learning transferable knowledge through self-supervised training has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks, in both English and Chinese. In terms of efficiency, instead of the quadratic complexity of Transformer-based models, our model has linear complexity and runs more efficiently during inference. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
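To make the architectural claim concrete, the snippet below is a minimal sketch, under our own assumptions, of one graph recurrent layer of the kind described above: each token state is updated from a small local window of neighbours and from a sentence-level node that is kept separate from the token states. The class name `GraphRecurrentLayer`, the window size, and the exact gating are illustrative, not the authors' released implementation; the point is that per-layer cost grows linearly with sequence length.

```python
# Illustrative sketch of a graph recurrent layer with local token-level
# communication plus a decoupled sentence-level node (assumed design, not
# the paper's exact architecture).
import torch
import torch.nn as nn


class GraphRecurrentLayer(nn.Module):
    def __init__(self, hidden_size: int, window: int = 1):
        super().__init__()
        self.window = window
        # Gate and candidate projections over [token, local context, sentence state].
        self.gate = nn.Linear(3 * hidden_size, hidden_size)
        self.cand = nn.Linear(3 * hidden_size, hidden_size)
        # Separate gated update for the sentence-level node.
        self.sent_gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h: torch.Tensor, s: torch.Tensor):
        # h: (batch, seq_len, hidden)  token states
        # s: (batch, hidden)           sentence-level state
        # Local context: mean of neighbour states inside +/- window.
        # torch.roll wraps around at the boundaries; a simplification kept
        # for brevity (real code would pad instead).
        shifted = [torch.roll(h, shifts=k, dims=1)
                   for k in range(-self.window, self.window + 1) if k != 0]
        local = torch.stack(shifted, dim=0).mean(dim=0)

        s_expand = s.unsqueeze(1).expand_as(h)
        inp = torch.cat([h, local, s_expand], dim=-1)
        g = torch.sigmoid(self.gate(inp))
        c = torch.tanh(self.cand(inp))
        h_new = g * c + (1.0 - g) * h          # gated token update

        # Sentence node reads a pooled summary of the updated token states.
        pooled = h_new.mean(dim=1)
        gs = torch.sigmoid(self.sent_gate(torch.cat([s, pooled], dim=-1)))
        s_new = gs * torch.tanh(pooled) + (1.0 - gs) * s
        return h_new, s_new


# Usage: every token only communicates with its window and the sentence
# node, so the cost per layer is linear in sequence length.
layer = GraphRecurrentLayer(hidden_size=64, window=1)
h0 = torch.randn(2, 16, 64)   # batch of 2 sequences, 16 tokens each
s0 = torch.zeros(2, 64)
h1, s1 = layer(h0, s0)
```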