Modern pre-trained language models are mostly built upon backbones that stack self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space with billions of candidates, while training a single candidate model from scratch already requires a huge computational cost, making it unaffordable to search such a space by directly training a large number of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves a score of 79.8 on the GLUE test set, 1.8 points higher than the strong baseline ELECTRA-small.
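As a rough illustration of the search procedure summarized above, the sketch below samples candidate layer orders from a layer type set containing self-attention, feed-forward, and convolution, then evolves them toward higher pre-training accuracy. It is a minimal sketch under assumed settings: the `proxy_accuracy` stub (which would, in the actual method, score a candidate whose weights are inherited from the pre-trained supernet), the layer-type abbreviations, and all hyperparameters are hypothetical placeholders, not the paper's implementation.

```python
import random

# Hypothetical layer type set: self-attention, feed-forward, convolution.
LAYER_TYPES = ["att", "ffn", "conv"]
NUM_LAYERS = 12          # assumed candidate depth
POPULATION_SIZE = 20     # assumed evolutionary-search hyperparameters
NUM_GENERATIONS = 10
MUTATION_PROB = 0.1


def random_architecture():
    """Sample a candidate layer order uniformly from the layer type set."""
    return [random.choice(LAYER_TYPES) for _ in range(NUM_LAYERS)]


def proxy_accuracy(arch):
    """Placeholder for the pre-training accuracy of a candidate whose weights
    are inherited from the supernet; a random score stands in for it here."""
    return random.random()


def mutate(arch):
    """Resample each layer type independently with a small probability."""
    return [random.choice(LAYER_TYPES) if random.random() < MUTATION_PROB else t
            for t in arch]


def evolutionary_search():
    """Evolve a population of layer orders guided by the proxy score."""
    population = [random_architecture() for _ in range(POPULATION_SIZE)]
    for _ in range(NUM_GENERATIONS):
        ranked = sorted(population, key=proxy_accuracy, reverse=True)
        parents = ranked[: POPULATION_SIZE // 2]  # keep the fittest half
        children = [mutate(random.choice(parents))
                    for _ in range(POPULATION_SIZE - len(parents))]
        population = parents + children
    return max(population, key=proxy_accuracy)


if __name__ == "__main__":
    print("best layer order found:", evolutionary_search())
```

In the real search, the supernet makes this loop affordable: each candidate is scored by inheriting supernet weights rather than being pre-trained from scratch.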