Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance on various natural language processing (NLP) tasks. However, the conventional paradigm constructs the backbone by purely stacking manually designed global self-attention layers, which introduces an inductive bias and thus leads to sub-optimal architectures. In this work, we propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures. Our well-designed search space (i) contains primitive math operations at the intra-layer level to explore novel attention structures, and (ii) leverages convolution blocks at the inter-layer level as a supplement to the attention structures, so as to better capture local dependencies. We optimize both the search algorithm and the evaluation of candidate models to boost the efficiency of OP-NAS. Specifically, we propose an Operation-Priority (OP) evolution strategy that facilitates model search by balancing exploration and exploitation. Furthermore, we design a Bi-branch Weight-Sharing (BIWS) training strategy for fast model evaluation. Extensive experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities on various downstream tasks, demonstrating the architecture's transfer and generalization abilities. Remarkably, AutoBERT-Zero-base outperforms RoBERTa-base (which uses much more data) and BERT-large (which has a much larger model size) by 2.4 and 1.4 points, respectively, on the GLUE test set. Code and pre-trained models will be made publicly available.
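To make the two-level search space concrete, the following is a minimal sketch, not the authors' implementation: it expresses a standard attention block through a table of primitive math operations (the intra-layer level) and interleaves it with convolution blocks in an example backbone (the inter-layer level). The operation names, the `SearchedAttention`/`ConvBlock` classes, and the layer layout are illustrative assumptions, not the searched AutoBERT-Zero architecture.

```python
# Hypothetical sketch of the hybrid search space; not the OP-NAS code.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Intra-layer level: candidate primitive operations that a searched
# attention structure may combine (illustrative subset).
PRIMITIVE_OPS = {
    "matmul":  lambda a, b: a @ b.transpose(-2, -1),
    "add":     lambda a, b: a + b,
    "mul":     lambda a, b: a * b,
    "softmax": lambda a, _: F.softmax(a, dim=-1),
}

class SearchedAttention(nn.Module):
    """An attention-like block composed from the primitive table
    (here it reproduces scaled dot-product attention)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = PRIMITIVE_OPS["matmul"](q, k) * self.scale
        probs = PRIMITIVE_OPS["softmax"](scores, None)
        return probs @ v

class ConvBlock(nn.Module):
    """Inter-layer level: a lightweight convolution block that supplements
    attention to capture local dependencies."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (batch, seq_len, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

# A hypothetical hybrid backbone: the search decides, per layer, whether to
# place a searched attention structure or a convolution block.
layer_types = ["conv", "attn", "conv", "attn"]  # illustrative layout only
dim = 64
backbone = nn.ModuleList(
    SearchedAttention(dim) if t == "attn" else ConvBlock(dim)
    for t in layer_types
)

x = torch.randn(2, 16, dim)  # (batch, seq_len, hidden_dim)
for layer in backbone:
    x = layer(x)
print(x.shape)  # torch.Size([2, 16, 64])
```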