Neural architecture search (NAS) has demonstrated promising results in identifying efficient Transformer architectures that outperform manually designed ones on natural language tasks such as neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by recent advances in sparsely activated models such as the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space, which subsumes prior densely activated architectures, we develop a new framework, AutoMoE, to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 3x FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers, with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should their sizes be?); (b) SuperNet training, which jointly trains several subnetworks sampled from the large search space via weight sharing; (c) evolutionary search for the architecture with the optimal trade-off between task performance and computational constraints such as FLOPs and latency. AutoMoE code, data and trained models are available at https://github.com/microsoft/AutoMoE.
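The three-phase pipeline can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the search-space dimensions, the per-token FLOPs proxy (which assumes top-1 expert routing, so adding experts grows capacity but not per-token compute), and the `fitness` callback are all hypothetical stand-ins. In AutoMoE itself, fitness comes from evaluating sampled subnetworks with the trained SuperNet's shared weights, not from a closed-form score.

```python
import random

# Hypothetical heterogeneous search space: per-layer choices for dense
# (num_experts == 1) vs. sparsely activated MoE modules and their sizes.
SEARCH_SPACE = {
    "num_layers": [3, 4, 6],
    "num_experts": [1, 2, 4],
    "ffn_dim": [512, 1024, 2048],
    "embed_dim": [256, 512],
}

def sample_arch(rng):
    """Sample one sub-Transformer: expert count/size are chosen per layer."""
    n = rng.choice(SEARCH_SPACE["num_layers"])
    return {
        "embed_dim": rng.choice(SEARCH_SPACE["embed_dim"]),
        "layers": [
            {"num_experts": rng.choice(SEARCH_SPACE["num_experts"]),
             "ffn_dim": rng.choice(SEARCH_SPACE["ffn_dim"])}
            for _ in range(n)
        ],
    }

def flops_proxy(arch):
    """Crude per-token FLOPs proxy: with top-1 routing only one expert's FFN
    fires per token, so expert count does not enter the compute cost."""
    d = arch["embed_dim"]
    return sum(2 * d * layer["ffn_dim"] for layer in arch["layers"])

def mutate(arch, rng):
    """Perturb one layer's expert count and FFN size."""
    child = {"embed_dim": arch["embed_dim"],
             "layers": [dict(layer) for layer in arch["layers"]]}
    layer = rng.choice(child["layers"])
    layer["num_experts"] = rng.choice(SEARCH_SPACE["num_experts"])
    layer["ffn_dim"] = rng.choice(SEARCH_SPACE["ffn_dim"])
    return child

def evolutionary_search(fitness, flops_budget, iters=200, pop=16, seed=0):
    """Constrained evolutionary search: candidates over the FLOPs budget are
    discarded; survivors are ranked by the (stand-in) fitness function."""
    rng = random.Random(seed)
    population = []
    while len(population) < pop:  # rejection-sample an in-budget population
        arch = sample_arch(rng)
        if flops_proxy(arch) <= flops_budget:
            population.append(arch)
    best = max(population, key=fitness)
    for _ in range(iters):
        child = mutate(rng.choice(population), rng)
        if flops_proxy(child) > flops_budget:
            continue  # hard computational constraint
        population.append(child)
        population = sorted(population, key=fitness, reverse=True)[:pop]
        if fitness(child) > fitness(best):
            best = child
    return best
```

For example, `evolutionary_search(lambda a: sum(l["num_experts"] * l["ffn_dim"] for l in a["layers"]), flops_budget=8_000_000)` searches for the highest total expert capacity reachable within the given FLOPs budget; the returned architecture always satisfies the constraint because over-budget mutants are rejected rather than penalized.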