Analyzing and Mitigating Interference in Neural Architecture Search

Weight sharing has become the \textit{de facto} approach to reduce the training cost of neural architecture search (NAS) by reusing the weights of shared operators from previously trained child models. However, the estimated accuracy of those child models has a low rank correlation with the ground-truth accuracy, because weight sharing causes interference among different child models. In this paper, we investigate the interference issue by sampling different child models and calculating the gradient similarity of shared operators, and observe that: 1) the interference on a shared operator between two child models is positively correlated with the number of operators that differ between them; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar. Inspired by these two observations, we propose two approaches to mitigate the interference: 1) rather than randomly sampling child models for optimization, we propose a gradual modification scheme that modifies one operator between adjacent optimization steps to minimize the interference on the shared operators; 2) we force the inputs and outputs of each operator to be similar across all child models to reduce the interference. Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of the super-net, and that combining both methods achieves better results. Our searched architecture outperforms RoBERTa$_{\rm base}$ by 1.1 and 0.6 points and ELECTRA$_{\rm base}$ by 1.6 and 1.1 points on the dev and test sets of the GLUE benchmark, respectively. Extensive results on the BERT compression task, SQuAD datasets and other search spaces also demonstrate the effectiveness and generality of our proposed methods.
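The gradual modification scheme and the gradient-similarity measurement described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the operator list `CANDIDATE_OPS`, the layer count `NUM_LAYERS`, the helper names, and the use of cosine similarity as the gradient-similarity measure (the abstract says only "gradient similarity"). A child model is represented simply as one operator choice per layer.

```python
import math
import random

# Hypothetical search space: each layer picks one operator from this list.
CANDIDATE_OPS = ["self_attention", "feed_forward", "conv3", "conv5"]
NUM_LAYERS = 6


def random_child(rng):
    """Baseline: sample a child model with an independent choice per layer."""
    return [rng.choice(CANDIDATE_OPS) for _ in range(NUM_LAYERS)]


def gradual_modification(prev_child, rng):
    """Return a copy of prev_child in which exactly one layer's operator differs."""
    child = list(prev_child)
    layer = rng.randrange(len(child))
    alternatives = [op for op in CANDIDATE_OPS if op != child[layer]]
    child[layer] = rng.choice(alternatives)
    return child


def num_different_ops(child_a, child_b):
    """Count the layers on which two child models choose different operators."""
    return sum(a != b for a, b in zip(child_a, child_b))


def cosine_similarity(grad_a, grad_b):
    """Cosine similarity of two flattened gradient vectors (assumed measure).

    Lower similarity between the gradients that two child models produce on
    a shared operator is read here as stronger interference.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm = math.sqrt(sum(a * a for a in grad_a)) * math.sqrt(sum(b * b for b in grad_b))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    child = random_child(rng)
    for step in range(5):
        next_child = gradual_modification(child, rng)
        # Adjacent optimization steps differ in exactly one operator.
        print(step, num_different_ops(child, next_child), next_child)
        child = next_child
```

In a real super-net, the gradient vectors passed to `cosine_similarity` would come from back-propagating the same batch through two child models that share the operator; the sketch leaves that part out and only shows how adjacent child models are chosen and compared.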
