Weight sharing is a popular approach to reducing the cost of neural architecture search (NAS): child models reuse the weights of shared operators from previously trained child models. However, the rank correlation between the estimated accuracy and the ground-truth accuracy of those child models is low, because weight sharing causes interference among different child models. In this paper, we investigate the interference issue by sampling different child models and calculating the gradient similarity of shared operators, and observe: 1) the interference on a shared operator between two child models is positively correlated with the number of operators that differ between them; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar. Inspired by these two observations, we propose two approaches to mitigate the interference: 1) MAGIC-T: rather than randomly sampling child models for optimization, we propose a gradual modification scheme that modifies one operator between adjacent optimization steps to minimize the interference on the shared operators; 2) MAGIC-A: forcing the inputs and outputs of the operator across all child models to be similar to reduce the interference. Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of the super-net, and that combining both methods achieves better results. Our discovered architecture outperforms RoBERTa$_{\rm base}$ by 1.1 and 0.6 points and ELECTRA$_{\rm base}$ by 1.6 and 1.1 points on the dev and test sets of the GLUE benchmark, respectively. Extensive results on BERT compression, reading comprehension, and ImageNet tasks demonstrate the effectiveness and generality of our proposed methods.
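The gradual modification idea behind MAGIC-T can be illustrated with a minimal sketch. This is not the authors' implementation: the operator names, the number of layers, and every function name below are hypothetical, chosen only to show how a sampling schedule that changes exactly one operator per step differs from independent random sampling.

```python
import random

# Hypothetical search space (an assumption for illustration):
# each layer of a child model picks one operator from this list.
CANDIDATE_OPS = ["attention", "conv3", "conv5", "ffn"]
NUM_LAYERS = 6


def random_child(rng):
    """Sample a child model by choosing an operator independently per layer."""
    return [rng.choice(CANDIDATE_OPS) for _ in range(NUM_LAYERS)]


def mutate_one_operator(child, rng):
    """Return a copy of `child` in which exactly one layer's operator differs."""
    layer = rng.randrange(len(child))
    alternatives = [op for op in CANDIDATE_OPS if op != child[layer]]
    new_child = list(child)
    new_child[layer] = rng.choice(alternatives)
    return new_child


def num_different_operators(a, b):
    """Count the layers at which two child models use different operators."""
    return sum(x != y for x, y in zip(a, b))


def gradual_schedule(steps, seed=0):
    """Gradual modification: adjacent child models differ in one operator."""
    rng = random.Random(seed)
    child = random_child(rng)
    schedule = [child]
    for _ in range(steps - 1):
        child = mutate_one_operator(child, rng)
        schedule.append(child)
    return schedule


def random_schedule(steps, seed=0):
    """Baseline: every child model is sampled independently."""
    rng = random.Random(seed)
    return [random_child(rng) for _ in range(steps)]


if __name__ == "__main__":
    for name, make in [("gradual", gradual_schedule), ("random", random_schedule)]:
        sched = make(steps=1000)
        diffs = [num_different_operators(a, b) for a, b in zip(sched, sched[1:])]
        print(name, "mean differing operators per step:", sum(diffs) / len(diffs))
```

Under the first observation in the abstract, fewer differing operators between consecutive child models should mean less interference on the shared weights. The gradual schedule pins that count at one per step, whereas independent sampling lets it vary from zero up to the number of layers.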