Based on the weight-sharing mechanism, one-shot NAS methods train a supernet and then let sub-models inherit its pre-trained weights for evaluation, greatly reducing the search cost. However, several works have pointed out that the shared weights suffer from conflicting gradient descent directions during training. We further find that large gradient variance occurs during supernet training, which degrades the supernet's ranking consistency. To mitigate this issue, we propose to explicitly minimize the gradient variance of supernet training by jointly optimizing the sampling distributions of PAth and DAta (PA&DA). We theoretically derive the relationship between the gradient variance and the sampling distributions, and reveal that the optimal sampling probability is proportional to the normalized gradient norm of paths and training data. Hence, we use the normalized gradient norm as the importance indicator for paths and training data, and adopt an importance sampling strategy for supernet training. Our method requires only negligible computation cost to optimize the sampling distributions of paths and data, yet achieves lower gradient variance during supernet training and better generalization performance for the supernet, resulting in more consistent NAS. We conduct comprehensive comparisons with other improved approaches in various search spaces. Results show that our method surpasses others with more reliable ranking performance and higher accuracy of the searched architectures, demonstrating its effectiveness. Code is available at https://github.com/ShunLu91/PA-DA.
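The core sampling rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-item gradient norms are already available, and the `smoothing` mix with a uniform distribution is a common stabilization heuristic added here for illustration.

```python
import random

def importance_sampling_probs(grad_norms, smoothing=0.0):
    """Sampling probabilities proportional to normalized gradient norms.

    grad_norms: per-path or per-sample gradient norms (non-negative).
    smoothing:  optional mix with the uniform distribution for stability
                (an illustrative heuristic, not from the paper).
    """
    total = sum(grad_norms)
    n = len(grad_norms)
    # p_i = (1 - s) * g_i / sum(g) + s / n
    return [(1.0 - smoothing) * g / total + smoothing / n for g in grad_norms]

# Items with larger gradient norms are sampled more often.
probs = importance_sampling_probs([1.0, 3.0, 6.0])
batch_indices = random.choices(range(len(probs)), weights=probs, k=8)
```

Here `probs` is `[0.1, 0.3, 0.6]`, so the third item is drawn six times as often as the first; in the paper's setting the same idea is applied to both candidate paths and training samples.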