This work investigates the usage of batch normalization in neural architecture search (NAS). Specifically, Frankle et al. find that training BatchNorm only can achieve non-trivial performance. Furthermore, Chen et al. claim that training BatchNorm only can speed up the training of the one-shot NAS supernet by over ten times. Critically, there has been no effort to understand 1) why training BatchNorm only can find well-performing architectures with reduced supernet training time, and 2) how the train-BN-only supernet differs from the standard-trained supernet. We begin by showing theoretically that train-BN-only networks converge to the neural tangent kernel regime and obtain the same training dynamics as training all parameters. Our proof supports the claim that training BatchNorm only on the supernet requires less training time. We then empirically disclose that the train-BN-only supernet gives convolutions an advantage over other operators, causing unfair competition between architectures. This is because only the convolution operator is attached to a BatchNorm layer. Through experiments, we show that such unfairness makes the search algorithm prone to selecting models dominated by convolutions. To solve this issue, we introduce fairness into the search space by placing a BatchNorm layer on every operator. However, we observe that the performance predictor of Chen et al. is inapplicable to the new search space. To this end, we propose a novel composite performance indicator that evaluates networks from three perspectives derived from the theoretical properties of BatchNorm: expressivity, trainability, and uncertainty. We demonstrate the effectiveness of our approach on multiple NAS benchmarks (NAS-Bench-101, NAS-Bench-201) and search spaces (the DARTS and MobileNet search spaces).
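As a concrete illustration of the train-BN-only setting, the following is a minimal sketch (not the authors' implementation) of freezing every parameter except the BatchNorm affine weights and biases before training; the ResNet-18 backbone, input size, and optimizer settings are illustrative assumptions rather than the supernet used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# A minimal sketch of train-BN-only: freeze all parameters except the
# BatchNorm affine weights/biases, then train as usual. The ResNet-18
# backbone here is only an illustrative stand-in for a supernet.
model = models.resnet18(num_classes=10)

for module in model.modules():
    is_bn = isinstance(module, nn.modules.batchnorm._BatchNorm)
    for param in module.parameters(recurse=False):
        param.requires_grad = is_bn  # only BatchNorm gamma/beta stay trainable

# Optimize only the trainable (BatchNorm) parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)

criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 32, 32)    # dummy batch for illustration
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()                        # gradients reach only BatchNorm params
optimizer.step()
```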