Are Neural Architecture Search Benchmarks Well Designed? A Deeper Look Into Operation Importance

Neural Architecture Search (NAS) benchmarks have significantly improved the ability to develop and compare NAS methods while drastically reducing computational overhead, by providing meta-information about thousands of trained neural networks. However, tabular benchmarks have several drawbacks that can hinder fair comparisons and yield unreliable results. These benchmarks usually provide a small pool of operations in heavily constrained search spaces, typically cell-based neural networks with pre-defined outer skeletons. In this work, we conducted an empirical analysis of the widely used NAS-Bench-101, NAS-Bench-201 and TransNAS-Bench-101 benchmarks in terms of their generability and how different operations influence the performance of the generated architectures. We found that only a subset of the operation pool is required to generate architectures close to the upper bound of the performance range. Also, the performance distribution is negatively skewed, with a higher density of architectures in the upper-bound range. We consistently found convolution layers to have the highest impact on an architecture's performance, and that specific combinations of operations favor top-scoring architectures. These findings provide insights into the correct evaluation and comparison of NAS methods using NAS benchmarks, showing that directly searching on NAS-Bench-201, ImageNet16-120 and TransNAS-Bench-101 produces more reliable results than searching only on CIFAR-10. Furthermore, with this work we provide suggestions for future benchmark evaluations and design. The code used to conduct the evaluations is available at https://github.com/VascoLopes/NAS-Benchmark-Evaluation.
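To make the "negatively skewed" claim concrete, the following is a minimal sketch of how the skewness of a benchmark's accuracy distribution could be checked. It is not the authors' evaluation code: the function name `sample_skewness` and the accuracy values are hypothetical, chosen only to mimic a distribution where most architectures sit near the top of the range and a few perform poorly.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical test accuracies (NOT taken from any benchmark): most values
# cluster near the upper bound, with a short tail of weak architectures.
toy_accuracies = [0.94, 0.93, 0.93, 0.92, 0.92, 0.91, 0.90, 0.85, 0.70, 0.10]

skew = sample_skewness(toy_accuracies)
print(f"skewness = {skew:.3f}")  # a negative value indicates a long left tail
```

A negative result means the mass of the distribution lies toward the high-accuracy end, which is the situation the abstract describes: a randomly sampled architecture is then likely to land close to the upper bound, so small differences between NAS methods become harder to interpret.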
