Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g., image classification on CIFAR or ImageNet. As a result, the performance of NAS approaches in more diverse areas is poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite for evaluating methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far afield from the domains in which those methods were originally developed. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we demonstrate the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.
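The figure of 15,625 architectures equals 5^6, which is what a cell-based search space with six edges and five candidate operations per edge would give. The sketch below enumerates such a space to illustrate the count. It assumes the "standard CNN search space" is a NAS-Bench-201-style cell with four nodes. The abstract does not name the space, so the operation names, `NUM_NODES`, and `enumerate_architectures` are illustrative assumptions rather than part of the NAS-Bench-360 release.

```python
from itertools import product

# Assumed operation set (NAS-Bench-201 naming); the abstract does not list it.
OPS = ("none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3")

# Assumed cell size: 4 nodes, with one edge for every node pair (i, j), i < j.
NUM_NODES = 4
EDGES = [(i, j) for j in range(1, NUM_NODES) for i in range(j)]


def enumerate_architectures():
    """Yield every assignment of one operation to each edge of the cell."""
    for assignment in product(OPS, repeat=len(EDGES)):
        yield dict(zip(EDGES, assignment))


num_architectures = sum(1 for _ in enumerate_architectures())
print(num_architectures)  # 5 ** 6 = 15625
```

Under these assumptions the space is small enough to evaluate exhaustively, which is what makes a precomputed lookup table of architecture performance feasible.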