Vision Transformers have enabled recent attention-based Deep Learning (DL) architectures to achieve remarkable results in Computer Vision (CV) tasks. However, due to the extensive computational resources they require, these architectures are rarely deployed on resource-constrained platforms. Current research investigates hybrid handcrafted models that combine convolutions and attention for CV tasks such as image classification and object detection. In this paper, we propose HyT-NAS, an efficient Hardware-aware Neural Architecture Search (HW-NAS) over hybrid architectures, targeting vision tasks on tiny devices. HyT-NAS improves on state-of-the-art HW-NAS by enriching the search space and enhancing both the search strategy and the performance predictors. Our experiments show that HyT-NAS achieves a similar hypervolume with roughly 5x fewer training evaluations. Our resulting architecture outperforms MLPerf MobileNetV1 on Visual Wake Words by 6.3% in accuracy with 3.5x fewer parameters.