Human pose estimation has achieved significant progress in recent years. However, most recent methods focus on improving accuracy with increasingly complex models while ignoring real-time efficiency. To achieve a better trade-off between accuracy and efficiency, we propose a novel neural architecture search (NAS) method, termed ViPNAS, which searches networks at both the spatial and temporal levels for fast online video pose estimation. At the spatial level, we carefully design the search space with five dimensions: network depth, width, kernel size, group number, and attention. At the temporal level, we search over a series of temporal feature fusion methods to optimize the overall accuracy and speed across multiple video frames. To the best of our knowledge, we are the first to search for temporal feature fusion and automatic computation allocation in videos. Extensive experiments demonstrate the effectiveness of our approach on the challenging COCO2017 and PoseTrack2018 datasets. Our discovered model families, S-ViPNAS and T-ViPNAS, achieve significantly faster inference (real-time on CPU) without sacrificing accuracy compared to previous state-of-the-art methods.
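To make the spatial-level search space concrete, the sketch below shows one possible way to encode the five dimensions named in the abstract (depth, width, kernel size, group number, attention) and sample a candidate architecture from them. The dimension names, value ranges, and the number of stages are illustrative assumptions for exposition, not the paper's actual search space or implementation.

```python
import random

# Hypothetical encoding of the spatial-level search space described in the
# abstract. All choice lists below are assumed values for illustration only.
SPATIAL_SEARCH_SPACE = {
    "depth":       [2, 3, 4],          # number of blocks in a stage
    "width":       [16, 32, 48, 64],   # channel width of a stage
    "kernel_size": [3, 5, 7],          # convolution kernel size
    "groups":      [1, 2, 4],          # group count for grouped convolution
    "attention":   [None, "SE"],       # whether to attach an attention module
}

def sample_architecture(num_stages=4, space=SPATIAL_SEARCH_SPACE):
    """Randomly sample one candidate architecture, stage by stage."""
    return [
        {dim: random.choice(choices) for dim, choices in space.items()}
        for _ in range(num_stages)
    ]

if __name__ == "__main__":
    # Each sampled candidate is a per-stage list of dimension choices that a
    # NAS algorithm could then evaluate for accuracy and inference speed.
    print(sample_architecture())
```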