Deep neural networks based purely on attention have proven successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have mostly been adopted on top of standard convolutional or recurrent layers, improving overall generalization. In this work, we introduce the Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborate networks mixing convolutional, recurrent, and attentive layers. To limit computational and energy requirements, and building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to establish a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology is extensively tested on MPOSE2021 and compared to several state-of-the-art architectures, demonstrating the effectiveness of the AcT model and laying the foundations for future work on HAR.
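To make the core idea concrete, the following is a minimal numpy sketch of single-head self-attention applied to a short window of flattened 2D poses with a prepended class token. All dimensions (30-frame window, 13 keypoints, 64-d embedding, 20 action classes), the random weights, and the single attention head are illustrative assumptions, not the actual AcT architecture or its trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over a token sequence X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

# Hypothetical dimensions: T frames, P keypoints with (x, y) coordinates
T, P, d = 30, 13, 64
rng = np.random.default_rng(0)
poses = rng.standard_normal((T, P * 2))       # one flattened 2D pose per frame

W_embed = rng.standard_normal((P * 2, d)) * 0.02
tokens = poses @ W_embed                      # per-frame linear embedding
cls = np.zeros((1, d))                        # class token (learnable in practice)
seq = np.concatenate([cls, tokens], axis=0)   # (T + 1, d) token sequence

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
out = self_attention(seq, Wq, Wk, Wv)         # (T + 1, d)
logits = out[0] @ (rng.standard_normal((d, 20)) * 0.02)  # class token -> 20 actions
print(out.shape, logits.shape)
```

A full encoder would add positional embeddings, multiple heads, residual connections, layer normalization, and feed-forward sublayers; the sketch only shows how a pose window becomes a token sequence whose class token summarizes the action.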