Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Noting that hand pose estimation and action recognition operate at different temporal granularities while being semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices. We will open-source code and data to facilitate future research.
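The cascaded two-encoder hierarchy can be illustrated with a minimal PyTorch-style sketch, given below under stated assumptions: the module names, feature dimensions, window size, token fusion, and mean-pooled action readout are all illustrative choices, not the authors' exact design.

```python
# A minimal sketch of the two-level hierarchy described above (assumed
# PyTorch implementation; all names, dimensions, and the pose window
# size are illustrative assumptions, not the paper's exact design).
import torch
import torch.nn as nn


class PoseActionHierarchy(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45,
                 obj_dim=64, pose_window=8, nhead=8):
        super().__init__()
        # Short-term encoder: attends over a small window of per-frame
        # image features to regress the hand pose of each frame.
        pose_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=nhead, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(pose_layer, num_layers=4)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)

        # Long-term encoder: aggregates per-frame pose and object tokens
        # over the whole clip to classify the action.
        action_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=nhead, batch_first=True)
        self.action_encoder = nn.TransformerEncoder(action_layer, num_layers=4)
        self.token_proj = nn.Linear(num_joints * 3 + obj_dim, feat_dim)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.pose_window = pose_window

    def forward(self, frame_feats, obj_feats):
        # frame_feats: (B, T, feat_dim) per-frame image features
        # obj_feats:   (B, T, obj_dim)  per-frame object cues
        B, T, D = frame_feats.shape
        poses = []
        for t in range(T):
            # Short (causal, for simplicity) temporal window ending at frame t.
            lo = max(0, t - self.pose_window + 1)
            window = frame_feats[:, lo:t + 1]            # (B, w, D)
            enc = self.pose_encoder(window)              # (B, w, D)
            poses.append(self.pose_head(enc[:, -1]))     # (B, J*3)
        poses = torch.stack(poses, dim=1)                # (B, T, J*3)

        # Fuse per-frame pose and object information, then aggregate
        # over the full clip for action recognition.
        tokens = self.token_proj(torch.cat([poses, obj_feats], dim=-1))
        clip = self.action_encoder(tokens)               # (B, T, feat_dim)
        action_logits = self.action_head(clip.mean(dim=1))
        return poses, action_logits
```

The sketch makes the granularity split explicit: the first encoder only sees a short window around each frame, while the second sees the full clip of fused pose-and-object tokens.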