Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework that exploits temporal information for robust estimation. Noticing the different temporal granularity of, and the semantic correlation between, hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices.
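To make the two-level design concrete, the following is a minimal PyTorch sketch of a cascaded pose-then-action hierarchy as described above. All module names, feature dimensions, window sizes, and the use of `nn.TransformerEncoder` are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Illustrative sketch only: hyperparameters and module structure are assumptions.
import torch
import torch.nn as nn


class PoseActionHierarchy(nn.Module):
    """Short-window transformer for per-frame hand pose, followed by a
    long-span transformer that aggregates pose and object cues for action."""

    def __init__(self, feat_dim=512, num_joints=21, num_objects=10, num_actions=45,
                 short_window=16, pose_layers=2, action_layers=2, num_heads=8):
        super().__init__()
        self.short_window = short_window
        # Encoder 1: short-term temporal cues for hand pose estimation.
        pose_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(pose_layer, pose_layers)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)   # per-frame 3D joints
        self.object_head = nn.Linear(feat_dim, num_objects)    # per-frame object logits
        # Encoder 2: aggregates per-frame pose/object tokens over the longer span.
        action_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.action_encoder = nn.TransformerEncoder(action_layer, action_layers)
        self.token_proj = nn.Linear(num_joints * 3 + num_objects, feat_dim)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        """frame_feats: (B, T, feat_dim) per-frame image features, T divisible by short_window."""
        B, T, C = frame_feats.shape
        # Split the clip into short windows and run the pose encoder per window.
        windows = frame_feats.reshape(B * T // self.short_window, self.short_window, C)
        pose_tokens = self.pose_encoder(windows).reshape(B, T, C)
        joints = self.pose_head(pose_tokens)        # (B, T, num_joints*3)
        obj_logits = self.object_head(pose_tokens)  # (B, T, num_objects)
        # Fuse per-frame pose and object predictions into tokens for the action encoder.
        action_tokens = self.token_proj(torch.cat([joints, obj_logits], dim=-1))
        pooled = self.action_encoder(action_tokens).mean(dim=1)  # aggregate over the full span
        return joints.reshape(B, T, -1, 3), obj_logits, self.action_head(pooled)


# Example usage: a batch of two 32-frame clips with 512-d frame features.
model = PoseActionHierarchy()
joints, obj_logits, action_logits = model(torch.randn(2, 32, 512))
```

The cascade reflects the different temporal granularity of the two tasks: pose is refined within short windows, while action is recognized from pose and object tokens pooled over the whole clip.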