Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Observing that hand pose estimation and action recognition operate at different temporal granularities yet are semantically correlated, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices.
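The following is a minimal sketch of the cascaded two-encoder hierarchy described above, written in PyTorch. All module names, dimensions, window sizes, and the way object features are obtained are illustrative assumptions for exposition, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PoseActionCascade(nn.Module):
    """Two cascaded transformer encoders: short-term pose, long-term action (sketch)."""
    def __init__(self, feat_dim=512, d_model=256, n_joints=21,
                 n_actions=45, pose_window=9, n_heads=8):
        super().__init__()
        self.pose_window = pose_window  # short-term temporal window (assumed size)
        self.proj = nn.Linear(feat_dim, d_model)
        # Encoder 1: exploits short-term temporal cues for per-frame hand pose
        pose_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(pose_layer, num_layers=2)
        self.pose_head = nn.Linear(d_model, n_joints * 3)  # 3D joints per frame
        # Encoder 2: aggregates pose + object tokens over the whole clip for the action
        act_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.action_encoder = nn.TransformerEncoder(act_layer, num_layers=2)
        self.pose_token = nn.Linear(n_joints * 3, d_model)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, frame_feats, obj_feats):
        # frame_feats: (B, T, feat_dim) per-frame image features
        # obj_feats:   (B, T, feat_dim) per-frame object features (assumed given)
        B, T, _ = frame_feats.shape
        x = self.proj(frame_feats)
        # Short-term stage: attend within a sliding window around each frame
        poses = []
        half = self.pose_window // 2
        for t in range(T):
            lo, hi = max(0, t - half), min(T, t + half + 1)
            ctx = self.pose_encoder(x[:, lo:hi])          # (B, window, d_model)
            poses.append(self.pose_head(ctx[:, t - lo]))  # pose of the center frame
        poses = torch.stack(poses, dim=1)                 # (B, T, n_joints*3)
        # Long-term stage: fuse per-frame pose and object tokens across the clip
        tokens = self.pose_token(poses) + self.proj(obj_feats)
        clip = self.action_encoder(tokens).mean(dim=1)    # temporal pooling
        return poses, self.action_head(clip)
```

The cascade keeps the pose encoder's receptive field short, since articulated hand pose changes frame to frame, while the action encoder sees the entire clip, reflecting the different temporal granularities noted above.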