Estimating human poses from videos is critical in human-computer interaction. With precise pose estimates, a robot can respond appropriately to a human. Most existing approaches use optical flow, RNNs, or CNNs to extract temporal features from videos. Despite the positive results of these attempts, most of them simply aggregate features along the temporal dimension and ignore temporal correlations between joints. In contrast to previous methods, we propose a plug-and-play kinematics modeling module (KMM) based on a domain-cross attention mechanism to explicitly model temporal correlations between joints across different frames. Specifically, the proposed KMM models the temporal correlation between any two joints by calculating their temporal similarity. In this way, the KMM can learn the motion cues of each joint. Using these motion cues (temporal domain) and the historical positions of joints (spatial domain), the KMM infers the initial positions of joints in the current frame. Building on the KMM, we further present a kinematics modeling network (KIMNet) that obtains the final joint positions by combining pose features with the initial joint positions. By explicitly modeling temporal correlations between joints, KIMNet can infer currently occluded joints from all joints at the previous moment. Furthermore, because the KMM is implemented with an attention mechanism, it maintains the high resolution of the features. It can therefore transfer rich historical pose information to the current frame, providing effective cues for locating occluded joints. Our approach achieves state-of-the-art results on two standard video-based pose estimation benchmarks. Moreover, the proposed KIMNet shows robustness to occlusion, demonstrating the effectiveness of the proposed method.
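Since the abstract only outlines the mechanism, the following is a minimal PyTorch sketch of how a domain-cross attention over joints could look: queries come from a temporal (motion) cue, keys from the previous frame's joint features, and values from the previous frame's joint heatmaps, so each joint's initial position is a weighted mix of all joints' historical positions. All module names, tensor shapes, and the use of a frame difference as the motion cue are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KMMSketch(nn.Module):
    """Hypothetical sketch of a kinematics modeling module (KMM).

    Attention is computed between joints across frames rather than
    between spatial locations, so historical pose information can be
    propagated at full feature resolution. Shapes and names here are
    assumptions for illustration only.
    """
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 projections for the temporal (query) and spatial (key) domains
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_prev, feat_curr, heatmaps_prev):
        # feat_prev, feat_curr: (B, J, C, H, W) per-joint feature maps
        # heatmaps_prev:        (B, J, H, W)    joint heatmaps at frame t-1
        B, J, C, H, W = feat_prev.shape
        motion = feat_curr - feat_prev  # crude temporal cue (assumption)
        q = self.query(motion.flatten(0, 1)).view(B, J, -1)     # (B, J, C*H*W)
        k = self.key(feat_prev.flatten(0, 1)).view(B, J, -1)    # (B, J, C*H*W)
        # Joint-to-joint temporal similarity -> (B, J, J) attention weights
        attn = torch.softmax((q @ k.transpose(1, 2)) / q.shape[-1] ** 0.5, dim=-1)
        # Each joint's initial position aggregates all joints' previous
        # positions, so an occluded joint can borrow evidence from visible ones.
        init = torch.einsum('bij,bjhw->bihw', attn, heatmaps_prev)
        return init  # (B, J, H, W) initial position heatmaps for frame t
```

Under these assumptions, a full network in the spirit of KIMNet would fuse `init` with the current frame's pose features (e.g., by concatenation followed by a convolutional head) to regress the final joint heatmaps.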