Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur and occlusions. This leads to poor accuracy when downstream tasks, such as action recognition, depend on pose. End-to-end learning circumvents pose, but requires more labels to generalize. We introduce Video Pose Distillation (VPD), a weakly-supervised technique to learn features for new video domains, such as individual sports that challenge pose estimation. Under VPD, a student network learns to extract robust pose features from RGB frames in the sports video, such that, whenever pose is considered reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds, exploiting the rich visual patterns in raw video frames, while learning features that agree with the athletes' pose and motion in the target video domain to avoid over-fitting to patterns unrelated to athletes' motion. VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in four real-world sports video datasets, without requiring additional ground-truth pose annotations.
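The distillation objective described above can be illustrated with a minimal sketch: the student's per-frame features are pulled toward the teacher's pose features only on frames where the teacher's pose is considered reliable. The abstract does not specify the loss, the reliability criterion, or the feature layout, so the squared-error loss, the `teacher_confidence` scores, the `min_confidence` threshold, and the function name below are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence


def masked_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    teacher_confidence: Sequence[float],
    min_confidence: float = 0.5,  # hypothetical reliability threshold
) -> float:
    """Mean squared error between student and teacher features,
    averaged over frames whose teacher pose is deemed reliable.

    Assumption: reliability is a per-frame confidence score; the
    source only says supervision applies "whenever pose is
    considered reliable".
    """
    total = 0.0
    reliable_frames = 0
    for s, t, c in zip(student, teacher, teacher_confidence):
        if c < min_confidence:
            # Unreliable pose: this frame contributes no supervision.
            continue
        total += sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
        reliable_frames += 1
    # With no reliable frames there is nothing to distill from.
    return total / reliable_frames if reliable_frames else 0.0


if __name__ == "__main__":
    student_feats = [[1.0, 2.0], [0.0, 0.0]]
    teacher_feats = [[1.0, 4.0], [10.0, 10.0]]
    confidence = [0.9, 0.1]  # second frame is treated as unreliable
    print(masked_distillation_loss(student_feats, teacher_feats, confidence))
```

In this sketch, frames with blurred or occluded athletes would simply be skipped during training, so the student is never forced to imitate a teacher output that is likely wrong, while it still sees the raw RGB frames for every reliable example.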