Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, in addition to joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints in videos of humans and rats without requiring manual supervision, demonstrating the potential of 3D keypoint discovery for studying behavior.
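The two training signals described above can be sketched in code. This is a minimal illustration under assumptions, not the paper's implementation: the tensor shapes, the function names (`spatiotemporal_difference_loss`, `joint_length_loss`), the use of raw frame differences as the reconstruction target, and the loss weighting are all choices made for the example.

```python
# Minimal sketch (assumed shapes and names) of the two losses: per-view
# spatiotemporal-difference reconstruction, and a joint-length constraint
# on the discovered 3D skeleton.
import torch
import torch.nn.functional as F

def spatiotemporal_difference_loss(recon, frames_t, frames_t_dt):
    """The decoder's target is the difference between two frames of the
    same view separated by a time offset, compared across all views.

    recon:       (V, B, C, H, W) decoder outputs, one per camera view
    frames_t:    (V, B, C, H, W) frames at time t
    frames_t_dt: (V, B, C, H, W) frames at time t + dt
    """
    target = frames_t_dt - frames_t  # spatiotemporal difference per view
    return F.mse_loss(recon, target)

def joint_length_loss(keypoints_3d, edges, target_lengths):
    """Penalize deviation of learned skeleton segment lengths from a
    reference, encouraging segment lengths to stay consistent over time.

    keypoints_3d:   (B, T, K, 3) discovered 3D keypoints over time
    edges:          list of (i, j) index pairs defining skeleton segments
    target_lengths: (len(edges),) reference lengths (e.g., a running mean)
    """
    losses = []
    for e, (i, j) in enumerate(edges):
        seg = keypoints_3d[:, :, i] - keypoints_3d[:, :, j]  # (B, T, 3)
        length = seg.norm(dim=-1)                            # (B, T)
        losses.append(((length - target_lengths[e]) ** 2).mean())
    return torch.stack(losses).mean()

# Example usage with random tensors (all sizes are placeholders):
V, B, C, H, W, T, K = 4, 2, 3, 64, 64, 8, 15
recon = torch.randn(V, B, C, H, W)
f_t, f_dt = torch.randn(V, B, C, H, W), torch.randn(V, B, C, H, W)
kp3d = torch.randn(B, T, K, 3)
edges = [(0, 1), (1, 2), (2, 3)]
ref = torch.ones(len(edges))
loss = (spatiotemporal_difference_loss(recon, f_t, f_dt)
        + 0.1 * joint_length_loss(kp3d, edges, ref))
```

The joint-length term treats the discovered keypoints as a skeleton whose segment lengths should remain stable across frames, which is one way to encode the length constraint the abstract mentions; how the reference lengths are obtained is left open here.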