We propose a method for learning the posture and structure of agents from unlabelled behavioral videos. Starting from the observation that behaving agents are generally the main sources of movement in behavioral videos, our method, Behavioral Keypoint Discovery (B-KinD), uses an encoder-decoder architecture with a geometric bottleneck to reconstruct the spatiotemporal difference between video frames. By focusing only on regions of movement, our approach works directly on input videos without requiring manual annotations. Experiments on a variety of agent types (mouse, fly, human, jellyfish, and trees) demonstrate the generality of our approach and reveal that the discovered keypoints correspond to semantically meaningful body parts and achieve state-of-the-art performance on keypoint regression among self-supervised methods. Additionally, B-KinD achieves performance comparable to supervised keypoints on downstream tasks, such as behavior classification, suggesting that our method can dramatically reduce model training costs relative to supervised methods.
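To make the core idea concrete, below is a minimal sketch, assuming a PyTorch implementation: an encoder extracts appearance features from frame t and keypoint heatmaps from frame t+k; a geometric bottleneck collapses each heatmap to 2D coordinates via a spatial softmax and re-renders them as Gaussian maps; and a decoder reconstructs the difference between the two frames from appearance plus geometry. The layer widths, module names (`GeometricBottleneck`, `BKinDSketch`), Gaussian sigma, and the use of a raw absolute frame difference with an MSE loss are illustrative assumptions, not the paper's exact architecture or reconstruction objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricBottleneck(nn.Module):
    """Collapse K heatmaps to 2D keypoints (spatial softmax),
    then re-render them as isotropic Gaussian maps."""
    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma  # width of re-rendered Gaussians (assumed value)

    def forward(self, heatmaps):                          # (B, K, H, W)
        B, K, H, W = heatmaps.shape
        probs = F.softmax(heatmaps.view(B, K, -1), dim=-1).view(B, K, H, W)
        ys = torch.linspace(-1, 1, H, device=heatmaps.device)
        xs = torch.linspace(-1, 1, W, device=heatmaps.device)
        mu_y = (probs.sum(dim=3) * ys).sum(dim=2)         # (B, K)
        mu_x = (probs.sum(dim=2) * xs).sum(dim=2)         # (B, K)
        # Re-render: Gaussian centered at each keypoint coordinate.
        d2 = (ys.view(1, 1, H, 1) - mu_y.view(B, K, 1, 1)) ** 2 \
           + (xs.view(1, 1, 1, W) - mu_x.view(B, K, 1, 1)) ** 2
        maps = torch.exp(-d2 / (2 * self.sigma ** 2))     # (B, K, H, W)
        return torch.stack([mu_x, mu_y], dim=-1), maps

class BKinDSketch(nn.Module):
    """Encoder-decoder that reconstructs the frame difference from
    appearance features of frame t and keypoint geometry of frame t+k."""
    def __init__(self, n_keypoints: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.to_heatmaps = nn.Conv2d(64, n_keypoints, 1)
        self.bottleneck = GeometricBottleneck()
        self.decoder = nn.Sequential(
            nn.Conv2d(64 + n_keypoints, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, frame_t, frame_tk):
        appearance = self.encoder(frame_t)                # source appearance
        heatmaps = self.to_heatmaps(self.encoder(frame_tk))
        keypoints, gauss_maps = self.bottleneck(heatmaps)
        recon = self.decoder(torch.cat([appearance, gauss_maps], dim=1))
        # Target is the spatiotemporal difference between frames (here a
        # simple absolute difference as a stand-in), so reconstruction
        # succeeds only if the keypoints track the moving agent.
        target = (frame_tk - frame_t).abs()
        return keypoints, F.mse_loss(recon, target)

# Example: two 64x64 RGB frames sampled k steps apart.
model = BKinDSketch(n_keypoints=10)
f_t, f_tk = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
keypoints, loss = model(f_t, f_tk)   # keypoints: (2, 10, 2) in [-1, 1]
```

Because static background cancels in the frame difference, the bottleneck receives gradient only through regions of movement, which is why no manual annotation is needed for the keypoints to settle on the behaving agent.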