Many image-based perception tasks can be formulated as detecting, associating and tracking semantic keypoints, e.g., human body pose estimation and tracking. In this work, we present a general framework that jointly detects keypoints and forms their spatio-temporal associations in a single stage, making it the first real-time pose detection and tracking algorithm. We present a generic neural network architecture that uses Composite Fields to detect and construct a spatio-temporal pose, which is a single, connected graph whose nodes are the semantic keypoints (e.g., a person's body joints) in multiple frames. For the temporal associations, we introduce the Temporal Composite Association Field (TCAF), which requires an extended network architecture and training method beyond previous Composite Fields. Our experiments on multiple publicly available datasets, such as COCO, CrowdPose and the PoseTrack 2017 and 2018 datasets, show that our method achieves competitive accuracy while being an order of magnitude faster. We also show that our method generalizes to any class of semantic keypoints, such as car and animal parts, to provide a holistic perception framework that is well suited for urban mobility applications such as self-driving cars and delivery robots.
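As an illustration of the spatio-temporal pose described above, the following minimal sketch builds a single connected graph whose nodes are keypoints in multiple frames. It is not the authors' implementation and involves no neural network: the five-joint skeleton, the function names, and the choice to link each keypoint to the same keypoint in the next frame are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical 5-joint skeleton for illustration (not the COCO skeleton).
KEYPOINTS = ["head", "left_hand", "right_hand", "left_foot", "right_foot"]
SKELETON = [("head", "left_hand"), ("head", "right_hand"),
            ("head", "left_foot"), ("head", "right_foot")]


def build_spatio_temporal_pose(num_frames):
    """Return adjacency sets over nodes (frame_index, keypoint_name)."""
    graph = defaultdict(set)
    for t in range(num_frames):
        # Spatial associations: skeleton edges within frame t.
        for a, b in SKELETON:
            graph[(t, a)].add((t, b))
            graph[(t, b)].add((t, a))
        # Temporal associations (assumed form): link each keypoint
        # to the same keypoint in frame t + 1.
        if t + 1 < num_frames:
            for k in KEYPOINTS:
                graph[(t, k)].add((t + 1, k))
                graph[(t + 1, k)].add((t, k))
    return graph


def is_connected(graph):
    """Check with a depth-first search that the graph is one component."""
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(graph)


pose = build_spatio_temporal_pose(num_frames=2)
print(len(pose), is_connected(pose))
```

With two frames, the sketch yields ten nodes joined by eight spatial and five temporal edges, and the connectivity check confirms that the pose forms one component across both frames.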