We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model is able to directly regress the poses of all individuals, utilizing a bipartite matching scheme. POET is trained using a novel set-based global loss that consists of a keypoint loss, a visibility loss, and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we show successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this model is the first end-to-end trainable multi-instance pose estimation method, and we hope it will serve as a simple and promising alternative.
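The bipartite matching scheme mentioned above can be illustrated with a minimal sketch: each predicted pose is assigned to at most one ground-truth pose by minimizing a pairwise cost via the Hungarian algorithm. This is an illustrative toy example, not the paper's implementation; the cost here is a plain mean L1 keypoint distance, whereas the actual matching cost would also include the visibility and class terms of the set-based loss, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(pred_keypoints, gt_keypoints):
    """Match predicted poses to ground-truth poses.

    pred_keypoints: (P, K, 2) array of P predicted poses with K keypoints.
    gt_keypoints:   (G, K, 2) array of G ground-truth poses.
    Returns (pred_idx, gt_idx) index arrays of the optimal one-to-one
    assignment (unmatched predictions are treated as "no object").
    """
    # Pairwise mean L1 distance between every prediction and ground truth.
    cost = np.abs(pred_keypoints[:, None] - gt_keypoints[None, :]).mean(axis=(2, 3))
    # Hungarian algorithm: globally optimal bipartite assignment.
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx

# Toy example: 3 predictions, 2 ground-truth poses with 4 keypoints each.
gt = np.array([[[0., 0.], [1., 0.], [0., 1.], [1., 1.]],
               [[5., 5.], [6., 5.], [5., 6.], [6., 6.]]])
# pred[0] is close to gt[1], pred[1] is close to gt[0], pred[2] is far away.
pred = np.array([gt[1] + 0.1, gt[0] - 0.1, gt[0] + 10.0])
pi, gi = match_poses(pred, gt)
# pi → [0, 1], gi → [1, 0]: predictions 0 and 1 are matched, 2 is unmatched.
```

In the set-prediction formulation, this matching makes the loss permutation-invariant: the keypoint, visibility, and class losses are computed only between matched pairs, so no hand-crafted grouping or non-maximum suppression is needed.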