Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms require training one network per object class to achieve the best results. Analysing all visible objects then demands multiple inferences, which is memory- and time-consuming. We present a new single-stage architecture called CASAPose that determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass. It is fast and memory efficient, and achieves high accuracy for multiple objects by exploiting the output of a semantic segmentation decoder as control input to a keypoint recognition decoder via local class-adaptive normalisation. Our new differentiable regression of keypoint locations contributes significantly to a faster closing of the domain gap between real test and synthetic training data. We apply segmentation-aware convolutions and upsampling operations to increase the focus inside the object mask and to reduce mutual interference of occluding objects. For each inserted object, the network grows by only one output segmentation map and a negligible number of parameters. We outperform state-of-the-art approaches in challenging multi-object scenes with inter-object occlusion and synthetic training.
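The local class-adaptive normalisation mentioned above conditions the keypoint decoder on the segmentation output by applying per-class scale and shift parameters at each pixel. The following is a minimal NumPy sketch of this idea, not the paper's implementation: the function name, tensor layout, and instance-norm-style standardisation are assumptions for illustration; in the real model `gamma` and `beta` would be learned.

```python
import numpy as np

def class_adaptive_norm(feat, seg, gamma, beta, eps=1e-5):
    """Hypothetical sketch of local class-adaptive normalisation.

    feat:  (C, H, W) feature map from the keypoint recognition decoder
    seg:   (H, W) integer class labels from the segmentation decoder
    gamma: (K, C) per-class scale parameters (learned in the real model)
    beta:  (K, C) per-class shift parameters (learned in the real model)
    """
    # Standardise each channel (instance-norm style; an assumption here)
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / (std + eps)

    # Look up the class-specific scale/shift at every pixel, so each
    # object class modulates the features inside its own mask region
    g = gamma[seg].transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)
    b = beta[seg].transpose(2, 0, 1)
    return normed * g + b
```

Because the modulation is selected per pixel from the segmentation map, adding one more object class only adds one row to `gamma` and `beta` (plus one segmentation output map), which matches the abstract's claim of negligible parameter growth per inserted object.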