Estimating the relative pose of a new object without prior knowledge is a hard problem, yet it is an ability much needed in robotics and Augmented Reality. We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore handle unknown objects in the open world instantly, without requiring any prior information or a specific training phase. We consider two architectures: one based on two frames, and the other relying on a Transformer Encoder that can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data). Our source code is available at https://github.com/nv-nguyen/pizza.
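Since the abstract only names the Transformer-Encoder variant, the following is a minimal PyTorch sketch of the general idea, assuming a small CNN backbone, learned temporal positional embeddings, and a 6-parameter relative-pose head; the class name, layer sizes, and pose parameterization are illustrative assumptions, not the authors' PIZZA implementation.

```python
import torch
import torch.nn as nn


class MultiFramePoseTracker(nn.Module):
    """Hypothetical sketch: a Transformer Encoder over per-frame features
    regressing relative 6D pose updates (not the authors' exact model)."""

    def __init__(self, feat_dim=256, n_heads=8, n_layers=4, max_frames=16):
        super().__init__()
        # Per-frame feature extractor (placeholder backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Learned positional embedding over the temporal axis.
        self.pos_emb = nn.Parameter(torch.zeros(max_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Head regressing a relative pose per consecutive frame pair:
        # 3 rotation parameters (e.g. axis-angle) + 3 translation offsets.
        self.pose_head = nn.Linear(feat_dim, 6)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) crops of the tracked object over T past frames.
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        feats = feats + self.pos_emb[:t]
        tokens = self.encoder(feats)          # (B, T, feat_dim)
        return self.pose_head(tokens[:, 1:])  # (B, T-1, 6) relative pose updates


if __name__ == "__main__":
    model = MultiFramePoseTracker()
    video_crops = torch.randn(2, 8, 3, 64, 64)  # batch of 2, 8 past frames each
    print(model(video_crops).shape)             # torch.Size([2, 7, 6])
```

Feeding an arbitrary number of past frames to the encoder is what lets such a design accumulate temporal context, in contrast to the two-frame variant, which sees only the current and previous frame.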