Object pose estimation is an integral part of robot vision and AR. Previous 6D pose estimation pipelines either treat the problem as a regression task or discretize the pose space for classification. We change this paradigm and reformulate the problem as an action decision process in which an initial pose is updated in incremental discrete steps that sequentially move a virtual 3D rendering towards the correct solution. A neural network iteratively estimates likely moves from a single RGB image and thereby arrives at an acceptable final pose. In contrast to other approaches that train object-specific pose models, we learn a decision process. This allows for a lightweight architecture that naturally generalizes to unseen objects. A coherent stop action for process termination enables a dynamic reduction of the computation cost when changes in a video sequence are insignificant. Instead of a static inference time, the runtime thereby adapts automatically to the object motion. Robustness and accuracy of our action decision network are evaluated on Laval and YCB video scenes, where we significantly improve the state of the art.
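The iterative action decision loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: the learned network is replaced by a hypothetical greedy policy over a toy 6D pose vector, and the discrete step size, action set, and `STOP` index are assumptions chosen for clarity.

```python
import numpy as np

# Hypothetical discrete action set: a +/- step along each of the 6 pose
# dimensions (3 translation, 3 rotation), plus a dedicated STOP action.
STEP = 0.05
ACTIONS = [np.eye(6)[i] * s for i in range(6) for s in (STEP, -STEP)]
STOP = len(ACTIONS)  # index of the stop action

def greedy_policy(pose, target, tol=1e-3):
    """Stand-in for the learned network: choose the action that most
    reduces the pose error, or STOP once within tolerance."""
    if np.abs(pose - target).max() < STEP / 2 + tol:
        return STOP
    errors = [np.abs((pose + a) - target).sum() for a in ACTIONS]
    return int(np.argmin(errors))

def refine(pose, target, max_steps=200):
    """Apply predicted discrete actions until STOP terminates the process,
    mirroring the sequential refinement towards the correct solution."""
    for _ in range(max_steps):
        action = greedy_policy(pose, target)
        if action == STOP:
            break  # coherent stop action ends the decision process
        pose = pose + ACTIONS[action]
    return pose
```

In a video setting, the stop action fires early whenever the object barely moves between frames, so the number of loop iterations (and hence the runtime) scales with the object motion rather than being fixed.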