We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment. We plan to release our source code.
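To make the sparsification idea concrete: because each 3D location projects to a single image location per view, attention between the 3D feature grid and a 2D view grid only needs to connect a voxel with the 2D cell its projection falls into. The following is a minimal sketch of building such a sparse attention mask for one view under a pinhole camera model; the function name, arguments, and cell size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sparse_attention_mask(voxel_centers, K, R, t, grid_hw, cell=16):
    """Hypothetical sketch: allow attention only between each 3D voxel and
    the single 2D feature cell it projects into, for one view.

    voxel_centers: (V, 3) world-space voxel centers
    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation/translation
    grid_hw: (H, W) shape of the 2D feature grid; cell: pixels per feature cell
    Returns a (V, H*W) boolean mask (True = attention allowed).
    """
    # Project voxel centers: x_cam = R X + t, then pixel = K x_cam / depth
    cam = voxel_centers @ R.T + t              # (V, 3) camera coordinates
    pix = cam @ K.T                            # (V, 3) homogeneous pixels
    z = pix[:, 2:3]
    uv = pix[:, :2] / np.clip(z, 1e-6, None)   # (V, 2) pixel coordinates
    H, W = grid_hw
    col = (uv[:, 0] // cell).astype(int)       # 2D feature-grid column
    row = (uv[:, 1] // cell).astype(int)       # 2D feature-grid row
    # Keep only voxels in front of the camera and inside the image
    in_view = (z[:, 0] > 0) & (col >= 0) & (col < W) & (row >= 0) & (row < H)
    mask = np.zeros((len(voxel_centers), H * W), dtype=bool)
    v = np.nonzero(in_view)[0]
    mask[v, row[v] * W + col[v]] = True
    return mask
```

Each voxel attends to at most one cell per view instead of all H*W cells, so the attention weight matrix has O(V) rather than O(V·H·W) nonzero entries per view, which is what makes the architecture tractable in memory and compute.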