Humans perceive and construct the surrounding world as an arrangement of simple parametric models. In particular, man-made environments commonly consist of volumetric primitives such as cuboids or cylinders. Inferring these primitives is an important step towards attaining high-level, abstract scene descriptions. Previous approaches directly estimate shape parameters from a 2D or 3D input, and are only able to reproduce simple objects, but are unable to accurately parse more complex 3D scenes. In contrast, we propose a robust estimator for primitive fitting, which can meaningfully abstract real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to 3D features, such as a depth map. We condition the network on previously detected parts of the scene, thus parsing it one-by-one. To obtain 3D features from a single RGB image, we additionally optimise a feature extraction CNN in an end-to-end manner. However, naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene behind them. We thus propose an occlusion-aware distance metric that correctly handles opaque scenes. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the challenging NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
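To make the point-to-primitive distance mentioned above concrete, here is a minimal sketch of the unsigned distance from 3D points to a cuboid surface. For simplicity it assumes an axis-aligned cuboid parameterised by a centre and half-extents; the actual method fits oriented cuboids and additionally uses the occlusion-aware metric described in the abstract, neither of which is shown here. All names are illustrative, not the paper's implementation.

```python
import numpy as np

def point_to_cuboid_distance(points, center, half_extents):
    """Unsigned distance from each 3D point to the surface of an
    axis-aligned cuboid; points inside the cuboid get distance 0.

    points:       (N, 3) array of 3D points
    center:       (3,) cuboid centre
    half_extents: (3,) half side lengths along x, y, z
    """
    # Per-axis overshoot beyond the cuboid faces (negative if inside).
    q = np.abs(points - center) - half_extents
    # Clamp inside-axes to zero, then take the Euclidean norm.
    return np.linalg.norm(np.maximum(q, 0.0), axis=-1)

# Example: a cube with side length 1 centred at the origin.
pts = np.array([[0.0, 0.0, 0.0],   # inside the cube
                [2.0, 0.0, 0.0],   # 1.5 beyond the +x face
                [2.0, 2.0, 0.0]])  # beyond a cube edge
d = point_to_cuboid_distance(pts, np.zeros(3), np.full(3, 0.5))
```

A RANSAC estimator would score a sampled cuboid hypothesis by thresholding such distances into inlier counts; naively minimising this quantity alone is exactly what the abstract notes can produce oversized cuboids that occlude geometry behind them.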