We present Mix3D, a data augmentation technique for segmenting large-scale 3D scenes. Since scene context helps reasoning about object semantics, current works focus on models with large capacity and receptive fields that can fully capture the global context of an input 3D scene. However, strong contextual priors can have detrimental implications, such as mistaking a pedestrian crossing the street for a car. In this work, we focus on the importance of balancing global scene context and local geometry, with the goal of generalizing beyond the contextual priors of the training set. In particular, we propose a "mixing" technique that creates new training samples by combining two augmented scenes. By doing so, object instances are implicitly placed into novel out-of-context environments, making it harder for models to rely on scene context alone and encouraging them to infer semantics from local structure as well. We perform a detailed analysis to understand the importance of global context, local structure, and the effect of mixing scenes. In experiments, we show that models trained with Mix3D benefit from a significant performance boost on indoor (ScanNet, S3DIS) and outdoor (SemanticKITTI) datasets. Mix3D can be trivially used with any existing method; e.g., trained with Mix3D, MinkowskiNet outperforms all prior state-of-the-art methods by a significant margin on the ScanNet test benchmark with 78.1 mIoU. Code is available at: https://nekrasov.dev/mix3d/
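To illustrate the core idea, here is a minimal sketch of the mixing step, assuming two scenes are represented as NumPy point/label arrays and have already been independently augmented; the function name and centering choice are illustrative, not the authors' exact implementation:

```python
import numpy as np

def mix_scenes(points_a, labels_a, points_b, labels_b):
    """Combine two augmented 3D scenes into one training sample.

    Each scene is centered at the origin so the two point clouds
    overlap, placing objects into novel out-of-context surroundings.
    """
    # Center each scene independently before merging.
    points_a = points_a - points_a.mean(axis=0)
    points_b = points_b - points_b.mean(axis=0)

    # The mixed sample is simply the union of points and labels.
    mixed_points = np.concatenate([points_a, points_b], axis=0)
    mixed_labels = np.concatenate([labels_a, labels_b], axis=0)
    return mixed_points, mixed_labels
```

In practice, a segmentation model (e.g., a sparse-convolutional network such as MinkowskiNet) is then trained on the mixed sample as if it were a single scene, with per-point losses computed against the concatenated labels.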