Understanding the ambient scene is imperative for several applications such as autonomous driving and navigation. Obtaining real-world image data with per-pixel labels is challenging, while existing accurate synthetic image datasets primarily focus on indoor spaces with fixed lighting and scene participants, severely limiting their applicability to outdoor scenarios. In this work we introduce OmniHorizon, a synthetic dataset with 24,335 omnidirectional views comprising a broad range of indoor and outdoor spaces with buildings, streets, and diverse vegetation. Our dataset also accounts for dynamic scene components, including lighting, different times of day, pedestrians, and vehicles. Furthermore, we demonstrate a learned synthetic-to-real cross-domain inference method for in-the-wild 3D scene depth and normal estimation using our dataset. To this end, we propose UBotNet, an architecture based on a UNet and a Bottleneck Transformer, to estimate scene-consistent normals. We show that UBotNet achieves significantly improved depth accuracy (4.6%) and normal estimation (5.75%) compared to several existing networks, such as U-Net with skip-connections. Finally, we demonstrate in-the-wild depth and normal estimation on real-world images with UBotNet trained purely on our OmniHorizon dataset, showing the promise of the proposed dataset and network for scene understanding.
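The abstract names UBotNet as a hybrid of a UNet and a Bottleneck Transformer but gives no implementation details. Below is a minimal PyTorch sketch of how such a hybrid could be structured: a standard UNet encoder/decoder with skip-connections, plus a multi-head self-attention block applied at the bottleneck (the core idea of a Bottleneck Transformer). All class names (`UBotNetSketch`, `ConvBlock`, `BottleneckAttention`), channel widths, and depths here are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 convolutions with BatchNorm and ReLU (standard UNet stage)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class BottleneckAttention(nn.Module):
    """Self-attention over the bottleneck feature map, treating each spatial
    location as a token, in the spirit of a Bottleneck Transformer block."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        out, _ = self.attn(seq, seq, seq)           # global self-attention
        seq = self.norm(seq + out)                  # residual + layer norm
        return seq.transpose(1, 2).reshape(b, c, h, w)


class UBotNetSketch(nn.Module):
    """Hypothetical UNet with attention at the bottleneck; the output head
    predicts a dense 3-channel map (e.g. per-pixel surface normals)."""
    def __init__(self, in_ch=3, out_ch=3, base=32):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(base * 2, base * 4)
        self.attn = BottleneckAttention(base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = ConvBlock(base * 4, base * 2)   # concat halves -> base*4 in
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = ConvBlock(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.attn(self.bottleneck(self.pool(e2)))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip-connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip-connection
        return self.head(d1)


if __name__ == "__main__":
    net = UBotNetSketch()
    x = torch.randn(1, 3, 128, 256)   # a small 2:1 equirectangular-style crop
    print(net(x).shape)               # torch.Size([1, 3, 128, 256])
```

The intuition behind this design, as suggested by the abstract's "scene-consistent normals", is that convolutional stages capture local texture while the attention block at the lowest resolution lets distant parts of the panorama inform one another, which is cheap there because the token count is smallest at the bottleneck.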