Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative modeling techniques have shown promising progress in addressing these challenges by learning 3D assets from plentiful 2D images alone, but they remain limited because they rely on either human-curated image datasets or renderings from manually created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations, and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 520K images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.