Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small in size and approaches specialize in a few object categories and specific domains, e.g. urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark, called Omni3D. Omni3D re-purposes and combines existing datasets, resulting in 234k images annotated with more than 3 million instances and 97 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on the larger Omni3D and existing benchmarks. Finally, we prove that Omni3D is a powerful dataset for 3D object recognition and show that it improves single-dataset performance and can accelerate learning on new smaller datasets via pre-training.