Being data-driven is one of the defining properties of deep learning algorithms. The advent of ImageNet drove a remarkable trend of "learning from large-scale data" in computer vision. Pretraining on ImageNet to obtain rich universal representations has been shown to benefit various 2D visual tasks and has become standard practice in 2D vision. However, because collecting real-world 3D data is laborious, there is not yet a generic dataset that serves as a counterpart of ImageNet in 3D vision, so how such a dataset could impact the 3D community remains unexplored. To fill this gap, we introduce MVImgNet, a large-scale dataset of multi-view images that is easy to collect by shooting videos of real-world objects in daily life. It contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. The multi-view attribute endows our dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision. We conduct pilot studies to probe the potential of MVImgNet on a variety of 3D and 2D visual tasks, including radiance field reconstruction, multi-view stereo, and view-consistent image understanding, where MVImgNet demonstrates promising performance and leaves many possibilities for future exploration. In addition, via dense reconstruction on MVImgNet, we derive a 3D object point cloud dataset called MVPNet, covering 87,200 samples from 150 categories, with a class label on each point cloud. Experiments show that MVPNet can benefit real-world 3D object classification while posing new challenges to point cloud understanding. MVImgNet and MVPNet will be made publicly available, in the hope of inspiring the broader vision community.