Perceiving 3D objects from monocular inputs is crucial for robotic systems, given its economy compared to multi-sensor settings. It is notably difficult because a single image cannot provide direct cues for predicting absolute depth values. Motivated by binocular methods for 3D object detection, we take advantage of the strong geometric structure provided by camera ego-motion for accurate object depth estimation and detection. We first conduct a theoretical analysis of this general two-view case and observe two challenges: 1) cumulative errors from multiple estimations that make direct prediction intractable; 2) inherent dilemmas caused by static cameras and matching ambiguity. Accordingly, we establish stereo correspondence with a geometry-aware cost volume as an alternative for depth estimation, and further compensate it with monocular understanding to address the second problem. Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features into 3D space and detects 3D objects thereon. We also present a pose-free variant of DfM that remains usable when the camera pose is unavailable. Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark. Detailed quantitative and qualitative analyses also validate our theoretical conclusions. The code will be released at https://github.com/Tai-Wang/Depth-from-Motion.
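The geometry-aware cost volume mentioned above can be illustrated with a minimal plane-sweep sketch: for each hypothesized depth, reference-view pixels are back-projected, transformed by the known ego-motion, and re-projected into the source view, and a matching cost is accumulated from feature correlation. This is only a schematic under assumed inputs (the function name, nearest-neighbour warping, and the simple dot-product cost are illustrative, not the paper's actual implementation):

```python
import numpy as np

def plane_sweep_cost_volume(feat_ref, feat_src, K, R, t, depths):
    """Schematic plane-sweep cost volume between two views.

    feat_ref, feat_src: (C, H, W) feature maps from reference/source views.
    K: (3, 3) camera intrinsics; (R, t): relative pose mapping reference-
    frame points into the source frame (assumed known from ego-motion).
    depths: iterable of depth hypotheses defined in the reference frame.
    Returns a cost volume of shape (len(depths), H, W), where each slice
    holds the feature correlation at that hypothesized depth.
    """
    C, H, W = feat_ref.shape
    # Homogeneous pixel grid of the reference view, shape (3, H*W)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    rays = np.linalg.inv(K) @ pix  # back-projected viewing rays

    volume = np.zeros((len(depths), H, W))
    for i, d in enumerate(depths):
        # Place every reference pixel at depth d, move it to the source frame
        pts_src = R @ (rays * d) + t[:, None]
        proj = K @ pts_src
        u = np.round(proj[0] / proj[2]).astype(int).reshape(H, W)
        v = np.round(proj[1] / proj[2]).astype(int).reshape(H, W)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) \
            & (proj[2].reshape(H, W) > 0)
        # Nearest-neighbour warp of source features; correlation as cost
        warped = np.zeros_like(feat_ref)
        warped[:, valid] = feat_src[:, v[valid], u[valid]]
        volume[i] = (feat_ref * warped).mean(axis=0)
    return volume
```

Taking the per-pixel argmax over the depth dimension of such a volume yields a coarse depth estimate; in the framework described above, the volume instead feeds features lifted into 3D space for detection.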