We present a novel approach for estimating depth from a monocular camera as it moves through complex and crowded indoor environments, e.g., a department store or a metro station. Our approach predicts absolute-scale depth maps over the entire scene, consisting of a static background and multiple moving people, by training on dynamic scenes. Since it is difficult to collect dense depth maps in crowded indoor environments, we design our training framework so that it does not require depth maps produced by depth-sensing devices. Instead, our network leverages RGB images and sparse depth maps generated by traditional 3D reconstruction methods to estimate dense depth maps. We use two constraints to handle depth for non-rigidly moving people without explicitly tracking their motion. We demonstrate that our approach offers consistent improvements over recent depth estimation methods on the NAVERLABS dataset, which includes complex and crowded scenes.
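The abstract describes supervising a dense depth prediction with sparse depth maps from traditional 3D reconstruction. A common way to realize this kind of supervision, sketched here under our own assumptions (the paper does not specify its loss), is a masked loss evaluated only at pixels where the sparse reconstruction provides a depth value:

```python
import numpy as np

def sparse_depth_loss(pred, sparse_gt):
    """Masked L1 loss between a dense depth prediction and a sparse
    depth map, where zeros in sparse_gt mark pixels with no
    reconstructed depth. Hypothetical illustration, not the paper's
    actual objective."""
    mask = sparse_gt > 0  # supervise only where sparse depth exists
    if not mask.any():
        return 0.0
    return float(np.abs(pred[mask] - sparse_gt[mask]).mean())

# Toy example: a 2x2 predicted depth map and a sparse map with
# depth available at only two pixels.
pred = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
sparse_gt = np.array([[0.0, 2.5],
                      [0.0, 3.0]])
loss = sparse_depth_loss(pred, sparse_gt)  # mean of |2-2.5| and |4-3|
```

Because the sparse points come from 3D reconstruction at metric scale, supervising against them is one way such a network can recover absolute-scale depth without a depth sensor.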