The perception system in personalized mobile agents requires developing indoor scene understanding models that can understand 3D geometries, capture objectness, analyze human behaviors, etc. Nonetheless, this direction has not been well explored compared with models for outdoor environments (e.g., autonomous driving systems that include pedestrian prediction, car detection, traffic sign recognition, etc.). In this paper, we first discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments, along with other challenges such as fusing heterogeneous sources of information (e.g., RGB images and Lidar point clouds), modeling relationships between a diverse set of outputs (e.g., 3D object locations, depth estimates, and human poses), and computational efficiency. We then describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle these challenges. MMISM takes RGB images as well as sparse Lidar points as inputs and performs 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks. We show that MMISM performs on par with or even better than single-task models; e.g., it improves the baseline 3D object detection results by 11.7% on the benchmark ARKitScenes dataset.
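To make the input/output structure described above concrete, the following is a minimal, hypothetical sketch of a multi-modality-input, multi-task-output model in PyTorch. The module names, channel sizes, and fusion strategy (rasterizing sparse Lidar points into a depth map and concatenating encoder features) are illustrative assumptions for exposition only, not MMISM's actual architecture.

```python
# Minimal sketch of a multi-modality input, multi-task output model.
# All layer widths and the concatenation-based fusion are assumptions;
# the real MMISM architecture is described in the paper body.
import torch
import torch.nn as nn


class MultiModalMultiTaskModel(nn.Module):
    def __init__(self, num_classes: int = 20, num_joints: int = 17):
        super().__init__()
        # Modality-specific encoders: RGB images (3 channels) and a sparse
        # depth map rasterized from Lidar points (1 channel).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Shared trunk fusing the concatenated modality features.
        self.fusion = nn.Sequential(
            nn.Conv2d(64 + 32, 128, 3, padding=1), nn.ReLU(),
        )
        # One head per output task (kept at 1/4 input resolution for brevity).
        self.depth_head = nn.Conv2d(128, 1, 1)           # depth completion
        self.seg_head = nn.Conv2d(128, num_classes, 1)   # semantic segmentation
        self.pose_head = nn.Conv2d(128, num_joints, 1)   # human joint heatmaps
        self.det_head = nn.Conv2d(128, 7, 1)             # 3D box params (x, y, z, w, h, l, yaw)

    def forward(self, rgb: torch.Tensor, sparse_depth: torch.Tensor) -> dict:
        features = torch.cat(
            [self.rgb_encoder(rgb), self.lidar_encoder(sparse_depth)], dim=1
        )
        features = self.fusion(features)
        return {
            "depth": self.depth_head(features),
            "segmentation": self.seg_head(features),
            "pose_heatmaps": self.pose_head(features),
            "boxes_3d": self.det_head(features),
        }


if __name__ == "__main__":
    model = MultiModalMultiTaskModel()
    rgb = torch.randn(2, 3, 256, 256)           # batch of RGB images
    sparse_depth = torch.randn(2, 1, 256, 256)  # rasterized sparse Lidar depth
    for name, tensor in model(rgb, sparse_depth).items():
        print(name, tuple(tensor.shape))
```

The sketch shows the key design choice the abstract implies: modality-specific encoders feed a shared trunk, so one forward pass serves all four tasks instead of running four single-task networks.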