We propose a novel multi-task learning system that combines appearance and motion cues for better semantic reasoning about the environment. A unified architecture for joint vehicle detection and motion segmentation is introduced, in which a two-stream encoder is shared between both tasks. To evaluate our method in an autonomous driving setting, we use KITTI sequences annotated with detection and odometry ground truth to automatically generate static/dynamic annotations on the vehicles. We call this dataset the KITTI Moving Object Detection dataset (KITTI MOD); it will be made publicly available as a benchmark for the motion detection task. Our experiments show that the proposed method outperforms state-of-the-art methods that use the motion cue alone by 21.5% in mAP on KITTI MOD. Our method performs on par with state-of-the-art unsupervised methods on the DAVIS benchmark for generic object segmentation. One of our interesting conclusions is that joint training of motion segmentation and vehicle detection benefits motion segmentation: motion segmentation has relatively little annotated data compared to the detection task, but the shared fusion encoder benefits from joint training to learn a more generalized representation. The proposed method runs in 120 ms per frame, which surpasses the state of the art in motion detection/segmentation in computational efficiency.
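To make the shared-encoder idea concrete, the following is a minimal sketch, assuming a PyTorch implementation, of a two-stream (appearance and motion) network whose fused features feed both a vehicle-detection head and a motion-segmentation head. The module names, channel widths, grid-style detection head, and input shapes are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a two-stream, multi-task network: appearance (RGB) and
# motion (optical flow) streams, a shared fusion encoder, and two task heads.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions followed by 2x downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class TwoStreamMultiTaskNet(nn.Module):
    """Hypothetical joint detection + motion-segmentation architecture."""

    def __init__(self, num_classes=2, num_anchors=9):
        super().__init__()
        # Separate shallow streams for appearance (RGB) and motion (flow image).
        self.appearance_stream = conv_block(3, 32)
        self.motion_stream = conv_block(3, 32)
        # Fusion encoder shared by both tasks.
        self.shared_encoder = nn.Sequential(
            conv_block(64, 128),
            conv_block(128, 256),
        )
        # Detection head: per-location anchor scores and box offsets
        # (an SSD/SqueezeDet-style grid head is assumed here).
        self.det_head = nn.Conv2d(256, num_anchors * (num_classes + 4), 1)
        # Motion-segmentation head: upsample back to input resolution
        # and predict static/dynamic per pixel.
        self.seg_head = nn.Sequential(
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 2, 1),
        )

    def forward(self, rgb, flow):
        fused = torch.cat(
            [self.appearance_stream(rgb), self.motion_stream(flow)], dim=1
        )
        features = self.shared_encoder(fused)
        return self.det_head(features), self.seg_head(features)


if __name__ == "__main__":
    net = TwoStreamMultiTaskNet()
    rgb = torch.randn(1, 3, 128, 384)   # camera image
    flow = torch.randn(1, 3, 128, 384)  # optical flow rendered as an image
    det_out, seg_out = net(rgb, flow)
    print(det_out.shape, seg_out.shape)  # grid detections, dense motion logits
```

In this sketch, both task losses would back-propagate into the shared fusion encoder, which is the mechanism by which the data-rich detection task can regularize the data-poor motion-segmentation task during joint training.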