Holistically understanding an object and its 3D movable parts through visual perception models is essential for enabling an autonomous agent to interact with the world. In autonomous driving, the dynamics and states of vehicle parts such as doors, the trunk, and the bonnet provide meaningful semantic information and interaction states, which are critical to ensuring the safety of the self-driving vehicle. Existing visual perception models mainly focus on coarse parsing, such as object bounding-box detection or pose estimation, and rarely tackle these situations. In this paper, we address this important autonomous driving problem by solving three critical issues. First, to deal with data scarcity, we propose an effective training-data generation process that fits a 3D car model with dynamic parts to vehicles in real images and then reconstructs vehicle-human interaction (VHI) scenarios. Our approach is fully automatic, requiring no human intervention, and can generate a large number of vehicles in uncommon states (VUS) for training deep neural networks (DNNs). Second, to perform fine-grained vehicle perception, we present a multi-task network for VUS parsing and a multi-stream network for VHI parsing. Third, to quantitatively evaluate the effectiveness of our data augmentation approach, we build the first VUS dataset captured in real traffic scenarios (e.g., getting in/out of a vehicle or placing/removing luggage). Experimental results show that our approach outperforms baseline methods in 2D detection and instance segmentation by a large margin (over 8%). In addition, our network yields large improvements in discovering and understanding these uncommon cases. The source code, the dataset, and the trained model are released on GitHub (https://github.com/zongdai/EditingForDNN).
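To make the data generation idea concrete, below is a minimal, hedged sketch of the final compositing step of such a 3D-part-guided augmentation pipeline: a rendered crop of a car model with an articulated part (e.g., an open door), already posed to match a vehicle in a real image, is alpha-blended onto the background and its alpha channel is reused as an instance-mask label. This is an illustrative assumption of how the step could look, not the released implementation; the function name, arguments, and the existence of a pre-rendered RGBA crop are all hypothetical.

```python
# Illustrative sketch only: assumes an external renderer has already produced an
# RGBA crop of the articulated car model at the pose estimated from the real image.
# Not the API of the released EditingForDNN code.
import numpy as np
from PIL import Image


def composite_vehicle(background_path, render_rgba_path, top_left):
    """Alpha-blend a rendered vehicle crop onto a real traffic image and return
    the augmented image plus a binary instance mask usable as a training label.
    Assumes the crop lies fully inside the background image."""
    bg = np.asarray(Image.open(background_path).convert("RGB"), dtype=np.float32)
    fg = np.asarray(Image.open(render_rgba_path).convert("RGBA"), dtype=np.float32)

    x, y = top_left                      # pixel location of the rendered crop
    h, w = fg.shape[:2]
    alpha = fg[..., 3:4] / 255.0         # per-pixel opacity from the renderer

    region = bg[y:y + h, x:x + w]
    bg[y:y + h, x:x + w] = alpha * fg[..., :3] + (1.0 - alpha) * region

    mask = np.zeros(bg.shape[:2], dtype=np.uint8)
    mask[y:y + h, x:x + w] = (alpha[..., 0] > 0.5).astype(np.uint8)

    return Image.fromarray(bg.astype(np.uint8)), mask
```

In the paper's setting, the rendered crop would come from fitting the 3D car model to a detected vehicle and articulating a part (door, trunk, bonnet) before rendering; the mask returned here stands in for the automatically generated annotations used to train the VUS/VHI parsing networks.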