We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS). Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters. First, to deal with multi-view data scarcity, we propose a part-assisted novel view synthesis algorithm for data augmentation: we train a part-based texture inpainting network in a self-supervised manner, then render the textured model into the background image at the target 6-DoF pose. Second, to handle various camera parameters, we present a new method that produces dense mappings between image pixels and 3D points for robust 2D/3D vehicle parsing. Third, we build the first CVIS dataset for benchmarking, with annotations for more than 1,540 images (14,017 instances) from real-world traffic scenarios. Combining these novel algorithms and the new dataset, we develop a robust approach to 2D/3D vehicle parsing for CVIS. In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation by 4.5%, 4.3%, and 2.9%, respectively. More details and results are included in the supplement. To facilitate future research, we will release the source code and the dataset on GitHub.
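To illustrate the second contribution, the sketch below shows one standard way that dense pixel-to-3D-point correspondences can be converted into a 6-DoF vehicle pose. It is not the paper's actual pipeline: the solver choice (OpenCV's PnP with RANSAC), the function name `pose_from_dense_mapping`, and the assumption of known, undistorted intrinsics `K` are all hypothetical simplifications made for this example.

```python
# Minimal sketch: recover a 6-DoF pose from dense 2D-3D correspondences.
# Assumes known camera intrinsics K; the paper itself targets arbitrary,
# unknown camera parameters, so treat this only as an illustrative recipe.
import numpy as np
import cv2

def pose_from_dense_mapping(pixels_2d, points_3d, K):
    """pixels_2d: (N, 2) image coordinates predicted for a vehicle.
    points_3d: (N, 3) corresponding points on a canonical vehicle model.
    K: (3, 3) camera intrinsic matrix.
    Returns a rotation matrix R (3x3) and a translation vector t (3,)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        pixels_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,          # assume undistorted pixels in this sketch
        reprojectionError=3.0,    # RANSAC inlier threshold in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("PnP failed: too few reliable correspondences")
    R, _ = cv2.Rodrigues(rvec)    # axis-angle vector -> rotation matrix
    return R, tvec.ravel()
```

The appeal of a dense mapping over a handful of sparse keypoints is robustness: with hundreds of correspondences per instance, RANSAC can discard outlier predictions and still recover a stable pose under occlusion or unusual viewpoints.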