We tackle the new problem of multi-view camera and subject registration in bird's eye view (BEV) without pre-given camera calibration. This problem is very challenging since the only input is several RGB images of a multi-person scene taken from different first-person views (FPVs), without a BEV image or calibration of the FPVs, while the output is a unified plane containing the localization and orientation of both the subjects and the cameras in BEV. We propose an end-to-end framework to solve this problem, whose main idea can be divided into the following parts: i) a view-transform subject detection module that transforms each FPV into a virtual BEV, including the localization and orientation of every pedestrian; ii) a geometric-transformation-based method that estimates camera localization and view direction, i.e., camera registration in a unified BEV; iii) the use of spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. Experimental results show the remarkable effectiveness of the proposed method.
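To make the registration idea concrete: placing per-view detections onto one unified ground plane can, in the simplest setting, be posed as estimating a 2D rigid transform (rotation plus translation) between matched point sets from two views. The sketch below is an illustration of that generic geometric step using the standard Kabsch/Procrustes least-squares solution, not the paper's actual pipeline; all names here are hypothetical.

```python
import numpy as np

def rigid_transform_2d(src, dst):
    """Least-squares 2D rotation R and translation t with dst ~= R @ src + t.

    src, dst: (N, 2) arrays of matched BEV subject positions from two views.
    Uses the Kabsch/Procrustes SVD solution on centered point sets.
    """
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    src_c = src - src_mean
    dst_c = dst - dst_mean
    # Cross-covariance between the centered point sets.
    H = src_c.T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

Given such a transform per camera, each view's detections (and the camera's own pose) can be mapped into the shared BEV plane; in practice the correspondences between views are unknown, which is where the spatial and appearance cues described above come in.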