Multi-view crowd counting has been previously proposed to utilize multiple cameras to extend the field-of-view of a single camera, capturing more people in the scene and improving counting performance for occluded people or those at low resolution. However, the current multi-view paradigm trains and tests on the same single scene and camera views, which limits its practical applicability. In this paper, we propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where training and testing occur on different scenes with arbitrary camera layouts. To dynamically handle the challenges of optimal view fusion under changing scenes and camera layouts, and of non-correspondence noise caused by camera calibration errors or erroneous features, we propose a CVCS model that attentively selects and fuses multiple views using camera layout geometry, together with a noise-view regularization method that trains the model to handle non-correspondence errors. We also generate a large synthetic multi-camera crowd counting dataset with a large number of scenes and camera views, capturing many possible variations and avoiding the difficulty of collecting and annotating such a large real dataset. We then test our trained CVCS model on real multi-view counting datasets via unsupervised domain transfer. The proposed CVCS model trained on synthetic data outperforms the same model trained only on real data, and achieves promising performance compared to fully supervised methods that train and test on the same single scene.
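The attentive view selection and fusion described above can be illustrated with a minimal sketch. All names, shapes, and the scoring parameterization here are illustrative assumptions, not the authors' implementation: each camera view contributes a feature map projected to a common ground plane, a per-view attention weight is computed from a camera-layout embedding, and the fused map is the attention-weighted sum over views.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_feats, layout_embed, w):
    """Attention-weighted fusion of per-view ground-plane feature maps.

    view_feats:   (V, C, H, W) features projected to a common ground plane
    layout_embed: (V, D) per-camera layout/geometry embeddings
    w:            (D,) scoring vector (hypothetical parameterization)
    """
    scores = layout_embed @ w            # (V,) one score per camera view
    attn = softmax(scores, axis=0)       # normalize attention across views
    # Weighted sum over the view axis -> a single fused (C, H, W) map
    fused = np.tensordot(attn, view_feats, axes=(0, 0))
    return fused, attn

# Toy example: 3 views, 8-channel 4x4 feature maps, 5-dim layout embeddings
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8, 4, 4))
layout = rng.normal(size=(3, 5))
w = rng.normal(size=5)
fused, attn = fuse_views(feats, layout, w)
```

In the paper's setting the attention weights would additionally vary spatially and be learned end-to-end; the sketch only shows the core idea that fusion weights are driven by camera-layout geometry rather than fixed per scene.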