In the current worldwide situation, pedestrian detection has reemerged as a pivotal tool for intelligent video-based systems that aim to solve tasks such as pedestrian tracking, social distancing monitoring, or pedestrian mass counting. Pedestrian detection methods, even the top-performing ones, are highly sensitive to occlusions among pedestrians, which dramatically degrade their performance in crowded scenarios. The widespread adoption of multi-camera set-ups makes it possible to handle occlusions better by combining information from different viewpoints. In this paper, we present a multi-camera approach that globally combines pedestrian detections by leveraging automatically extracted scene context. Unlike most state-of-the-art methods, the proposed approach is scene-agnostic: it requires no tailored adaptation to the target scenario, e.g., via fine-tuning. Consequently, it needs no ad hoc training with labelled data, which expedites the deployment of the proposed method in real-world situations. Context information, obtained via semantic segmentation, is used 1) to automatically generate a common Area of Interest for the scene and all the cameras, avoiding the usual need to define it manually; and 2) to obtain detections for each camera by solving a global optimization problem that maximizes the coherence of detections both in each 2D image and in the 3D scene. This process yields tightly fitted bounding boxes that overcome occlusions and missed detections. Experimental results on five publicly available datasets show that the proposed approach outperforms state-of-the-art multi-camera pedestrian detectors, even some specifically trained on the target scenario, demonstrating the versatility and robustness of the proposed method without requiring ad hoc annotations or human-guided configuration.
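To make the first use of scene context more concrete, the following is a minimal sketch of how a common Area of Interest could be derived from per-camera semantic segmentation. It is not the paper's implementation. It assumes that each camera provides a binary mask of pixels labelled as walkable ground and a known 3x3 image-to-ground-plane homography, and that the Area of Interest is the set of ground-plane grid cells supported by at least `min_views` cameras. The names `project_to_ground`, `common_aoi`, `cell_size`, and `min_views` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # assumed 3x3 homography: image -> ground plane
Mask = Sequence[Sequence[bool]]     # True where a pixel is labelled as walkable ground


def project_to_ground(H: Matrix, u: float, v: float) -> Optional[Tuple[float, float]]:
    """Map image pixel (u, v) to ground-plane coordinates using homography H.

    Returns None when the point maps to infinity (homogeneous scale near zero).
    """
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        return None
    return x / w, y / w


def common_aoi(
    masks: Sequence[Mask],
    homographies: Sequence[Matrix],
    grid_shape: Tuple[int, int],
    cell_size: float = 1.0,
    min_views: int = 1,
) -> List[List[bool]]:
    """Vote ground pixels from every camera into a shared ground-plane grid.

    A cell belongs to the Area of Interest when at least `min_views` cameras
    project ground pixels into it (an illustrative rule, not the paper's).
    """
    rows, cols = grid_shape
    votes = [[0] * cols for _ in range(rows)]
    for mask, H in zip(masks, homographies):
        seen = set()  # each camera votes at most once per cell
        for v, row in enumerate(mask):
            for u, is_ground in enumerate(row):
                if not is_ground:
                    continue
                point = project_to_ground(H, u, v)
                if point is None:
                    continue
                x, y = point
                r, c = int(y // cell_size), int(x // cell_size)
                if 0 <= r < rows and 0 <= c < cols:
                    seen.add((r, c))
        for r, c in seen:
            votes[r][c] += 1
    return [[count >= min_views for count in row] for row in votes]


# Toy usage: two cameras whose ground regions partially overlap.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ground_mask = [[True, True, False], [False, False, False]]
aoi_any = common_aoi([ground_mask, ground_mask], [identity, shift_x], (2, 3))
aoi_both = common_aoi([ground_mask, ground_mask], [identity, shift_x], (2, 3), min_views=2)
```

With `min_views=1` the result is the union of the regions each camera sees as ground, and with a higher value it shrinks towards the region that several cameras agree on. Which rule is appropriate depends on the camera layout and is left open here.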