Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview system, we need to answer two important questions when dealing with ambiguities that arise from occlusions. First, how should we aggregate cues from the multiple views? Second, how should we aggregate unreliable 2D and 3D spatial information that has been tainted by occlusions? To address these questions, we propose a novel multiview detection system, MVDet. For multiview aggregation, existing methods combine anchor box features from the image plane, which potentially limits performance due to inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free approach and aggregate multiview information by projecting feature maps onto the ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we apply large kernel convolutions on the ground plane feature map and infer locations from detection peaks. Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the previous state of the art by 14.1%. We also provide a detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and the MultiviewX dataset are available at https://github.com/hou-yz/MVDet.
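The aggregation idea described above can be illustrated with a toy sketch. It is not the MVDet implementation: it assumes each camera is related to the ground plane by a known 3x3 homography, uses single-channel response maps instead of learned multi-channel features, and replaces the learned large kernel convolutions with a fixed averaging filter. All function names, the homographies, and the toy data are hypothetical.

```python
# Toy sketch of anchor-free ground-plane aggregation (assumptions throughout;
# a real system would use calibrated cameras, learned features, and learned kernels).

def apply_homography(H, x, y):
    """Map a point (x, y) through a 3x3 homography H given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def project_to_ground(feature_map, ground_to_image, ground_h, ground_w):
    """Inverse-warp an image-plane map onto the ground grid (nearest neighbour).

    Ground cells that map outside the image are left at 0.
    """
    img_h, img_w = len(feature_map), len(feature_map[0])
    ground = [[0.0] * ground_w for _ in range(ground_h)]
    for gy in range(ground_h):
        for gx in range(ground_w):
            u, v = apply_homography(ground_to_image, gx, gy)
            col, row = round(u), round(v)
            if 0 <= row < img_h and 0 <= col < img_w:
                ground[gy][gx] = feature_map[row][col]
    return ground

def aggregate_views(ground_maps):
    """Sum the per-view ground-plane maps cell by cell."""
    h, w = len(ground_maps[0]), len(ground_maps[0][0])
    return [[sum(m[y][x] for m in ground_maps) for x in range(w)] for y in range(h)]

def box_filter(ground, k):
    """Zero-padded k x k averaging filter, a stand-in for a learned large kernel."""
    h, w = len(ground), len(ground[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += ground[yy][xx]
            out[y][x] = total / (k * k)
    return out

def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2D grid."""
    cells = ((y, x) for y in range(len(grid)) for x in range(len(grid[0])))
    return max(cells, key=lambda p: grid[p[0]][p[1]])

def blob(h, w, row, col):
    """Toy response: 1.0 at the centre and 0.5 at its four neighbours."""
    m = [[0.0] * w for _ in range(h)]
    m[row][col] = 1.0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        m[row + dy][col + dx] = 0.5
    return m

# Two hypothetical views of the same 8 x 8 ground grid.
H_A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # view A: ground cell == image pixel
H_B = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]  # view B: image is shifted one column
view_a = blob(8, 8, 4, 4)
view_a[1][6] = 0.8                       # spurious response seen by view A only
view_b = blob(8, 8, 4, 5)                # same ground location, seen in view B

ground_maps = [project_to_ground(view_a, H_A, 8, 8),
               project_to_ground(view_b, H_B, 8, 8)]
aggregated = aggregate_views(ground_maps)
smoothed = box_filter(aggregated, 3)
peak = argmax_2d(smoothed)
print("detection peak (row, col):", peak)
```

In this toy setup the location supported by both views ends up with the strongest response after smoothing, while the response seen by only one view stays weak. No anchor boxes are involved at any step, which is the point of the ground-plane formulation.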