Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which effectively exploits long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, the equipped multi-focus mechanism enables our GNA to capture the scale variation of human heads well. Based on the multi-focus GNA, we develop a unified neural network called GNANet that accurately locates head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd, which consists of 60K+ frames captured in various surveillance scenarios and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets, including our VSCrowd, and the results show that the proposed method achieves state-of-the-art performance in both video crowd localization and counting.
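The core idea of Gaussian neighborhood attention can be sketched as standard dot-product attention modulated by a Gaussian spatial prior over token positions, with several bandwidths ("focuses") fused to handle scale variation. The sketch below is a minimal toy illustration of that idea, not the paper's exact formulation; the function name, the averaging-based fusion, and all shapes are assumptions for illustration.

```python
import numpy as np

def multi_focus_gna(q, k, v, coords, sigmas):
    """Toy sketch of multi-focus Gaussian neighborhood attention.

    q, k, v: (N, d) query/key/value features over N spatial positions.
    coords:  (N, 2) 2-D coordinate of each position.
    sigmas:  Gaussian bandwidths; each defines a soft spatial
             neighborhood (a "focus"), and the per-focus outputs
             are averaged (fusion scheme assumed for illustration).
    """
    d = q.shape[1]
    # Pairwise squared distances between positions, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(-1)
    # Standard scaled dot-product attention scores.
    logits = q @ k.T / np.sqrt(d)
    outs = []
    for s in sigmas:
        # Gaussian spatial prior: down-weights far-away key positions,
        # preserving the local spatial topology of the input.
        prior = np.exp(-dist2 / (2.0 * s ** 2))
        w = np.exp(logits - logits.max(-1, keepdims=True)) * prior
        w = w / w.sum(-1, keepdims=True)  # normalize attention weights
        outs.append(w @ v)
    # Fuse the multi-focus outputs (simple mean here, an assumption).
    return np.mean(outs, axis=0)
```

With a very small bandwidth the Gaussian prior collapses attention onto each position itself, while larger bandwidths admit longer-range correspondences; mixing several bandwidths is what lets the mechanism respond to heads of different apparent sizes.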