We study video crowd counting, which aims to estimate the number of objects (people in this paper) in all the frames of a video sequence. Previous work on crowd counting has mostly focused on still images; little attention has been paid to properly extracting and exploiting the short- and long-range spatial-temporal correlations between neighboring frames to achieve high estimation accuracy on video sequences. In this work, we propose Monet, a novel and highly accurate motion-guided non-local spatial-temporal network for video crowd counting. Monet first takes people flow (motion information) as guidance to coarsely segment the regions of pixels where a person may appear. Given these regions, Monet then uses a non-local spatial-temporal network to extract both short- and long-range contextual information across space and time. The whole network is trained end-to-end with a fused loss to generate a high-quality density map. Noting the scarcity and low quality (in terms of resolution and scene diversity) of the publicly available video crowd datasets, we have collected and built a large-scale video crowd counting dataset, VidCrowd, to contribute to the community. VidCrowd contains 9,000 high-resolution (2560 × 1440) frames with 1,150,239 head annotations, captured in two cities under diverse scenes, crowd densities, and lighting conditions. We have conducted extensive experiments on the challenging VidCrowd and two public video crowd counting datasets: UCSD and Mall. Our approach achieves substantially better performance in terms of MAE and MSE as compared with other state-of-the-art approaches.
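To make the non-local spatial-temporal step concrete, below is a minimal sketch of a generic non-local block applied to a stack of frame features, in the spirit of the non-local neural networks of Wang et al. (2018). The class name NonLocalSpatialTemporal, the channel sizes, and the embedded-Gaussian attention form are illustrative assumptions, not necessarily Monet's exact design; the key point is that every position attends to every other position across all frames, capturing both short- and long-range context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalSpatialTemporal(nn.Module):
    """Minimal non-local block over a (B, C, T, H, W) feature volume.

    Hypothetical sketch: follows the generic embedded-Gaussian
    non-local formulation, not Monet's exact configuration.
    """

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2  # reduced embedding dimension
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        n = t * h * w  # every spatial-temporal position is a token
        q = self.theta(x).view(b, -1, n)  # (B, C', N)
        k = self.phi(x).view(b, -1, n)    # (B, C', N)
        v = self.g(x).view(b, -1, n)      # (B, C', N)
        # Pairwise affinities across all positions and all frames.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, N, N)
        # Aggregate values by attention, restore the 5-D layout.
        y = torch.bmm(v, attn.transpose(1, 2)).view(b, -1, t, h, w)
        return x + self.out(y)  # residual connection


if __name__ == "__main__":
    block = NonLocalSpatialTemporal(channels=64)
    frames = torch.randn(1, 64, 5, 24, 32)  # features of 5 neighboring frames
    print(block(frames).shape)  # torch.Size([1, 64, 5, 24, 32])
```

In such a block, the attention matrix couples pixels regardless of their spatial or temporal distance, which is what distinguishes it from purely convolutional (local) aggregation over neighboring frames.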