In this paper, we propose a novel SpatioTemporal convolutional Dense Network (STDNet) for video-based crowd counting, which combines the decomposition of 3D convolutions with 3D spatiotemporal dilated dense convolutions to alleviate the rapid growth in model size caused by Conv3D layers. Moreover, since dilated convolutions extract multiscale features, we combine them with a channel attention block to enhance the feature representations. Because accurately labeling crowds is difficult, especially in videos, imprecise or standard-inconsistent labels may lead to poor convergence of the model. To address this issue, we further propose a new patch-wise regression loss (PRL) that improves upon the original pixel-wise loss. Experimental results on three video-based benchmarks, i.e., the UCSD, Mall, and WorldExpo'10 datasets, show that STDNet outperforms both image- and video-based state-of-the-art methods. The source code is released at \url{https://github.com/STDNet/STDNet}.
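To make the two key ideas in the abstract concrete, the following minimal sketches illustrate them under stated assumptions; they are not the authors' implementation. The first sketch shows the general idea of decomposing a 3D convolution into a dilated 2D spatial convolution followed by a 1D temporal convolution, which reduces parameters relative to a full 3D kernel. The module name, kernel sizes, and dilation rate are illustrative assumptions, not STDNet's exact design.

\begin{verbatim}
# Hypothetical sketch: factorized spatiotemporal dilated convolution.
# Assumption: a (3,3,3) Conv3D is replaced by a (1,3,3) spatial conv
# (with spatial dilation) followed by a (3,1,1) temporal conv.
import torch
import torch.nn as nn


class FactorizedDilatedConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2):
        super().__init__()
        # (1, 3, 3) kernel: dilated spatial convolution applied per frame.
        self.spatial = nn.Conv3d(
            in_ch, out_ch, kernel_size=(1, 3, 3),
            padding=(0, dilation, dilation),
            dilation=(1, dilation, dilation))
        # (3, 1, 1) kernel: temporal convolution across adjacent frames.
        self.temporal = nn.Conv3d(
            out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) video feature map.
        return self.relu(self.temporal(self.relu(self.spatial(x))))
\end{verbatim}

For reference, a full $k \times k \times k$ 3D kernel needs $k^3 C_{\mathrm{in}} C_{\mathrm{out}}$ weights, while the factorized pair needs roughly $(k^2 + k) C_{\mathrm{in}} C_{\mathrm{out}}$, which is the parameter saving the abstract refers to. The second sketch illustrates one plausible reading of the patch-wise regression loss (PRL): comparing patch-level density sums instead of per-pixel values, so that small annotation offsets within a patch are not penalized. The class name and patch size are hypothetical.

\begin{verbatim}
# Hypothetical sketch of a patch-wise regression loss (PRL).
# Assumption: the loss is an MSE over per-patch density sums rather
# than over individual pixels of the density map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchwiseRegressionLoss(nn.Module):
    def __init__(self, patch_size: int = 8):
        super().__init__()
        self.patch_size = patch_size

    def forward(self, pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        # pred, gt: (N, 1, H, W) predicted and ground-truth density maps.
        # Average pooling times the patch area gives each patch's count.
        k = self.patch_size
        pred_patch = F.avg_pool2d(pred, kernel_size=k) * (k * k)
        gt_patch = F.avg_pool2d(gt, kernel_size=k) * (k * k)
        return F.mse_loss(pred_patch, gt_patch)


if __name__ == "__main__":
    pred = torch.rand(2, 1, 64, 64)
    gt = torch.rand(2, 1, 64, 64)
    print(PatchwiseRegressionLoss(patch_size=8)(pred, gt).item())
\end{verbatim}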