In real-world crowd counting applications, crowd density varies greatly within an image. When faced with density variation, humans tend to locate and count individual targets in low-density regions and to estimate the number in high-density regions. We observe that CNNs focus on local information correlation using fixed-size convolution kernels, whereas Transformers can effectively extract semantic crowd information through the global self-attention mechanism. Thus, a CNN can locate and count the crowd accurately in low-density regions but struggles to perceive density properly in high-density regions. In contrast, a Transformer is highly reliable in high-density regions but fails to locate targets in sparse regions. Neither a CNN nor a Transformer alone can deal well with this kind of density variation. To address this problem, we propose a CNN and Transformer Adaptive Selection Network (CTASNet), which adaptively selects the appropriate counting branch for regions of different density. First, CTASNet generates the prediction results of the CNN and Transformer branches. Then, considering that the CNN and the Transformer are appropriate for low-density and high-density regions respectively, a density-guided Adaptive Selection Module is designed to automatically combine the predictions of the two branches. Moreover, to reduce the influence of annotation noise, we introduce a Correntropy-based Optimal Transport loss. Extensive experiments on four challenging crowd counting datasets validate the proposed method.
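The idea of density-guided selection can be illustrated with a minimal sketch. It is not the Adaptive Selection Module itself, whose design is not specified here. It assumes that each branch outputs a density map of the same shape, that the mean of the two predictions serves as a rough local density estimate, and that a sigmoid gate blends the branches. The names `adaptive_select`, `total_count`, `threshold`, and `sharpness` are hypothetical.

```python
import math


def adaptive_select(cnn_map, trans_map, threshold=1.0, sharpness=4.0):
    """Blend two density maps cell by cell (illustrative assumption only).

    Cells with low estimated density lean on the CNN prediction;
    cells with high estimated density lean on the Transformer prediction.
    """
    fused = []
    for cnn_row, trans_row in zip(cnn_map, trans_map):
        fused_row = []
        for c, t in zip(cnn_row, trans_row):
            # Crude local density estimate: mean of both branch predictions.
            density = 0.5 * (c + t)
            # Soft gate in (0, 1): near 0 for sparse cells, near 1 for dense cells.
            w = 1.0 / (1.0 + math.exp(-sharpness * (density - threshold)))
            fused_row.append((1.0 - w) * c + w * t)
        fused.append(fused_row)
    return fused


def total_count(density_map):
    """The predicted crowd count is the sum over the density map."""
    return sum(sum(row) for row in density_map)


# Hypothetical branch outputs: one sparse cell and one dense cell.
cnn_pred = [[0.1, 5.0]]
trans_pred = [[0.3, 9.0]]
fused_pred = adaptive_select(cnn_pred, trans_pred)
```

In this sketch the sparse cell stays close to the CNN value and the dense cell is taken almost entirely from the Transformer value. A learned selection module would replace the hand-set gate.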