Most recent methods for crowd counting are based on convolutional neural networks (CNNs), which have a strong ability to extract local features. However, CNNs inherently struggle to model the global context due to their limited receptive fields, whereas transformers can model the global context easily. In this paper, we propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we utilize a pyramid vision transformer backbone to capture global crowd information, a pyramid feature aggregation (PFA) module to combine low-level and high-level features, and an efficient regression head with multi-scale dilated convolution (MDC) to predict density maps. In addition, we tailor the loss functions for our pipeline. Without bells and whistles, extensive experiments demonstrate that our method achieves new state-of-the-art results on several benchmarks for both weakly- and fully-supervised crowd counting. Moreover, we currently rank No. 1 on the leaderboard of NWPU-Crowd. Our code will be made available.
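To make the described pipeline concrete, the following is a minimal PyTorch sketch of the feature-fusion and regression stages (backbone features → PFA fusion → MDC head → density map). The class names, channel widths, and dilation rates here are illustrative assumptions, not the authors' exact implementation; any pyramid transformer backbone producing multi-scale feature maps could feed the `feats` list.

```python
# Hypothetical sketch of a PFA-style fusion module and an MDC-style regression head.
# Layer names, channel sizes, and dilation rates are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PFA(nn.Module):
    """Pyramid feature aggregation: project each level to a common width,
    upsample to the highest resolution, and sum."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats: list of pyramid features, ordered from high to low resolution.
        target_size = feats[0].shape[-2:]
        fused = 0
        for proj, f in zip(self.projs, feats):
            f = proj(f)
            fused = fused + F.interpolate(
                f, size=target_size, mode="bilinear", align_corners=False
            )
        return fused


class MDCHead(nn.Module):
    """Regression head with parallel dilated 3x3 convolutions at several rates,
    followed by a 1x1 convolution that predicts a single-channel density map."""
    def __init__(self, in_channels=256, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, in_channels, 3, padding=d, dilation=d) for d in dilations]
        )
        self.predict = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):
        x = sum(F.relu(branch(x)) for branch in self.branches)
        return F.relu(self.predict(x))  # densities are non-negative


if __name__ == "__main__":
    # Three hypothetical pyramid levels from a transformer backbone.
    feats = [
        torch.randn(1, 96, 96, 96),
        torch.randn(1, 192, 48, 48),
        torch.randn(1, 384, 24, 24),
    ]
    density = MDCHead()(PFA([96, 192, 384])(feats))
    print(density.shape)  # torch.Size([1, 1, 96, 96])
```

The estimated crowd count would then be obtained by summing the predicted density map over its spatial dimensions, as is standard in density-map-based counting.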