Significant progress on the crowd counting problem has been achieved by integrating larger context into convolutional neural networks (CNNs). This indicates that global scene context is essential, despite the seemingly bottom-up nature of the problem. This may be explained by the fact that contextual knowledge can adapt local feature extraction to a given scene and thereby improve it. In this paper, we therefore investigate the role of global context for crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by the class token used in transformer-based classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches throughout the transformer layers. Because transformers do not explicitly model the tried-and-true channel-wise interactions, we propose a token-attention module (TAM) that recalibrates the encoded features through channel-wise attention informed by the context token. Beyond that, the context token is used to predict the total person count of the image through a regression-token module (RTM). Extensive experiments demonstrate that our method achieves state-of-the-art performance on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU. On the large-scale JHU-CROWD++ dataset, our method improves over the previous best results by 26.9% and 29.9% in terms of MAE and MSE, respectively.
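To make the two modules concrete, below is a minimal sketch, assuming a PyTorch implementation: the module names TAM and RTM follow the abstract, while the reduction ratio, feature dimensions, and exact layer layout are illustrative assumptions rather than the authors' architecture. The TAM applies squeeze-and-excitation-style channel reweighting driven by the context token; the RTM regresses the total count from that same token.

```python
import torch
import torch.nn as nn

class TAM(nn.Module):
    """Token-attention module (sketch): the context token produces
    channel-wise attention weights that recalibrate the encoded
    patch features, in the spirit of squeeze-and-excitation."""
    def __init__(self, dim, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, patch_tokens, context_token):
        # patch_tokens: (B, N, C); context_token: (B, C)
        weights = self.mlp(context_token).unsqueeze(1)  # (B, 1, C)
        return patch_tokens * weights  # channel-wise recalibration

class RTM(nn.Module):
    """Regression-token module (sketch): regress the total person
    count of the image directly from the context token."""
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, context_token):
        # context_token: (B, C) -> (B,) predicted count
        return self.head(context_token).squeeze(-1)

# Toy usage with illustrative sizes (B batches, N patch tokens, C channels).
B, N, C = 2, 196, 768
patches, ctx = torch.randn(B, N, C), torch.randn(B, C)
recalibrated = TAM(C)(patches, ctx)  # (B, N, C)
count = RTM(C)(ctx)                  # (B,)
```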