Crowd localization, which predicts the position of each head, is a more practical and higher-level task than simple counting. Existing methods rely on pseudo-bounding boxes or pre-designed localization maps and need complex post-processing to obtain head positions. In this paper, we propose an elegant, end-to-end Crowd Localization Transformer, named CLTR, that solves the task in a regression-based paradigm. The proposed method treats crowd localization as a direct set prediction problem, taking extracted features and trainable embeddings as input to the transformer decoder. To reduce ambiguous points and generate more reasonable matching results, we introduce a KMO-based Hungarian matcher, which uses nearby context as an auxiliary matching cost. Extensive experiments on five datasets under various data settings show the effectiveness of our method. In particular, the proposed method achieves the best localization performance on the NWPU-Crowd, UCF-QNRF, and ShanghaiTech Part A datasets.
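
The idea of a Hungarian matcher with a context-aware cost can be illustrated with a minimal sketch. The abstract does not give the exact cost used by CLTR, so everything below is an assumption made for illustration: the cost is taken to be point distance, plus the difference in mean distance to the k nearest neighbours as the "nearby context" term, minus the predicted confidence. The names `knn_mean_distance`, `match_with_context`, and `context_weight` are hypothetical. The assignment itself uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def knn_mean_distance(points, k):
    """Mean distance from each point to its k nearest neighbours in the same set."""
    k = min(k, len(points) - 1)
    if k < 1:
        # A single point has no neighbours, so its context is defined as 0 here.
        return np.zeros(len(points))
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting, column 0 is the zero self-distance, so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)


def match_with_context(pred_points, pred_conf, gt_points, k=1, context_weight=1.0):
    """One-to-one matching of predictions to ground-truth heads.

    Assumed cost (not taken from the paper):
    point distance + context_weight * |context difference| - confidence.
    """
    point_cost = np.linalg.norm(
        pred_points[:, None, :] - gt_points[None, :, :], axis=-1
    )
    context_cost = np.abs(
        knn_mean_distance(pred_points, k)[:, None]
        - knn_mean_distance(gt_points, k)[None, :]
    )
    cost = point_cost + context_weight * context_cost - pred_conf[:, None]
    # Solves the (possibly rectangular) assignment problem; surplus
    # predictions are simply left unmatched.
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))


# Toy example: three annotated heads, four predicted points.
gt = np.array([[10.0, 10.0], [12.0, 10.0], [50.0, 50.0]])
pred = np.array([[49.0, 51.0], [10.5, 10.0], [12.2, 9.8], [30.0, 30.0]])
conf = np.array([0.9, 0.8, 0.85, 0.1])
print(match_with_context(pred, conf, gt, k=1))  # [(0, 2), (1, 0), (2, 1)]
```

In this toy case the fourth prediction stays unmatched, which in a set-prediction setup would typically be treated as background during training.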