Localizing individuals in crowds matches the practical demands of subsequent high-level crowd analysis tasks better than counting alone. However, existing localization-based methods rely on intermediate representations (i.e., density maps or pseudo boxes) as learning targets, which is counter-intuitive and error-prone. In this paper, we propose a purely point-based framework for joint crowd counting and individual localization. For this framework, instead of merely reporting the absolute counting error at the image level, we propose a new metric, called density Normalized Average Precision (nAP), to provide a more comprehensive and more precise performance evaluation. Moreover, we design an intuitive solution under this framework, called Point to Point Network (P2PNet). P2PNet discards superfluous steps and directly predicts a set of point proposals to represent the heads in an image, consistent with how humans annotate them. Through thorough analysis, we show that the key step in implementing this idea is to assign optimal learning targets to these proposals. We therefore propose to perform this crucial association as a one-to-one matching using the Hungarian algorithm. P2PNet not only significantly surpasses state-of-the-art methods on popular counting benchmarks, but also achieves promising localization accuracy. The code will be available at: https://github.com/TencentYoutuResearch/CrowdCounting-P2PNet.
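As an illustration of the one-to-one association described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the matching cost is the plain Euclidean distance between a point proposal and an annotated head point (the abstract does not specify the cost), and it uses SciPy's `linear_sum_assignment` to solve the assignment problem. The function name `match_points` and the sample coordinates are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_points(proposals, targets):
    """One-to-one matching of point proposals to annotated points.

    proposals: (M, 2) array-like of predicted (x, y) points.
    targets:   (N, 2) array-like of annotated head points.
    Returns a list of (proposal_index, target_index) pairs.
    """
    proposals = np.asarray(proposals, dtype=float).reshape(-1, 2)
    targets = np.asarray(targets, dtype=float).reshape(-1, 2)
    if len(proposals) == 0 or len(targets) == 0:
        return []
    # Pairwise Euclidean distances, shape (M, N). Using distance alone as
    # the cost is an assumption of this sketch, not taken from the paper.
    cost = np.linalg.norm(proposals[:, None, :] - targets[None, :, :], axis=-1)
    # Solve the minimum-cost assignment; rectangular matrices are allowed,
    # so min(M, N) pairs are returned.
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]


# Hypothetical example: four proposals, two annotated heads.
proposals = [(10.0, 10.0), (52.0, 48.0), (90.0, 15.0), (11.0, 12.0)]
targets = [(50.0, 50.0), (12.0, 11.0)]

pairs = match_points(proposals, targets)
matched = {p for p, _ in pairs}
unmatched = [i for i in range(len(proposals)) if i not in matched]
print(pairs, unmatched)
```

In this example, proposals 0 and 3 both lie close to the second annotated point, but only the closer one (proposal 3) is matched. Each annotation receives exactly one proposal, and the leftover proposals could then be treated as background during training.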