Perspective distortions and crowd variations make crowd counting a challenging task in computer vision. To tackle it, many previous works have used multi-scale architectures in deep neural networks (DNNs). Multi-scale branches can be merged either directly (e.g. by concatenation) or under the guidance of proxies (e.g. attention) in the DNNs. Despite their prevalence, these combination methods are not sophisticated enough to handle the per-pixel performance discrepancy across multi-scale density maps. In this work, we redesign the multi-scale neural network by introducing a hierarchical mixture of density experts, which hierarchically merges multi-scale density maps for crowd counting. Within the hierarchical structure, an expert competition and collaboration scheme is presented to encourage contributions from all scales, and pixel-wise soft gating nets are introduced to provide pixel-wise soft weights for scale combinations in different hierarchies. The network is optimized using both the crowd density map and the local counting map, where the latter is obtained by local integration on the former. Optimizing both can be problematic because of their potential conflicts. We therefore introduce a new relative local counting loss based on relative count differences among hard-predicted local regions in an image, which proves to be complementary to the conventional absolute error loss on the density map. Experiments show that our method achieves state-of-the-art performance on five public datasets: ShanghaiTech, UCF_CC_50, JHU-CROWD++, NWPU-Crowd and Trancos.
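To make three of the ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It shows (1) a single level of pixel-wise soft gating, where a softmax over scales yields per-pixel weights for combining density maps, (2) a local counting map obtained by integrating the density map over non-overlapping regions, and (3) one possible reading of a relative local counting loss. The function names, the region size `k`, the use of a softmax, the choice of "hard" regions as those with the largest absolute count error, and the pairwise-difference form of the loss are all assumptions made for illustration; the abstract does not specify them, and the hierarchical structure and the expert competition and collaboration scheme are not modeled here.

```python
import numpy as np


def soft_gate_merge(density_maps, gate_logits):
    """Merge S density maps with pixel-wise soft weights (assumed: softmax over scales).

    density_maps, gate_logits: arrays of shape (S, H, W).
    Returns a merged density map of shape (H, W).
    """
    # Subtract the per-pixel maximum for numerical stability before exponentiating.
    z = gate_logits - gate_logits.max(axis=0, keepdims=True)
    weights = np.exp(z)
    weights = weights / weights.sum(axis=0, keepdims=True)  # sums to 1 over scales
    return (weights * density_maps).sum(axis=0)


def local_count_map(density, k):
    """Integrate a density map over non-overlapping k x k regions.

    density: array of shape (H, W) with H and W divisible by k.
    Returns an array of shape (H // k, W // k) holding per-region counts.
    """
    h, w = density.shape
    if h % k != 0 or w % k != 0:
        raise ValueError("H and W must be divisible by k")
    return density.reshape(h // k, k, w // k, k).sum(axis=(1, 3))


def relative_local_count_loss(pred_counts, gt_counts, num_hard=4):
    """Hypothetical relative loss over hard regions (illustrative only).

    Picks the num_hard regions with the largest absolute count error, then
    penalizes the mismatch between predicted and ground-truth pairwise
    count differences among those regions.
    """
    p = pred_counts.ravel()
    g = gt_counts.ravel()
    hard = np.argsort(-np.abs(p - g))[:num_hard]  # indices of the hardest regions
    dp = p[hard][:, None] - p[hard][None, :]  # predicted pairwise differences
    dg = g[hard][:, None] - g[hard][None, :]  # ground-truth pairwise differences
    return float(np.abs(dp - dg).mean())


# Usage: two constant "scale" maps merged with uniform gates, then pooled to local counts.
maps = np.stack([np.ones((4, 4)), 3.0 * np.ones((4, 4))])
merged = soft_gate_merge(maps, np.zeros((2, 4, 4)))
counts = local_count_map(merged, k=2)
```

Because the local counting map is a sum over regions, its total equals the total of the density map, so both maps describe the same overall count at different granularities. In this sketch the relative loss depends only on differences between regions, so a constant offset added to every predicted local count leaves it at zero, while an absolute error loss would still penalize that offset. This is one way the two losses could carry different information.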