Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few of them take advantage of temporal consistency in video sequences, and those that do impose only weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows, instead of directly regressing the densities themselves. This enables us to impose much stronger constraints that encode the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and a temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while delivering performance comparable to that of full supervision.
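To make the conservation constraint concrete, here is a minimal sketch in illustrative notation (the symbols below are ours, not necessarily the paper's): assuming the image is discretized into grid cells and that people move at most one cell between consecutive frames, let $m_j^t$ denote the number of people in cell $j$ at time $t$ and $f_{i,j}^t$ the flow of people from cell $i$ to cell $j$ between frames $t-1$ and $t$. Conservation then requires

\[
  m_j^t \;=\; \sum_{i \in \mathcal{N}(j)} f_{i,j}^{t} \;=\; \sum_{k \in \mathcal{N}(j)} f_{j,k}^{t+1},
\]

where $\mathcal{N}(j)$ is cell $j$ together with its spatial neighbors: every person counted in $j$ must have arrived from a nearby cell (or stayed put) and must leave toward a nearby cell (or stay). Densities regressed independently in each frame satisfy no such identity, which is why flow-based estimation can impose far stronger constraints than mere frame-to-frame smoothness.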