Pedestrian detection in crowd scenes poses a challenging problem due to the heuristically defined mapping from anchors to pedestrians and the conflict between NMS and highly overlapped pedestrians. The recently proposed end-to-end detectors (EDs), DETR and Deformable DETR, replace hand-designed components such as NMS and anchors with the transformer architecture, which eliminates duplicate predictions by computing all pairwise interactions between queries. Inspired by these works, we explore their performance on crowd pedestrian detection. Surprisingly, when compared with Faster-RCNN with FPN, the relative ranking is the opposite of that obtained on COCO. Furthermore, the bipartite matching used by EDs harms training efficiency because of the large number of ground-truth boxes in crowd scenes. In this work, we identify the underlying causes of EDs' poor performance and propose a new decoder to address them. Moreover, we design a mechanism, specifically for EDs, that leverages the less occluded visible parts of pedestrians, and achieve further improvements. A faster bipartite matching algorithm is also introduced to make ED training on crowd datasets more practical. The proposed detector, PED (Pedestrian End-to-end Detector), outperforms both previous EDs and the baseline Faster-RCNN on CityPersons and CrowdHuman. It also achieves performance comparable to state-of-the-art pedestrian detection methods. Code will be released soon.
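The abstract does not describe PED's faster matching algorithm. As background only, the following is a minimal sketch of the standard one-to-one (Hungarian) matching used by DETR-style detectors, which is the step whose cost grows with the number of ground-truth pedestrians. Assumptions: the helper names `match_cost` and `hungarian` are hypothetical; the cost keeps only a classification term and an L1 box term (DETR also adds a GIoU term), and the weights are illustrative. This is not the paper's proposed algorithm.

```python
from typing import List, Sequence


def match_cost(probs: Sequence[Sequence[float]],
               pred_boxes: Sequence[Sequence[float]],
               gt_labels: Sequence[int],
               gt_boxes: Sequence[Sequence[float]],
               w_cls: float = 1.0,
               w_l1: float = 5.0) -> List[List[float]]:
    """Hypothetical helper: (num_gt x num_queries) matching cost.

    Simplified DETR-style cost: negative class probability plus a weighted
    L1 box distance. The GIoU term used by DETR is omitted for brevity.
    """
    cost = []
    for label, gt in zip(gt_labels, gt_boxes):
        row = []
        for prob, box in zip(probs, pred_boxes):
            l1 = sum(abs(p - g) for p, g in zip(box, gt))
            row.append(-w_cls * prob[label] + w_l1 * l1)
        cost.append(row)
    return cost


def hungarian(cost: List[List[float]]) -> List[int]:
    """Minimum-cost one-to-one assignment (Kuhn-Munkres with potentials).

    Rows are ground truths, columns are queries; requires rows <= columns.
    Returns assign[i] = index of the query matched to ground truth i.
    Runs in O(n^2 * m) for n ground truths and m queries.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many queries as ground truths"
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 = free
    way = [0] * (m + 1)   # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:  # grow the alternating tree until a free column is reached
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):  # shift potentials by the smallest slack
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # flip matched/unmatched edges along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assign = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assign[p[j] - 1] = j - 1
    return assign


if __name__ == "__main__":
    # Two pedestrians, three queries, one class (label 0 = pedestrian).
    # Boxes are (cx, cy, w, h) in normalized image coordinates.
    probs = [[0.5], [0.5], [0.5]]
    preds = [[0.2, 0.5, 0.1, 0.3], [0.7, 0.5, 0.1, 0.3], [0.45, 0.5, 0.1, 0.3]]
    gts = [[0.7, 0.5, 0.1, 0.3], [0.2, 0.5, 0.1, 0.3]]
    print(hungarian(match_cost(probs, preds, [0, 0], gts)))  # → [1, 0]
```

Because this matching runs for every image at every training step and its cost grows with the number of ground-truth boxes, images containing many pedestrians make it noticeably more expensive. That is the training-efficiency problem the abstract attributes to crowd datasets.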