Multiview detection incorporates multiple camera views to deal with occlusions, and its central problem is multiview aggregation. Given feature map projections from multiple views onto a common ground plane, the state-of-the-art method addresses this problem via convolution, which applies the same calculation regardless of object location. However, such translation-invariant behavior may not be the best choice, as object features undergo different projection distortions depending on their positions and the cameras that observe them. In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly introduced shadow transformer to aggregate multiview information. Unlike convolution, the shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions. We also propose an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. On two multiview detection benchmarks, we report new state-of-the-art accuracy with the proposed system. Code is available at https://github.com/hou-yz/MVDeTr.
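To make the idea of view-coherent augmentation concrete, the following is a minimal sketch, not the MVDeTr implementation. It assumes each view has a 3x3 homography mapping image pixels to ground-plane coordinates, and that the augmentation is a random similarity transform (rotation, scale, shift) applied to the image. Under those assumptions, composing the homography with the inverse of the augmentation keeps every augmented pixel pointing at the same ground location. All function names, parameter ranges, and the example homography are hypothetical.

```python
import math
import random


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_h(h, x, y):
    """Apply a 3x3 homography to a 2D point and dehomogenize."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def random_affine(rng, max_rot_deg=10.0, max_shift=20.0, scale_range=(0.8, 1.2)):
    """Sample a random rotation + scale + shift as a 3x3 matrix (assumed ranges)."""
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    s = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    c, sn = s * math.cos(theta), s * math.sin(theta)
    return [[c, -sn, tx], [sn, c, ty], [0.0, 0.0, 1.0]]


def invert_affine(a):
    """Closed-form inverse of a 3x3 affine matrix [[M, t], [0, 1]]."""
    (a00, a01, tx), (a10, a11, ty), _ = a
    det = a00 * a11 - a01 * a10
    i00, i01 = a11 / det, -a01 / det
    i10, i11 = -a10 / det, a00 / det
    return [[i00, i01, -(i00 * tx + i01 * ty)],
            [i10, i11, -(i10 * tx + i11 * ty)],
            [0.0, 0.0, 1.0]]


def augment_view(h_img_to_ground, rng):
    """Sample an augmentation A for one view and return (A, H @ A^-1).

    A is what would be applied to the image pixels; the updated homography
    maps augmented pixels to the same ground-plane points as before.
    """
    a = random_affine(rng)
    return a, matmul3(h_img_to_ground, invert_affine(a))
```

Each camera would draw its own random transform, so the views are augmented independently while their ground-plane projections stay aligned. For example, with `a, h_new = augment_view(h, random.Random(0))`, the point `apply_h(h_new, *apply_h(a, x, y))` agrees with `apply_h(h, x, y)` up to floating-point error.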