Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even so, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed. This paper takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since depth is the hardest to estimate for monocular detection, this paper proposes Depth EquiVarIAnt NeTwork (DEVIANT), built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold, whereas vanilla networks are not. The additional depth equivariance forces DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on the KITTI and Waymo datasets in the image-only category and performs competitively against methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. Code and models are available at https://github.com/abhi1kumar/DEVIANT
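The core observation above is that convolution commutes with translation but not with scaling, and a depth translation in the projective manifold manifests in the image as a scaling. A minimal 1D NumPy sketch (not from the paper; the circular convolution and naive 2x downsampling are illustrative assumptions) makes the contrast concrete:

```python
import numpy as np

def circ_conv(x, k):
    # Circular 1D convolution: y[i] = sum_j k[j] * x[(i - j) mod n]
    n = len(x)
    return np.array([sum(k[j] * x[(i - j) % n] for j in range(len(k)))
                     for i in range(n)])

x = np.array([1., 2., 3., 4., 5., 6.])
k = np.array([0.25, 0.5, 0.25])  # a simple smoothing kernel

# Translation equivariance: shifting the input then convolving
# equals convolving then shifting the output.
lhs = circ_conv(np.roll(x, 2), k)
rhs = np.roll(circ_conv(x, k), 2)
assert np.allclose(lhs, rhs)

# Scale (non-)equivariance: a naive 2x downsampling stands in for
# the image scaling induced by a depth translation. Convolution
# does not commute with it, so the two paths disagree.
scale = lambda v: v[::2]
lhs_s = circ_conv(scale(x), k)
rhs_s = scale(circ_conv(x, k))
print(np.allclose(lhs_s, rhs_s))  # → False
```

Scale equivariant steerable blocks, as used in DEVIANT, are designed so that an analogue of the first assertion also holds for scalings, which is what yields the depth equivariance claimed above.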