We approach instantaneous mapping, converting images to a top-down view of the world, as a translation problem. We show how a novel form of transformer network can be used to map from images and video directly to an overhead map or bird's-eye-view (BEV) of the world, in a single end-to-end network. We assume a one-to-one correspondence between a vertical scanline in the image and rays passing through the camera location in an overhead map. This lets us formulate map generation from an image as a set of sequence-to-sequence translations. Posing the problem as translation allows the network to use the context of the image when interpreting the role of each pixel. This constrained formulation, based upon a strong physical grounding of the problem, leads to a restricted transformer network that is convolutional in the horizontal direction only. The structure allows us to make efficient use of data when training, and it obtains state-of-the-art results for instantaneous mapping on three large-scale datasets, including relative gains of 15% and 30% over the existing best-performing methods on the nuScenes and Argoverse datasets, respectively. Our code is available at https://github.com/avishkarsaha/translating-images-into-maps.
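The central idea, that each vertical image column is translated into a polar BEV ray by sequence-to-sequence attention, with the same translation weights shared across all columns (hence "convolutional in the horizontal direction only"), can be illustrated with a minimal sketch. This is not the released implementation: the module name `ColumnToRay`, the feature shapes, and the `depth_bins` parameter are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of column-to-ray
# translation: each vertical scanline of CNN features is a source
# sequence, and a transformer decodes one polar BEV ray per column.
import torch
import torch.nn as nn


class ColumnToRay(nn.Module):
    """Translate each image column (length H) into a BEV ray (length D).

    Weights are shared across all W columns, so the module acts like a
    1-wide operation applied independently along the horizontal axis.
    """

    def __init__(self, channels: int = 64, depth_bins: int = 32, heads: int = 4):
        super().__init__()
        # Learned queries, one per radial depth bin along the camera ray
        # (an assumed query-based decoding scheme, as in DETR-style models).
        self.ray_queries = nn.Parameter(torch.randn(depth_bins, channels))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features from a CNN backbone.
        B, C, H, W = feats.shape
        # Fold the W columns into the batch: every column becomes an
        # independent source sequence of H pixel features.
        cols = feats.permute(0, 3, 2, 1).reshape(B * W, H, C)
        queries = self.ray_queries.unsqueeze(0).expand(B * W, -1, -1)
        # Cross-attend the depth-bin queries to the pixels of their column,
        # so each pixel is interpreted in the context of the whole scanline.
        rays = self.decoder(tgt=queries, memory=cols)  # (B*W, D, C)
        D = rays.shape[1]
        # Reassemble the rays into a polar BEV feature map: (B, C, D, W).
        return rays.reshape(B, W, D, C).permute(0, 3, 2, 1)


if __name__ == "__main__":
    bev = ColumnToRay()(torch.randn(2, 64, 48, 100))
    print(bev.shape)  # torch.Size([2, 64, 32, 100])
```

In this sketch the output is a polar feature map (one ray per image column); a full pipeline would additionally resample the polar rays into a Cartesian BEV grid before predicting map semantics.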