In robotics, Visual Place Recognition is a continuous process that takes a video stream as input and produces a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real-world applications. This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting the different mechanisms used to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence on the strengths and weaknesses of these different architectural choices. In comparison to existing sequential descriptor methods, we further investigate the viability of Transformers instead of CNN backbones, and we propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.
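For intuition only, below is a minimal, hypothetical PyTorch sketch of the sequence-level aggregation idea behind SeqVLAD: NetVLAD-style soft assignment applied jointly to the local features of every frame in a sequence, rather than to a single image. The class name, cluster count, and tensor layout are illustrative assumptions and not the released implementation (see the repository linked above for the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVLADSketch(nn.Module):
    """Illustrative sequence-level VLAD aggregation (hypothetical re-implementation).

    Local features from all frames of a sequence are soft-assigned to K learned
    clusters, and the cluster residuals are summed into a single sequence descriptor.
    """

    def __init__(self, num_clusters: int = 64, dim: int = 256):
        super().__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        # 1x1 convolution producing soft-assignment scores for each local feature
        self.assign = nn.Conv1d(dim, num_clusters, kernel_size=1, bias=True)
        # learned cluster centroids
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, S, N, D) = batch, frames per sequence, local features per frame, feature dim
        B, S, N, D = x.shape
        # pool local features across the whole sequence before assignment
        x = x.reshape(B, S * N, D)
        # soft-assignment weights: (B, K, S*N)
        soft = F.softmax(self.assign(x.transpose(1, 2)), dim=1)
        # residuals of each local feature w.r.t. each centroid: (B, K, S*N, D)
        residuals = x.unsqueeze(1) - self.centroids.view(1, self.num_clusters, 1, D)
        # weighted sum of residuals per cluster: (B, K, D)
        vlad = (soft.unsqueeze(-1) * residuals).sum(dim=2)
        # intra-normalization per cluster, then global L2 normalization
        vlad = F.normalize(vlad, p=2, dim=2)
        return F.normalize(vlad.flatten(1), p=2, dim=1)  # (B, K*D)

# Usage sketch: features for a batch of 2 sequences of 5 frames, 196 local features each
if __name__ == "__main__":
    feats = torch.randn(2, 5, 196, 256)
    print(SeqVLADSketch()(feats).shape)  # torch.Size([2, 16384])
```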