Point cloud-based large-scale place recognition is fundamental to many applications, such as Simultaneous Localization and Mapping (SLAM). Although many models have been proposed and achieve good performance by learning short-range local features, long-range contextual properties are often neglected. Moreover, model size has become a bottleneck for wide deployment. To overcome these challenges, we propose a super light-weight network model, termed SVT-Net, for large-scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), we propose an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) to learn both short-range local features and long-range contextual features. Built from ASVT and CSVT, SVT-Net achieves state-of-the-art performance on benchmark datasets in terms of both accuracy and speed with a super-light model size (0.9M). We also introduce two simplified versions of SVT-Net, which likewise achieve state-of-the-art performance while further reducing the model size to 0.8M and 0.4M, respectively.