Multi-modal sensor fusion is a commonly used approach to enhance the performance of odometry estimation, a fundamental module for mobile robots. However, the question of \textit{how to perform fusion among different modalities in a supervised sensor fusion odometry estimation task} remains a challenging one. Simple operations such as element-wise summation and concatenation cannot assign adaptive attentional weights to the different modalities, which makes it difficult to achieve competitive odometry results. Recently, the Transformer architecture has shown potential for multi-modal fusion tasks, particularly in the vision and language domains. In this work, we propose an end-to-end supervised Transformer-based LiDAR-Inertial fusion framework (termed TransFusionOdom) for odometry estimation. The multi-attention fusion module applies different fusion approaches to homogeneous and heterogeneous modalities, addressing the overfitting that can arise from blindly increasing model complexity. Additionally, to interpret the learning process of the Transformer-based multi-modal interactions, a general visualization approach is introduced to illustrate the interactions between modalities. Moreover, exhaustive ablation studies evaluate different multi-modal fusion strategies to verify the performance of the proposed one. A synthetic multi-modal dataset is made public to validate the generalization ability of the proposed fusion strategy, which also extends to other combinations of modalities. Quantitative and qualitative odometry evaluations on the KITTI dataset verify that the proposed TransFusionOdom achieves superior performance compared with other related works.
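To make the contrast between fixed and adaptive fusion concrete, the following minimal sketch (not the actual TransFusionOdom implementation; the module names, token counts, and feature width are illustrative assumptions) compares concatenation-based fusion, whose mixing weights are frozen after training, with a Transformer-style cross-attention fusion whose weights are recomputed per input, using PyTorch's \texttt{nn.MultiheadAttention}:

\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Baseline fusion: pool each modality, concatenate, project.
    The learned projection applies the same weights to every input."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, lidar_feat, imu_feat):
        # Average-pool the token dimension, then concatenate channels.
        pooled = torch.cat(
            [lidar_feat.mean(dim=1), imu_feat.mean(dim=1)], dim=-1)
        return self.proj(pooled)  # (batch, dim)

class CrossAttentionFusion(nn.Module):
    """Transformer-style fusion: LiDAR tokens query IMU tokens, so the
    mixing weights are recomputed adaptively for every input pair."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_feat, imu_feat):
        # Query = LiDAR tokens; Key/Value = IMU tokens.
        fused, attn_w = self.attn(lidar_feat, imu_feat, imu_feat)
        # Residual connection + normalization, as in a Transformer block.
        return self.norm(lidar_feat + fused), attn_w

if __name__ == "__main__":
    B, N_LIDAR, N_IMU, D = 2, 64, 10, 128  # illustrative sizes
    lidar = torch.randn(B, N_LIDAR, D)
    imu = torch.randn(B, N_IMU, D)
    baseline = ConcatFusion(D)(lidar, imu)               # fixed mixing
    fused, attn_w = CrossAttentionFusion(D)(lidar, imu)  # adaptive mixing
    print(baseline.shape, fused.shape, attn_w.shape)
    # torch.Size([2, 128]) torch.Size([2, 64, 128]) torch.Size([2, 64, 10])
\end{lstlisting}

The per-input attention weights returned by the cross-attention module are also the quantity that a visualization of modality interactions, such as the one described above, would display.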