Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, potentially undermining the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features while keeping the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images.
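To make the token-substitution idea concrete, the following is a minimal PyTorch sketch, under assumed design choices: a small MLP (`score_mlp`) produces a per-token importance score, tokens scoring below a hypothetical `threshold` are treated as uninformative, and a linear projection (`cross_proj`) maps the other modality's tokens before substitution. The names and the thresholding rule are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenFusionLayer(nn.Module):
    """Replaces low-scoring tokens of one modality with projected tokens of the other."""
    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        # Per-token importance score in [0, 1] (assumed scoring head).
        self.score_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        # Projection applied to the other modality before substitution (assumed).
        self.cross_proj = nn.Linear(dim, dim)
        self.threshold = threshold

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a, tokens_b: (batch, num_tokens, dim), assumed spatially aligned.
        scores = self.score_mlp(tokens_a)              # (B, N, 1)
        mask = (scores < self.threshold).float()       # 1 where a token is uninformative
        projected_b = self.cross_proj(tokens_b)        # projected inter-modal features
        # Keep informative tokens of modality A; substitute the rest with modality B.
        return tokens_a * (1.0 - mask) + projected_b * mask


# Usage example: fuse RGB and depth tokens of matching shape.
fusion = TokenFusionLayer(dim=256)
rgb_tokens = torch.randn(2, 196, 256)
depth_tokens = torch.randn(2, 196, 256)
fused_rgb = fusion(rgb_tokens, depth_tokens)
print(fused_rgb.shape)  # torch.Size([2, 196, 256])
```

Because substituted tokens inherit positions from the other modality, the residual positional alignment mentioned above would be applied on top of such a layer so that the transformer can still exploit where each injected feature came from.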