Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could in turn undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features while keeping the single-modal transformer architecture largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
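To make the token-substitution idea concrete, below is a minimal PyTorch sketch of one fusion step between two spatially aligned modality streams: a per-modality scorer estimates how informative each token is, and tokens scoring below a threshold are replaced by a projection of the corresponding token from the other modality. This is an illustrative sketch under our own assumptions, not the authors' implementation; the module and parameter names (TokenFusionLayer, score_a, proj_b2a, the 0.02 threshold) are hypothetical, and details such as score supervision and residual positional alignment are omitted. See the official repository above for the real code.

```python
import torch
import torch.nn as nn


class TokenFusionLayer(nn.Module):
    """Hypothetical sketch: fuse two modality token streams by substituting
    uninformative tokens with projected inter-modal features."""

    def __init__(self, dim, threshold=0.02):
        super().__init__()
        # Per-modality scorers predicting how informative each token is.
        self.score_a = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.score_b = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Projections mapping tokens across modalities before substitution.
        self.proj_a2b = nn.Linear(dim, dim)
        self.proj_b2a = nn.Linear(dim, dim)
        self.threshold = threshold

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (batch, num_tokens, dim), spatially aligned tokens.
        s_a = self.score_a(tok_a)  # (batch, num_tokens, 1) importance scores
        s_b = self.score_b(tok_b)
        # Mark uninformative tokens (score below threshold).
        mask_a = (s_a < self.threshold).float()
        mask_b = (s_b < self.threshold).float()
        # Keep informative tokens; replace the rest with projected
        # features from the other modality at the same positions.
        fused_a = (1 - mask_a) * tok_a + mask_a * self.proj_b2a(tok_b)
        fused_b = (1 - mask_b) * tok_b + mask_b * self.proj_a2b(tok_a)
        return fused_a, fused_b


# Usage example: fuse RGB and depth token streams of equal length.
layer = TokenFusionLayer(dim=768)
rgb_tokens = torch.randn(2, 196, 768)
depth_tokens = torch.randn(2, 196, 768)
fused_rgb, fused_depth = layer(rgb_tokens, depth_tokens)
```

In this sketch the substitution is a hard, position-wise swap, which keeps the single-modal transformer layers untouched; the fusion logic lives entirely in this small module inserted between attention blocks.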