Most existing bi-modal (RGB-D and RGB-T) salient object detection methods rely on the convolution operation and construct complex interwoven fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a performance ceiling on convolution-based methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed \underline{c}ross-mod\underline{a}l \underline{v}iew-mixed transform\underline{er} (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when equipped with the proposed components.
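The motivation for the parameter-free patch-wise token re-embedding can be illustrated with a minimal sketch: since self-attention costs $O(N^2)$ in the number of tokens $N$, pooling each non-overlapping patch of $p$ tokens into a single token shrinks the sequence by a factor of $p$ and the attention cost by roughly $p^2$. The function name, patch size, and the choice of mean pooling below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def patchwise_reembed(tokens, patch=4):
    """Hypothetical sketch of a parameter-free re-embedding:
    average-pool every group of `patch` consecutive tokens into one token.
    No learned weights are involved, matching the 'parameter-free' claim."""
    n, d = tokens.shape
    assert n % patch == 0, "token count must be divisible by the patch size"
    # (n, d) -> (n // patch, patch, d) -> mean over each patch -> (n // patch, d)
    return tokens.reshape(n // patch, patch, d).mean(axis=1)

# 64 tokens of dimension 32 -> 16 tokens: attention cost drops by ~16x
x = np.random.randn(64, 32)
y = patchwise_reembed(x, patch=4)
print(y.shape)  # (16, 32)
```

Because the pooling is parameter-free, it adds no trainable weights to the two-stream encoder-decoder and can be applied before any attention layer whose token count is large.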