Most of the existing bi-modal (RGB-D and RGB-T) salient object detection (SOD) methods utilize the convolution operation and construct complex interweaved fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a performance ceiling on convolution-based methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed \underline{c}ross-mod\underline{a}l \underline{v}iew-mixed transform\underline{er} (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Moreover, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when equipped with the proposed components. Code and pretrained models will be available at \href{https://github.com/lartpang/CAVER}{the link}.
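To make the complexity-reduction idea concrete, below is a minimal PyTorch sketch of a parameter-free patch-wise token re-embedding, assuming it is realized as patch-wise average pooling of the token map before attention; the function name \texttt{patch\_reembed} and the \texttt{patch\_size} argument are illustrative and do not denote the released implementation.
\begin{verbatim}
import torch
import torch.nn.functional as F

def patch_reembed(tokens, hw, patch_size=2):
    # tokens: (B, N, C) with N = H * W; hw = (H, W).
    # Merging each patch_size x patch_size neighborhood into one token
    # shrinks the sequence length by patch_size**2, reducing the
    # quadratic attention cost. No learnable parameters are introduced.
    b, n, c = tokens.shape
    h, w = hw
    assert n == h * w, "token count must match the spatial resolution"
    feat = tokens.transpose(1, 2).reshape(b, c, h, w)   # (B, C, H, W)
    feat = F.avg_pool2d(feat, kernel_size=patch_size)   # (B, C, H/p, W/p)
    return feat.flatten(2).transpose(1, 2)              # (B, N/p**2, C)

# Example: shorten key/value sequences before cross-modal attention.
rgb_tokens = torch.randn(2, 64 * 64, 256)
kv = patch_reembed(rgb_tokens, hw=(64, 64), patch_size=4)  # 4096 -> 256
\end{verbatim}
Because the pooling carries no weights, such a re-embedding adds no parameters while cutting the token count by a factor of \texttt{patch\_size}$^2$, which directly lowers the cost of the subsequent attention.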