Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors because spatial-temporal interaction happens late and implicitly, only in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which uses language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier, in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, the referring words and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding phase to further denoise and highlight spatial-temporally consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performance on four popular benchmarks, with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming about 7x less computational overhead.
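To make the idea of using language as a bridge between the two encoders more concrete, here is a minimal, illustrative PyTorch sketch. The tensor shapes, module names and the exact attention layout are assumptions for illustration only, not the paper's actual LBDT design: word features first attend to the temporal stream to gather language-relevant motion cues, and the spatial stream then attends to these motion-enriched words to receive the transferred information.

```python
# Hypothetical sketch of language-bridged transfer between a temporal and a
# spatial encoder; shapes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageBridgedTransfer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # words attend to temporal features -> language-relevant motion cues
        self.gather_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        # spatial features attend to motion-enriched words -> transfer the cues
        self.transfer_to_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial, temporal, words):
        # spatial:  (B, HW, C) appearance tokens from the spatial (2D) encoder
        # temporal: (B, HW, C) motion tokens from the temporal encoder
        # words:    (B, L,  C) referring-expression word embeddings
        motion_words, _ = self.gather_motion(words, temporal, temporal)
        bridged, _ = self.transfer_to_spatial(spatial, motion_words, motion_words)
        return spatial + bridged  # residual fusion of the transferred motion cues

# toy usage
spatial = torch.randn(2, 196, 256)
temporal = torch.randn(2, 196, 256)
words = torch.randn(2, 10, 256)
out = LanguageBridgedTransfer()(spatial, temporal, words)
print(out.shape)  # torch.Size([2, 196, 256])
```

Routing the interaction through the words keeps the exchanged information conditioned on the referring expression, which is the intuition behind performing the interaction already in the encoding phase.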