Semi-supervised video object segmentation (VOS) refers to segmenting the target object in the remaining frames of a video given its annotation in the first frame, and has been actively studied in recent years. The key challenge lies in finding effective ways to exploit the spatio-temporal context of past frames to help learn a discriminative target representation for the current frame. In this paper, we propose a novel Siamese network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical frames to the current frame. Technically, we use the transformer encoder and decoder to handle the past frames and the current frame separately, i.e., the encoder encodes robust spatio-temporal context of the target object from the past frames, while the decoder takes the feature embedding of the current frame as the query to retrieve the target from the encoder output. To further enhance the target representation, a feature interaction module (FIM) is devised to promote the information flow between the encoder and decoder. Moreover, we employ the Siamese architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods. Experimental results on three challenging benchmarks validate the superiority of SITVOS over state-of-the-art methods.
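The core mechanism described above, past frames encoded into a spatio-temporal memory that the current frame queries via attention, can be illustrated with a minimal sketch. This is not the authors' released code: the module name `ContextPropagation`, the layer counts, and the tensor shapes are assumptions for illustration, and the feature interaction module (FIM) is omitted.

```python
import torch
import torch.nn as nn

class ContextPropagation(nn.Module):
    """Minimal sketch (assumed, not the authors' implementation): a transformer
    encoder aggregates past-frame features into a spatio-temporal memory, and
    the current frame retrieves the target from that memory via cross-attention."""

    def __init__(self, dim=256, heads=8, enc_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, past_feats, cur_feats):
        # past_feats: (B, T*H*W, C) -- flattened backbone features of past frames
        # cur_feats:  (B, H*W, C)   -- backbone features of the current frame
        memory = self.encoder(past_feats)                    # spatio-temporal context
        out, _ = self.cross_attn(cur_feats, memory, memory)  # current frame queries the memory
        return out                                           # target-aware current-frame features

# Toy usage with random tensors (shapes are illustrative only).
if __name__ == "__main__":
    B, T, H, W, C = 1, 3, 16, 16, 256
    past = torch.randn(B, T * H * W, C)
    cur = torch.randn(B, H * W, C)
    print(ContextPropagation()(past, cur).shape)  # torch.Size([1, 256, 256])
```

In the paper's design, both past and current frames would share the same Siamese backbone before this stage, so the memory features can be reused across frames rather than recomputed.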