Referring video object segmentation aims to segment the object referred to by a given language expression. Existing methods typically require the compressed video bitstream to be decoded into RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the referring video object segmentation task itself, obtaining discriminative representations from compressed video is also challenging. To address this problem, we propose a multi-attention network that consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities: I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities; the fused multi-modal features then guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Unlike previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Experimental results on three challenging datasets are promising and show the effectiveness of our method compared with several state-of-the-art methods designed for RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
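The single-kernel idea can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that one object query is projected by a linear layer into a per-channel kernel, that the kernel is applied to the visual feature map as a 1x1 convolution, and that the logits are thresholded at zero. The function names (`linear`, `dynamic_kernel_mask`, `binarize`), the shapes, and the toy values are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
FeatureMap = List[Matrix]  # indexed as [channel][row][column]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Apply an affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dynamic_kernel_mask(query: Vector, features: FeatureMap,
                        weights: Matrix, bias: Vector) -> Matrix:
    """Turn one object query into a kernel and apply it per pixel.

    Assumption: the kernel has one weight per feature channel, so applying
    it is a dot product over channels at every spatial location.
    """
    kernel = linear(weights, bias, query)  # length = number of channels
    height, width = len(features[0]), len(features[0][0])
    return [[sum(kernel[c] * features[c][y][x] for c in range(len(kernel)))
             for x in range(width)]
            for y in range(height)]


def binarize(logits: Matrix) -> List[List[bool]]:
    """Threshold mask logits at zero (an assumed decision rule)."""
    return [[value > 0.0 for value in row] for row in logits]


# Toy example: 2-dimensional query, 2 feature channels, 2x2 feature map.
query = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
bias = [0.0, 0.0]
features = [
    [[2.0, 0.0], [1.0, 3.0]],  # channel 0
    [[1.0, 1.0], [1.0, 1.0]],  # channel 1
]

logits = dynamic_kernel_mask(query, features, weights, bias)
mask = binarize(logits)
```

Because only one kernel is generated, the sketch yields a single mask per frame, so no step is needed to match several candidate masks to the referred object.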