The referring video object segmentation (RVOS) task aims to segment, in every frame of a given video, the object instances referred to by a language expression. Because it requires understanding cross-modal semantics at the level of individual instances, this task is more challenging than traditional semi-supervised video object segmentation, in which the ground-truth object masks for the first frame are given. With the success of Transformers in object detection and object segmentation, RVOS has made remarkable progress, with ReferFormer achieving state-of-the-art performance. In this work, building on ReferFormer as a strong baseline, we propose several tricks to boost performance further: cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference. The improved ReferFormer ranks 2nd in the CVPR 2022 Referring Youtube-VOS Challenge.
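As an illustration of the first trick, a cyclical learning rate can be sketched as a triangular schedule that ramps linearly between a lower and an upper bound. The abstract does not state which schedule shape or values were used, so the function name `triangular_lr`, its parameters, and the numbers below are assumptions chosen only for illustration.

```python
def triangular_lr(step: int, base_lr: float, max_lr: float, half_cycle: int) -> float:
    """Return the learning rate at `step` for a triangular cyclical schedule.

    The rate rises linearly from base_lr to max_lr over `half_cycle` steps,
    then falls back to base_lr over the next `half_cycle` steps, and repeats.
    All names and values here are hypothetical, not taken from the paper.
    """
    # Position inside the current cycle, in [0, 2 * half_cycle).
    cycle_pos = step % (2 * half_cycle)
    # Fraction of the way to the peak: 0 at cycle boundaries, 1 at mid-cycle.
    frac = 1.0 - abs(cycle_pos / half_cycle - 1.0)
    return base_lr + (max_lr - base_lr) * frac


if __name__ == "__main__":
    # Print the schedule at a few steps of a 200-step cycle (illustrative values).
    for s in (0, 50, 100, 150, 200):
        print(s, triangular_lr(s, base_lr=1e-5, max_lr=1e-4, half_cycle=100))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each update step.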