Referring video object segmentation (RVOS) is the task of segmenting a text-referred object instance in the frames of a given video. Because this multimodal task combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components, and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows gains of +5.7 and +5.0 mAP on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR
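To make the idea of processing video and text together in one Transformer input more concrete, the following minimal sketch shows one way per-frame visual features and text token embeddings could be joined into multimodal sequences. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_multimodal_sequences`, the tensor layouts, and the choice to attach the text tokens to every frame are all hypothetical, and the features are assumed to have already been projected to a shared dimension `D`.

```python
import numpy as np


def build_multimodal_sequences(video_feats, text_feats):
    """Join per-frame visual tokens with text tokens (illustrative only).

    video_feats: array of shape (T, H, W, D), one feature map per frame.
    text_feats:  array of shape (L, D), one embedding per text token.
    Returns an array of shape (T, H*W + L, D): for each frame, its
    flattened visual tokens followed by the same L text tokens.
    """
    T, H, W, D = video_feats.shape
    L, D_text = text_feats.shape
    if D != D_text:
        raise ValueError("video and text features must share dimension D")

    # Flatten each frame's spatial grid into a sequence of H*W tokens.
    visual_tokens = video_feats.reshape(T, H * W, D)

    # Repeat the text tokens for every frame (assumption of this sketch).
    text_tokens = np.broadcast_to(text_feats, (T, L, D))

    # Concatenate along the token axis to form one sequence per frame.
    return np.concatenate([visual_tokens, text_tokens], axis=1)


# Hypothetical sizes: 2 frames, a 3x4 feature map, 6 text tokens, D = 5.
video = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
text = np.ones((6, 5))
sequences = build_multimodal_sequences(video, text)
print(sequences.shape)  # (2, 18, 5)
```

Each resulting sequence could then be passed to a standard Transformer encoder, so that attention operates jointly over visual and text tokens without a separate text-specific fusion module.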