Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose the Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free, consisting of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build one-to-one correspondences between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results while maintaining high inference speed, e.g., 46.6 AP at 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
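To make the messenger shift idea concrete, below is a minimal sketch, not the authors' implementation: each frame's token sequence carries a few extra messenger tokens, and before a backbone stage those tokens are cyclically shifted along the temporal axis so frames exchange context with no learned parameters. The tensor layout, the single-step `torch.roll`, and the function name `messenger_shift` are illustrative assumptions; see the official repository for the actual design.

```python
import torch

def messenger_shift(tokens: torch.Tensor, num_messengers: int) -> torch.Tensor:
    """Hypothetical sketch of nearly parameter-free temporal fusion.

    tokens: (T, N + M, C) — T frames, each with N patch tokens followed
    by M messenger tokens, C channels.
    """
    patches = tokens[:, :-num_messengers]      # (T, N, C) per-frame patch tokens
    messengers = tokens[:, -num_messengers:]   # (T, M, C) per-frame messenger tokens
    # Roll messenger tokens one step along the temporal (frame) axis,
    # so frame t receives the messengers computed at frame t - 1.
    messengers = torch.roll(messengers, shifts=1, dims=0)
    return torch.cat([patches, messengers], dim=1)

# Usage: 8 frames, 196 patch tokens + 4 messenger tokens, 256 channels.
x = torch.randn(8, 196 + 4, 256)
x = messenger_shift(x, num_messengers=4)  # still (8, 200, 256)
```

Because the shift is a pure tensor permutation, it adds essentially no parameters or FLOPs, which matches the abstract's claim of negligible extra computational cost.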