In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformers that model instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, while attention should be performed on each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of the video-level instance, which is used to dynamically predict the mask sequence on each frame. Instance tracking is achieved naturally, without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone, without bells and whistles, exceeding the previous state-of-the-art performance by 4.6 and 4.4 AP, respectively. In addition, integrated with the recently proposed Swin Transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer can serve as a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model. The code is available at https://github.com/wjf5203/SeqFormer.
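The abstract describes a per-frame attention scheme tied to a single video-level instance query, followed by temporal aggregation and dynamic mask prediction. The sketch below is a minimal illustration of that idea only, not the authors' implementation; the class name `SeqFormerSketch`, the weighted-sum aggregation, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the abstract's idea: shared video-level instance queries attend
# to each frame independently, per-frame responses are aggregated over time, and the
# aggregated representation generates dynamic kernels that predict masks per frame.
import torch
import torch.nn as nn


class SeqFormerSketch(nn.Module):
    def __init__(self, dim=256, num_queries=10, mask_dim=8):
        super().__init__()
        self.instance_queries = nn.Embedding(num_queries, dim)  # video-level queries
        self.frame_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.weight_proj = nn.Linear(dim, 1)          # per-frame aggregation weights
        self.dynamic_head = nn.Linear(dim, mask_dim)  # dynamic kernel generator
        self.mask_proj = nn.Linear(dim, mask_dim)     # frame features for mask decoding

    def forward(self, frame_feats):
        # frame_feats: (T, H*W, dim), flattened per-frame features of one video clip
        T = frame_feats.shape[0]
        q = self.instance_queries.weight.unsqueeze(0)  # (1, N, dim), shared across frames

        per_frame = []
        for t in range(T):
            # attention is performed on each frame independently
            out, _ = self.frame_attn(q, frame_feats[t:t + 1], frame_feats[t:t + 1])
            per_frame.append(out)                      # (1, N, dim)
        per_frame = torch.cat(per_frame, dim=0)        # (T, N, dim)

        # aggregate temporal information into one video-level instance representation
        w = torch.softmax(self.weight_proj(per_frame), dim=0)  # (T, N, 1)
        video_query = (w * per_frame).sum(dim=0)               # (N, dim)

        # dynamic mask prediction: kernels from the video-level query, applied per frame
        kernels = self.dynamic_head(video_query)               # (N, mask_dim)
        mask_feats = self.mask_proj(frame_feats)                # (T, H*W, mask_dim)
        masks = torch.einsum('nc,tpc->tnp', kernels, mask_feats)  # (T, N, H*W)
        return masks.sigmoid()


# Usage: 5 frames of 32x32 feature maps with 256 channels
feats = torch.randn(5, 32 * 32, 256)
masks = SeqFormerSketch()(feats)  # (5, 10, 1024): one mask sequence per query
print(masks.shape)
```

Because each query produces one mask sequence across all frames, tracking falls out of the query assignment itself, which mirrors the abstract's claim that no tracking branch or post-processing is needed.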