Video instance segmentation (VIS) aims to segment and track objects in videos. Prior methods typically generate frame-level or clip-level object instances first and then associate them using either additional tracking heads or complex instance matching algorithms. This explicit instance association increases system complexity and fails to fully exploit temporal cues in videos. In this paper, we design a simple, fast, yet effective query-based framework for online VIS. Relying on an instance query and proposal propagation mechanism with several specially developed components, this framework performs accurate instance association implicitly. Specifically, we generate frame-level object instances from a set of instance query-proposal pairs propagated from previous frames. Each instance query-proposal pair learns to bind to one specific object across frames through carefully designed strategies. When such a pair is used to predict an object instance on the current frame, the generated instance is automatically associated with its precursors on previous frames, and the model also gets a good prior for predicting the same object. In this way, we achieve implicit instance association in parallel with segmentation and take advantage of temporal cues in videos. To show the effectiveness of our method, InsPro, we evaluate it on two popular VIS benchmarks, YouTube-VIS 2019 and YouTube-VIS 2021. Without bells and whistles, InsPro with a ResNet-50 backbone achieves 43.2 AP and 37.6 AP on these two benchmarks, respectively, outperforming all other online VIS methods.
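The propagation idea described above can be illustrated with a toy sketch. This is not the InsPro implementation: the names `Slot`, `toy_head`, and `run_online` are hypothetical, the "frame" is reduced to a global motion offset, and the prediction head is a stand-in that simply shifts each box. The sketch only shows how carrying query-proposal pairs from frame to frame lets the slot index act as the track identity, so no separate matching step is needed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Frame = Tuple[float, float]              # toy "frame": a global (dx, dy) motion


@dataclass(frozen=True)
class Slot:
    """One instance query-proposal pair (hypothetical container)."""
    query: Tuple[float, ...]  # stand-in for a learned query embedding
    proposal: Box             # box prior carried over from the previous frame


Head = Callable[[Frame, Slot], Slot]


def toy_head(frame: Frame, slot: Slot) -> Slot:
    """Assumed stand-in for a prediction head: shifts the proposal by the
    frame's motion and leaves the query unchanged."""
    dx, dy = frame
    x1, y1, x2, y2 = slot.proposal
    return Slot(query=slot.query, proposal=(x1 + dx, y1 + dy, x2 + dx, y2 + dy))


def run_online(
    frames: List[Frame], slots: List[Slot], head: Head = toy_head
) -> List[List[Tuple[int, Box]]]:
    """Process frames one at a time, propagating each slot to the next frame.

    The slot index is reused as the track ID on every frame, so association
    across frames falls out of the propagation itself.
    """
    tracks = []
    for frame in frames:
        # Each slot predicts on the current frame, starting from its own prior.
        slots = [head(frame, s) for s in slots]
        tracks.append([(track_id, s.proposal) for track_id, s in enumerate(slots)])
    return tracks


initial = [
    Slot(query=(1.0, 0.0), proposal=(0.0, 0.0, 10.0, 10.0)),
    Slot(query=(0.0, 1.0), proposal=(20.0, 20.0, 30.0, 30.0)),
]
tracks = run_online([(1.0, 0.0), (1.0, 2.0)], initial)
```

In a real system the head would be a learned network that also updates the query and predicts a mask and a class score; those parts are omitted here because the text above does not specify them.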