Modeling temporal information for both detection and tracking in a unified framework has proved to be a promising solution to video instance segmentation (VIS). However, how to effectively incorporate temporal information into an online model remains an open problem. In this work, we propose a new online VIS paradigm named Instance As Identity (IAI), which models temporal information for both detection and tracking in an efficient way. Specifically, IAI employs a novel identification module that explicitly predicts identification numbers for tracking instances. To pass temporal information across frames, IAI utilizes an association module that combines current features with past embeddings. Notably, IAI can be integrated with different image models. We conduct extensive experiments on three VIS benchmarks. IAI outperforms all online competitors on YouTube-VIS-2019 (ResNet-101, 43.7 mAP) and YouTube-VIS-2021 (ResNet-50, 38.0 mAP). Surprisingly, on the more challenging OVIS, IAI achieves SOTA performance (20.6 mAP). Code is available at https://github.com/zfonemore/IAI
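The two components named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`associate`, `predict_ids`), the concatenation-based fusion, and the linear identification head are all simplifying assumptions made for illustration; the actual IAI modules are learned network components described in the paper and repository.

```python
import numpy as np

def associate(current_feats, past_embeds):
    """Hypothetical association step: fuse current-frame instance
    features with embeddings carried over from past frames by
    concatenating along the channel axis."""
    return np.concatenate([current_feats, past_embeds], axis=-1)

def predict_ids(fused, id_weights):
    """Hypothetical identification head: a linear projection whose
    argmax assigns each instance an explicit identification number."""
    logits = fused @ id_weights       # (num_instances, num_ids)
    return logits.argmax(axis=-1)     # one ID per instance

# Toy example: 3 instances, 4-dim current features, 4-dim past embeddings.
rng = np.random.default_rng(0)
cur = rng.normal(size=(3, 4))
past = rng.normal(size=(3, 4))
fused = associate(cur, past)          # shape (3, 8)
ids = predict_ids(fused, rng.normal(size=(8, 10)))
print(fused.shape, ids.shape)
```

In an online setting, the embeddings of instances identified in frame t would be stored and fed back as `past_embeds` for frame t+1, which is how temporal information crosses frame boundaries.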