In this paper, we present offline-to-online knowledge distillation (OOKD) for video instance segmentation (VIS), which transfers a wealth of video knowledge from an offline model to an online model for consistent prediction. Unlike previous methods that adopt either an online or an offline model, our single online model takes advantage of both by distilling offline knowledge. To transfer knowledge correctly, we propose query filtering and association (QFA), which filters out queries irrelevant to exact instances. Our KD with QFA increases the robustness of feature matching by encoding object-centric features from a single frame, supplemented by long-range global information. We also propose a simple data-augmentation scheme for knowledge distillation in the VIS task that transfers the knowledge of all classes into the online model fairly. Extensive experiments show that our method significantly improves performance on video instance segmentation, especially on challenging datasets that include long, dynamic sequences. Our method also achieves state-of-the-art performance on the YTVIS-21, YTVIS-22, and OVIS datasets, with mAP scores of 46.1%, 43.6%, and 31.1%, respectively.
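To make the query-distillation idea concrete, the following is a minimal sketch of one plausible filtering-and-association step: confident teacher (offline) queries are kept, associated one-to-one with student (online) queries via Hungarian matching, and distilled with an L2 loss. All names, shapes, the cosine-similarity cost, and the loss choice are illustrative assumptions, not the paper's exact QFA formulation.

```python
# Hypothetical sketch of offline-to-online query distillation with a
# filtering-and-association step; details are assumptions for illustration.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def filter_and_associate(teacher_q, student_q, teacher_scores, score_thresh=0.5):
    """Keep confident teacher queries, then associate each with a student
    query via Hungarian matching on cosine similarity.

    teacher_q:      (Nt, D) offline (video-level) query embeddings
    student_q:      (Ns, D) online (frame-level) query embeddings
    teacher_scores: (Nt,)   teacher classification confidences
    """
    keep = teacher_scores > score_thresh          # filter irrelevant queries
    teacher_q = teacher_q[keep]
    if teacher_q.numel() == 0:
        return None, None
    # Cost = negative cosine similarity between every teacher/student pair.
    cost = -F.normalize(teacher_q, dim=-1) @ F.normalize(student_q, dim=-1).T
    t_idx, s_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return teacher_q[t_idx], student_q[s_idx]


def distillation_loss(teacher_q, student_q, teacher_scores):
    """L2 distillation between associated teacher and student queries."""
    t, s = filter_and_associate(teacher_q, student_q, teacher_scores)
    if t is None:
        return torch.zeros((), requires_grad=True)
    return F.mse_loss(s, t.detach())              # teacher is frozen


# Toy usage with random embeddings standing in for model outputs.
torch.manual_seed(0)
loss = distillation_loss(
    teacher_q=torch.randn(100, 256),
    student_q=torch.randn(100, 256, requires_grad=True),
    teacher_scores=torch.rand(100),
)
loss.backward()
print(f"KD loss: {loss.item():.4f}")
```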