Mask-Free Video Instance Segmentation

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.
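The three steps named in the abstract (patch matching, K-nearest-neighbor selection, and a consistency loss on the matches) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which lives in the linked repository. The function names, the search radius, the patch size, the squared-difference patch distance, and the agreement-based form of the consistency term are all assumptions made for illustration.

```python
import math


def patch(frame, y, x, half):
    """Flatten the (2*half+1)^2 patch centred at (y, x), clamping at borders."""
    h, w = len(frame), len(frame[0])
    return [
        frame[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
        for dy in range(-half, half + 1)
        for dx in range(-half, half + 1)
    ]


def knn_matches(frame_a, frame_b, y, x, radius=1, half=1, k=2):
    """Return up to k locations in frame_b whose patches best match (y, x) in frame_a.

    Assumed details: candidates are limited to a square window of the given
    radius, and patches are compared by sum of squared differences.
    """
    h, w = len(frame_b), len(frame_b[0])
    ref = patch(frame_a, y, x, half)
    candidates = []
    for qy in range(max(y - radius, 0), min(y + radius, h - 1) + 1):
        for qx in range(max(x - radius, 0), min(x + radius, w - 1) + 1):
            cand = patch(frame_b, qy, qx, half)
            dist = sum((a - b) ** 2 for a, b in zip(ref, cand))
            candidates.append((dist, qy, qx))
    candidates.sort()  # smallest patch distance first
    return [(qy, qx) for _, qy, qx in candidates[:k]]  # one-to-many matches


def consistency_loss(mask_a, mask_b, frame_a, frame_b,
                     radius=1, half=1, k=2, eps=1e-6):
    """Average penalty for disagreeing mask values at matched locations.

    Assumed form: -log(p*q + (1-p)*(1-q)), which is 0 when both mask
    probabilities agree with full confidence and grows as they diverge.
    """
    total, count = 0.0, 0
    for y in range(len(frame_a)):
        for x in range(len(frame_a[0])):
            p = mask_a[y][x]
            for qy, qx in knn_matches(frame_a, frame_b, y, x, radius, half, k):
                q = mask_b[qy][qx]
                agree = p * q + (1.0 - p) * (1.0 - q)
                total += -math.log(max(agree, eps))
                count += 1
    return total / max(count, 1)


# Toy data: frame_b is frame_a shifted one pixel to the right.
frame_a = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
frame_b = [[0, 0, 1, 2], [4, 4, 5, 6], [8, 8, 9, 10]]
mask_a = [[0.0, 1.0, 1.0, 0.0] for _ in range(3)]
mask_b = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]  # mask shifted the same way

print(knn_matches(frame_a, frame_b, 1, 1, k=1))
print(consistency_loss(mask_a, mask_b, frame_a, frame_b, k=1))
```

The sketch uses no learned components, in line with the abstract's statement that the objective has no trainable parameters. A practical version would vectorize the patch comparison over all pixels rather than loop in Python.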

