Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches that rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt the inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric $AP_{50}$ on video frames of two datasets, Youtube-VIS and Cityscapes, by $5\%$ and $3\%$ respectively.
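The abstract describes MaskConsist only at a high level. As a minimal, hypothetical sketch of the general idea of transferring stable predictions between neighboring frames, the snippet below warps frame-$t$ predicted masks to frame $t{+}1$ with a precomputed optical flow field and adds unmatched warped masks as extra pseudo-labels. All names (`warp_mask`, `iou`, `transfer_stable_masks`), the nearest-neighbor warping, and the IoU-based matching rule are illustrative assumptions, not the authors' actual formulation.

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary mask from frame t to frame t+1 (nearest-neighbor).

    mask: (H, W) bool array of a predicted instance in frame t.
    flow: (H, W, 2) backward flow; for each pixel in frame t+1 it gives
          the (dx, dy) offset of its source location in frame t.
    """
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return mask[src_y, src_x]

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def transfer_stable_masks(masks_t, masks_t1, flow, iou_thresh=0.5):
    """Augment frame-(t+1) pseudo-labels with stable frame-t predictions.

    A warped frame-t mask with no sufficiently overlapping partner in
    frame t+1 is treated as an instance the model missed there and is
    appended as an additional pseudo-label. Threshold is an assumption.
    """
    augmented = list(masks_t1)
    for m in masks_t:
        warped = warp_mask(m, flow)
        best = max((iou(warped, n) for n in masks_t1), default=0.0)
        if best < iou_thresh:  # no stable match -> likely missed instance
            augmented.append(warped)
    return augmented
```

In this toy version the transfer is one-directional ($t \to t{+}1$); a symmetric variant that also propagates stable frame-$(t{+}1)$ masks back to frame $t$ would follow the same pattern.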