Instance segmentation in videos, which aims to segment and track multiple objects across video frames, has attracted a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with \textbf{S}patio-\textbf{T}emporal \textbf{C}ollaboration for instance \textbf{Seg}mentation in videos, namely \textbf{STC-Seg}. Concretely, STC-Seg makes four contributions. First, we leverage complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance mask generation, we devise a puzzle loss, which enables end-to-end training using only box-level annotations. Third, our tracking module jointly exploits bounding-box diagonal points and spatio-temporal discrepancy to model object movements, which greatly improves robustness to variations in object appearance. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reveals only the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
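To make the first contribution concrete, the following is a minimal NumPy sketch of \emph{one} way complementary depth and optical-flow cues could be fused into a foreground pseudo-label. The thresholds, the gradient-based edge test, and the fusion rule in \texttt{fuse\_pseudo\_labels} are illustrative assumptions for exposition, not the actual STC-Seg procedure.

\begin{verbatim}
import numpy as np

def fuse_pseudo_labels(depth, flow, flow_thresh=1.0, edge_thresh=0.1):
    """Fuse depth and optical-flow cues into a binary foreground pseudo-mask.

    depth: (H, W) array, predicted monocular depth.
    flow:  (H, W, 2) array, predicted optical flow (dx, dy) per pixel.
    Both thresholds are illustrative and would be tuned per dataset.
    """
    # Motion cue: pixels whose flow magnitude exceeds a threshold are
    # candidate foreground (moving objects stand out from the background).
    flow_mag = np.linalg.norm(flow, axis=-1)
    motion_mask = flow_mag > flow_thresh

    # Structure cue: large depth gradients mark object boundaries, which
    # can trim motion blobs that bleed across depth discontinuities.
    gy, gx = np.gradient(depth)
    depth_edges = np.hypot(gx, gy) > edge_thresh

    # Hypothetical fusion rule: keep moving pixels, cut them at depth edges.
    return motion_mask & ~depth_edges

# Toy usage with random inputs standing in for real network predictions.
H, W = 128, 384
depth = np.random.rand(H, W).astype(np.float32)
flow = np.random.randn(H, W, 2).astype(np.float32)
pseudo_mask = fuse_pseudo_labels(depth, flow)
\end{verbatim}

In a pipeline of this kind, such pseudo-masks would stand in for ground-truth masks when supervising the segmentation head, which is what allows training without pixel-level annotations.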