We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, per-clip pipelines have shown superior performance over per-frame methods by leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communication, limiting their practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduce the overhead of information-passing between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a means of conveying information as well as summarizing each frame's scene. The features of each frame are enriched and correlated with those of other frames through the exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmarks and achieve state-of-the-art performance (AP 44.6 on the YouTube-VIS 2019 val set using offline inference) with a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference, processing a video in real time with only a small delay. The code will be made available.
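To make the memory-token idea concrete, the sketch below shows one possible layer in which per-frame tokens first summarize their frame, then exchange information across frames, and finally enrich the frame features with clip-level context. This is a minimal illustration, not the authors' implementation: the class and argument names, dimensions, and the exact attention pattern (including the omitted residual connections and normalization) are assumptions.

```python
# Hypothetical sketch of inter-frame communication via memory tokens.
# Names, shapes, and attention layout are illustrative assumptions only.
import torch
import torch.nn as nn


class InterFrameCommunicationLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_memory_tokens=8):
        super().__init__()
        self.num_memory_tokens = num_memory_tokens
        # Memory tokens summarize each frame by attending to its features.
        self.summarize = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Memory tokens from all frames exchange information with each other.
        self.communicate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Frame features read back the updated tokens to gain clip context.
        self.enrich = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats, memory_tokens):
        # frame_feats:   (T, N, dim)  flattened spatial features per frame
        # memory_tokens: (T, M, dim)  M concise tokens per frame
        T, M, dim = memory_tokens.shape
        # 1) Each frame's memory tokens summarize that frame's scene.
        mem, _ = self.summarize(memory_tokens, frame_feats, frame_feats)
        # 2) Tokens of all frames communicate; cost scales with T*M tokens
        #    rather than the full spatial resolution of every frame.
        mem_all = mem.reshape(1, T * M, dim)
        mem_all, _ = self.communicate(mem_all, mem_all, mem_all)
        mem = mem_all.reshape(T, M, dim)
        # 3) Frame features are enriched with the exchanged clip-level context.
        out, _ = self.enrich(frame_feats, mem, mem)
        return out, mem


# Toy usage: a clip of 5 frames, each with 14*14 flattened spatial locations.
layer = InterFrameCommunicationLayer()
frames = torch.randn(5, 14 * 14, 256)
memory = torch.randn(5, 8, 256)
frames_out, memory_out = layer(frames, memory)
```

The point of this arrangement is that frames never attend to each other's full feature maps; all cross-frame interaction is routed through the small set of memory tokens, which is what keeps the communication overhead low.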