Recently, several Space-Time Memory based networks have shown that object cues (e.g., video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information in the memory by global-to-global matching between the current and past frames, which leads to mismatching with similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised VOS, namely the Regional Memory Network (RMNet). In RMNet, a precise regional memory is constructed by memorizing only the local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both the memory and query frames, and allows information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
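To make the local-to-local matching idea concrete, the following is a minimal sketch (not the authors' implementation): the memory region is the bounding box around the object mask in a past frame, the query region is that box warped by the estimated optical flow, and memory attention is computed only between those two regions. Tensor layouts, the margin parameter, and all function names here are illustrative assumptions.

```python
# A minimal sketch of regional (local-to-local) memory matching.
# Shapes, the `margin` value, and function names are assumptions for illustration.
import torch
import torch.nn.functional as F


def mask_to_box(mask, margin=8):
    """Bounding box (x0, y0, x1, y1) around the non-zero pixels of a 2-D mask."""
    ys, xs = torch.nonzero(mask > 0.5, as_tuple=True)
    if ys.numel() == 0:                       # object absent: fall back to the full frame
        return 0, 0, mask.shape[1], mask.shape[0]
    x0 = max(int(xs.min()) - margin, 0)
    y0 = max(int(ys.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin + 1, mask.shape[1])
    y1 = min(int(ys.max()) + margin + 1, mask.shape[0])
    return x0, y0, x1, y1


def warp_box_with_flow(box, flow):
    """Shift a box by the mean optical-flow displacement inside it (flow: 2 x H x W)."""
    x0, y0, x1, y1 = box
    dx = float(flow[0, y0:y1, x0:x1].mean())
    dy = float(flow[1, y0:y1, x0:x1].mean())
    h, w = flow.shape[1:]
    x0 = min(max(int(round(x0 + dx)), 0), w - 1)
    y0 = min(max(int(round(y0 + dy)), 0), h - 1)
    x1 = min(max(int(round(x1 + dx)), x0 + 1), w)
    y1 = min(max(int(round(y1 + dy)), y0 + 1), h)
    return x0, y0, x1, y1


def local_to_local_matching(mem_key, mem_val, qry_key, mem_box, qry_box):
    """
    Attend from the query region to the memory region only.
    mem_key, qry_key: C x H x W feature maps; mem_val: Cv x H x W.
    Returns a Cv x H x W readout that is zero outside the query box.
    """
    mx0, my0, mx1, my1 = mem_box
    qx0, qy0, qx1, qy1 = qry_box
    mk = mem_key[:, my0:my1, mx0:mx1].flatten(1)            # C  x Nm
    mv = mem_val[:, my0:my1, mx0:mx1].flatten(1)            # Cv x Nm
    qk = qry_key[:, qy0:qy1, qx0:qx1].flatten(1)            # C  x Nq
    attn = F.softmax(mk.t() @ qk / mk.shape[0] ** 0.5, 0)   # Nm x Nq, softmax over memory
    read = mv @ attn                                        # Cv x Nq
    out = torch.zeros(mem_val.shape[0], *qry_key.shape[1:])
    out[:, qy0:qy1, qx0:qx1] = read.view(-1, qy1 - qy0, qx1 - qx0)
    return out
```

Because attention is restricted to the two regions rather than the full frames, the cost of the matrix product drops from O((HW)^2) to O(Nm x Nq), and pixels belonging to visually similar objects outside the regions can no longer be mismatched.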