Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network that greatly reduces the redundancy of memory features. Unlike previous methods, which only detect feature redundancy between frames, SWEM merges both intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm. Furthermore, adaptive weights for frame features give SWEM the flexibility to represent hard samples, improving the discriminability of templates. In addition, the proposed method maintains a fixed number of template features in memory, which ensures stable inference complexity of the VOS system. Extensive experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3\% $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2017 validation set) of SWEM. Code is available at: https://github.com/lmm077/SWEM.
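The core idea of compressing a growing set of memory features into a fixed number of templates via weighted EM can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name, initialization, and normalization choices here are illustrative assumptions. A soft E-step assigns every frame feature (intra- and inter-frame alike) to a small set of bases, and a weighted M-step updates the bases, so memory size stays constant regardless of video length:

```python
import numpy as np

def weighted_em_compress(feats, weights, num_bases=128, iters=3):
    """Compress N feature vectors into a fixed number of bases via weighted EM.

    feats:   (N, C) array of memory features (intra- and inter-frame pooled).
    weights: (N,) per-feature adaptive weights (e.g., emphasizing hard samples).
    Returns (num_bases, C) compressed template bases. Illustrative sketch only.
    """
    N, C = feats.shape
    rng = np.random.default_rng(0)
    # Initialize bases by sampling existing features (assumed init scheme).
    mu = feats[rng.choice(N, num_bases, replace=False)]
    for _ in range(iters):
        # E-step: soft responsibilities of each feature for each basis.
        logits = feats @ mu.T                                   # (N, K)
        z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
        z /= z.sum(axis=1, keepdims=True)
        # M-step: weight responsibilities by per-feature importance, then
        # update bases as weighted averages of the features.
        zw = z * weights[:, None]                               # (N, K)
        mu = zw.T @ feats / (zw.sum(axis=0)[:, None] + 1e-6)    # (K, C)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-6  # unit-norm bases
    return mu
```

Because the output always has `num_bases` rows, downstream matching cost is fixed per frame, which is what keeps inference complexity stable as the video grows.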