We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact and thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay in long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (which do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
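To make the memory hierarchy concrete, the following is a minimal, hypothetical Python sketch of a three-store feature memory with usage-driven consolidation, in the spirit of the abstract. It is not the actual XMem implementation: all names (`MultiStoreMemory`, `usage_threshold`, `keep_recent`, ...) are invented for illustration, and the real method additionally compresses consolidated entries into compact prototypes rather than copying them verbatim.

```python
import numpy as np

# Hypothetical sketch of the three memory stores described above;
# not the actual XMem code. Keys/queries are assumed to be 1-D
# feature vectors of matching dimension.
class MultiStoreMemory:
    def __init__(self, working_capacity=10, keep_recent=5, usage_threshold=3):
        self.sensory = None    # rapidly updated: overwritten every frame
        self.working = []      # high-resolution {key, value, usage} entries
        self.long_term = []    # compact, sustained entries
        self.working_capacity = working_capacity
        self.keep_recent = keep_recent
        self.usage_threshold = usage_threshold

    def update_sensory(self, frame_features):
        # The sensory store keeps only the latest frame's features.
        self.sensory = frame_features

    def read(self, query):
        # Attention-style read over both stores; usage counts record which
        # entries are "actively used" so they can be consolidated later.
        entries = self.working + self.long_term
        scores = np.array([float(query @ e["key"]) for e in entries])
        best = entries[int(scores.argmax())]
        best["usage"] += 1
        return best["value"]

    def write_working(self, key, value):
        self.working.append({"key": key, "value": value, "usage": 0})
        if len(self.working) > self.working_capacity:
            self._consolidate()

    def _consolidate(self):
        # Memory potentiation, schematically: older working-memory entries
        # that were actively used are promoted to long-term memory; the
        # rest are discarded, bounding memory growth on long videos.
        old = self.working[:-self.keep_recent]
        recent = self.working[-self.keep_recent:]
        self.long_term.extend(e for e in old if e["usage"] >= self.usage_threshold)
        self.working = recent
```

In this toy version, the working memory stays small and the long-term store grows only with entries that were repeatedly matched, which is the property the abstract highlights: memory consumption is decoupled from video length while frequently useful features are retained.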