Transformers have recently become popular for learning and inference in the spatial-temporal domain. However, their performance relies on storing and applying attention to the feature tensor of each frame in the video. Hence, their space and time complexity grow linearly with the length of the video, which can be very costly for long videos. We propose a novel visual memory network architecture for the learning and inference problem in the spatial-temporal domain. We maintain a fixed set of memory slots in our memory network and propose an algorithm based on Gumbel-Softmax to learn an adaptive strategy for updating this memory. Finally, this architecture is benchmarked on the video object segmentation (VOS) and video prediction problems. We demonstrate that our memory architecture achieves state-of-the-art results, outperforming transformer-based methods on VOS and other recent methods on video prediction, while maintaining constant memory capacity independent of the sequence length.
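To make the fixed-slot update idea concrete, below is a minimal sketch of a memory module whose slot to overwrite is chosen with a hard Gumbel-Softmax sample. It is an illustration under assumptions, not the authors' exact architecture: the module name `SlotMemory`, the pooled per-frame feature, and the pairwise scoring head are all hypothetical choices made for the example.

```python
# Minimal sketch (assumed design, not the paper's exact formulation) of a
# fixed-size visual memory updated via Gumbel-Softmax slot selection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMemory(nn.Module):
    def __init__(self, num_slots: int = 8, feat_dim: int = 256, tau: float = 1.0):
        super().__init__()
        self.num_slots = num_slots
        self.tau = tau
        # Fixed number of slots, so memory capacity stays constant
        # no matter how many frames are processed.
        self.register_buffer("memory", torch.zeros(num_slots, feat_dim))
        # Hypothetical scorer: rates how suitable each slot is for overwriting
        # given the incoming frame feature.
        self.selector = nn.Linear(2 * feat_dim, 1)

    def update(self, frame_feat: torch.Tensor) -> torch.Tensor:
        """frame_feat: (feat_dim,) pooled feature of the current frame."""
        # Pair the incoming feature with every slot and score each pairing.
        paired = torch.cat(
            [self.memory, frame_feat.expand(self.num_slots, -1)], dim=-1
        )
        logits = self.selector(paired).squeeze(-1)                   # (num_slots,)
        # Differentiable, (near) one-hot choice of the slot to overwrite.
        choice = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # (num_slots,)
        # Write the new feature into the chosen slot; leave the rest intact.
        self.memory = (1 - choice).unsqueeze(-1) * self.memory \
            + choice.unsqueeze(-1) * frame_feat
        return self.memory
```

In this sketch the hard sample keeps the write sparse (one slot per frame) while the straight-through Gumbel-Softmax estimator lets the selection policy be trained end to end with the rest of the network.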