We propose a novel visual memory network architecture for learning and inference in the spatio-temporal domain. Unlike popular transformers, our memory network maintains a fixed set of memory slots, and we explore designs for writing new information into the memory, combining information across memory slots, and deciding when to discard old slots. We benchmark the architecture on video object segmentation and video prediction. The experiments show that it achieves results competitive with the state of the art while maintaining constant memory capacity.
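To make the three memory operations concrete (write, combine, discard), the following is a minimal sketch and not the architecture evaluated in the paper. Everything in it is an assumption chosen for illustration: the class name `FixedSlotMemory`, cosine similarity as the matching score, averaging as the combination rule, a fixed `merge_threshold`, and least-recently-written eviction. The actual designs explored in this work may be learned and may differ substantially.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FixedSlotMemory:
    """Hypothetical memory with a hard cap on the number of slots."""

    def __init__(self, num_slots, merge_threshold=0.9):
        self.num_slots = num_slots              # constant capacity
        self.merge_threshold = merge_threshold  # assumed hand-set value
        self.slots = []                         # list of (vector, last_written_step)
        self.step = 0

    def write(self, feature):
        """Insert a new feature, combining or discarding so capacity is never exceeded."""
        self.step += 1
        feature = [float(x) for x in feature]

        # Combine: if an existing slot is similar enough, average the feature into it.
        if self.slots:
            best = max(range(len(self.slots)),
                       key=lambda i: cosine(self.slots[i][0], feature))
            vec, _ = self.slots[best]
            if cosine(vec, feature) >= self.merge_threshold:
                merged = [(x + y) / 2 for x, y in zip(vec, feature)]
                self.slots[best] = (merged, self.step)
                return

        # Discard: if the memory is full, drop the least recently written slot.
        if len(self.slots) >= self.num_slots:
            oldest = min(range(len(self.slots)), key=lambda i: self.slots[i][1])
            self.slots.pop(oldest)

        self.slots.append((feature, self.step))

    def read(self, query):
        """Softmax-weighted average of slot vectors, scored by cosine similarity."""
        if not self.slots:
            return None
        scores = [cosine(vec, query) for vec, _ in self.slots]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.slots[0][0])
        return [sum(w * vec[d] for w, (vec, _) in zip(weights, self.slots)) / total
                for d in range(dim)]


memory = FixedSlotMemory(num_slots=2)
for frame_feature in ([1, 0], [0, 1], [1, 0.01], [-1, -1]):
    memory.write(frame_feature)
print(len(memory.slots))  # never exceeds num_slots
```

In this toy run the third feature is averaged into the first slot and the fourth evicts the least recently written slot, so the number of stored vectors stays at two regardless of how many frames are written. This is the constant-capacity property the abstract contrasts with transformers.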