We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read, and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation is processed only together with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms alternatives such as Transformer models designed for long sequences and recurrent neural networks on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning. Code is publicly available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing
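The read-process-write loop described above can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the learned token-summarization module and the Transformer controller are replaced by stand-ins (random importance weights and an identity map), and all sizes are hypothetical. It only illustrates why the per-step cost stays bounded: each step touches a fixed number of memory and observation tokens, regardless of how many frames have been seen.

```python
import numpy as np

def summarize(tokens, k, rng):
    """Pool n tokens down to k summary tokens via importance weights.
    In TTM these weights come from a learned module; random logits
    stand in for it here."""
    n, _ = tokens.shape
    logits = rng.standard_normal((k, n))                     # hypothetical importance logits
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return weights @ tokens                                  # (k, d) summary tokens

def ttm_step(memory, observation, mem_size, read_size, rng):
    """One TTM step: read from memory + new observation, process, write back."""
    # Read: compress memory and observation tokens into a small working set.
    read = summarize(np.concatenate([memory, observation]), read_size, rng)
    # Process: stand-in for the Transformer controller (identity here).
    processed = read
    # Write: form the new memory from memory, observation, and processed tokens.
    new_memory = summarize(
        np.concatenate([memory, observation, processed]), mem_size, rng)
    return new_memory, processed

rng = np.random.default_rng(0)
d, mem_size, read_size, obs_tokens = 8, 16, 4, 32            # toy dimensions
memory = np.zeros((mem_size, d))
for _ in range(100):                                         # cost per step is constant
    obs = rng.standard_normal((obs_tokens, d))               # e.g. tokens of one frame
    memory, out = ttm_step(memory, obs, mem_size, read_size, rng)
print(memory.shape, out.shape)
```

Because the memory is a fixed set of `mem_size` tokens, the controller at every step attends over at most `mem_size + obs_tokens` inputs, never the full history.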