The ability to carve the world into useful abstractions in order to reason about time and space is a crucial component of intelligence. To perceive and act effectively, we must parse and compress large amounts of sensory information so that downstream reasoning can take place and increasingly complex concepts can emerge. If representation learning methods are to scale to real-world scenes and temporal dynamics, there must be a way to learn accurate, concise, and composable abstractions across time. We present the Slot Transformer, an architecture that leverages slot attention, transformers, and iterative variational inference on video scene data to infer such representations. We evaluate the Slot Transformer on the CLEVRER, Kinetics-600 and CATER datasets and demonstrate that it supports robust modeling and reasoning about complex behaviours, achieving scores that compare favourably to existing baselines. Finally, we evaluate the effectiveness of key components of the architecture, the model's representational capacity, and its ability to predict from incomplete input.