Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object encodings, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal-ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers an encoding that will facilitate effective planning and reasoning on conceptual levels.
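The slot-wise separation of 'what' and 'where' described above can be illustrated with a toy data structure. This is only a minimal sketch under stated assumptions, not Loci's actual implementation: the names `Slot`, `gestalt`, `position`, and `move` are hypothetical, and the codes are hand-set tuples rather than learned latent vectors. It shows the one property the abstract emphasizes, namely that object motion can be expressed by updating the location code while the Gestalt code stays unchanged.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    """One object slot holding separate 'what' and 'where' codes (hypothetical layout)."""
    gestalt: Tuple[float, ...]      # 'what': assumed appearance/shape code
    position: Tuple[float, float]   # 'where': assumed (x, y) location code


def move(slot: Slot, dx: float, dy: float) -> Slot:
    """Shift only the 'where' code; the 'what' code is carried over unchanged."""
    x, y = slot.position
    return replace(slot, position=(x + dx, y + dy))


# Two toy slots, each standing in for one tracked object.
slots = [
    Slot(gestalt=(0.9, 0.1), position=(2.0, 3.0)),
    Slot(gestalt=(0.2, 0.7), position=(5.0, 1.0)),
]

# Moving every object changes locations but leaves identities intact.
moved = [move(s, 1.0, -1.0) for s in slots]
```

In this toy setting, identity tracking reduces to comparing `gestalt` fields across time steps, while dynamics act only on `position`. A learned system would have to discover such a factorization from video rather than receive it by construction.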