Humans continue to vastly outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Attention and memory are two systems known to play a critical role in our ability to selectively maintain and manipulate behaviorally relevant visual information to solve some of the most challenging visual reasoning tasks. Here, we present the Memory- and Attention-based (visual) REasOning (MAREO) architecture, a novel architecture for visual reasoning inspired by the cognitive-science literature on visual reasoning. MAREO instantiates an active-vision theory, which posits that the brain solves complex visual reasoning problems compositionally by learning to combine previously learned elementary visual operations into more complex visual routines. MAREO learns to solve visual reasoning tasks via sequences of attention shifts that route task-relevant visual information into a memory bank, and maintain it there, by means of a multi-head transformer module. Visual routines are then deployed by a dedicated reasoning module trained to judge various relations between objects in the scenes. Experiments on four types of reasoning tasks demonstrate MAREO's ability to learn visual routines in a robust and sample-efficient manner.
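To make the attend-then-store-then-reason loop concrete, the following is a minimal illustrative sketch and not the MAREO implementation. It assumes a single attention head with hand-set queries (MAREO uses a learned multi-head transformer module), a scene given as hypothetical position keys and object-feature values, and a toy same/different judgment standing in for the trained reasoning module. All function names and the similarity threshold are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention over scene locations.

    Returns the attention-weighted read-out and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights

def run_routine(keys, values, queries):
    """One attention shift per query; each glimpse is written to the memory bank."""
    memory = []
    for query in queries:
        glimpse, _ = attend(query, keys, values)
        memory.append(glimpse)
    return memory

def same_different(memory, threshold=0.9):
    """Toy reasoning step (assumption): cosine similarity of the first two memory slots."""
    a, b = memory[0], memory[1]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) >= threshold

# Hypothetical scene: three locations (one-hot position keys) holding
# two identical objects and one different object (feature values).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hand-set queries that shift attention sharply to a single location each.
look_at = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]

same_pair = same_different(run_routine(keys, values, [look_at[0], look_at[1]]))
diff_pair = same_different(run_routine(keys, values, [look_at[0], look_at[2]]))
```

In this sketch, `same_pair` compares the two identical objects and `diff_pair` compares two different ones. In a learned system the queries and the relation judgment would be trained end to end rather than fixed by hand.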