Iterative Visual Reasoning Beyond Convolutions

We present a novel framework for iterative visual reasoning. Our framework goes beyond current recognition systems, which lack the capability to reason beyond a stack of convolutions. The framework consists of two core modules: a local module that uses spatial memory to store previous beliefs with parallel updates, and a global graph-reasoning module. The graph module has three components: a) a knowledge graph, in which classes are nodes and edges encode different types of semantic relationships between them; b) a region graph of the current image, in which regions are nodes and the spatial relationships between regions are edges; c) an assignment graph that assigns regions to classes. Both the local module and the global module roll out iteratively and cross-feed predictions to each other to refine their estimates. The final predictions are made by combining the best of both modules with an attention mechanism. We show strong performance over plain ConvNets, e.g., an 8.4% absolute improvement on ADE as measured by per-class average precision. Analysis also shows that the framework is resilient to missing regions during reasoning.
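The attention-based combination described above can be illustrated with a minimal sketch. It assumes that each module at each iteration produces a vector of class logits for a region, together with one unnormalized attention score, and that the final prediction is the softmax-weighted sum of those logits. The abstract does not specify the exact formulation, so the function names `softmax` and `combine_predictions`, the score inputs, and the toy numbers are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_predictions(logits_per_source, attention_scores):
    """Fuse class logits from several sources for a single region.

    logits_per_source: one list of class logits per (module, iteration) pair.
    attention_scores: one unnormalized scalar per source. Hypothetical input:
        in a trained model these would be produced by the network itself.
    Returns the attention-weighted sum of the logits.
    """
    if len(logits_per_source) != len(attention_scores):
        raise ValueError("need exactly one attention score per source")
    num_classes = len(logits_per_source[0])
    weights = softmax(attention_scores)
    fused = [0.0] * num_classes
    for weight, logits in zip(weights, logits_per_source):
        if len(logits) != num_classes:
            raise ValueError("all sources must score the same classes")
        for c, value in enumerate(logits):
            fused[c] += weight * value
    return fused


# Toy example (made-up numbers): one prediction from the local module and
# one from the global module, over three classes, with equal attention.
local_logits = [2.0, 0.0, 0.0]
global_logits = [0.0, 4.0, 0.0]
fused = combine_predictions([local_logits, global_logits], [0.0, 0.0])
```

With equal attention scores the fused logits are the plain average of the two sources; raising one score shifts the result toward that source, which is one way a model could favor whichever module is more reliable for a given region.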
