Referring Expression Comprehension (REC) is one of the most important tasks in visual reasoning: it requires a model to detect the target object referred to by a natural language expression. Among the proposed pipelines, one-stage Referring Expression Comprehension (OSREC) has become the dominant trend because it merges the region proposal and selection stages. Many state-of-the-art OSREC models adopt a multi-hop reasoning strategy, because a single expression frequently mentions a sequence of objects, and analyzing the semantic relations among them requires multiple reasoning hops. However, one unsolved issue with these models is that the number of reasoning steps must be pre-defined and fixed before inference, which ignores the varying complexity of expressions. In this paper, we propose a Dynamic Multi-step Reasoning Network, which allows the number of reasoning steps to be adjusted dynamically based on the reasoning state and the expression complexity. Specifically, we adopt a Transformer module to store and process the reasoning state, and a Reinforcement Learning strategy to dynamically infer the number of reasoning steps. Our method achieves state-of-the-art performance or significant improvements on several REC datasets, ranging from RefCOCO (+, g), which contain short expressions, to Ref-Reasoning, a dataset with long and complex compositional expressions.
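The idea of a variable number of reasoning steps can be illustrated with a minimal sketch. This is not the authors' implementation: the names `dynamic_reasoning`, `update`, and `should_stop` are hypothetical, the reasoning state is reduced to a single confidence value, and a hand-written threshold rule stands in for the stopping policy that the paper learns with Reinforcement Learning.

```python
def dynamic_reasoning(state, update, should_stop, max_steps=10):
    """Apply `update` repeatedly until `should_stop` fires or `max_steps` is hit.

    Hypothetical illustration only: in the paper, the state is handled by a
    Transformer module and the stop decision is learned with reinforcement
    learning. Here both are plain Python callables.
    """
    steps = 0
    while steps < max_steps:
        state = update(state)          # one reasoning hop
        steps += 1
        if should_stop(state, steps):  # stop decision depends on the current state
            break
    return state, steps


# Toy stand-ins (assumptions): the state is a confidence in [0, 1],
# each hop closes half of the remaining gap to 1.0.
def toy_update(confidence):
    return confidence + 0.5 * (1.0 - confidence)


def toy_should_stop(confidence, step):
    return confidence >= 0.85


# A "hard" input starts with low confidence and needs more hops
# than an "easy" input that starts close to the threshold.
_, hard_steps = dynamic_reasoning(0.0, toy_update, toy_should_stop)
_, easy_steps = dynamic_reasoning(0.8, toy_update, toy_should_stop)
```

With a fixed-step design, both inputs would run for the same number of hops. Here the loop exits as soon as the stop rule is satisfied, and `max_steps` bounds the cost when it never is.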