We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, such as "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g., open refrigerators), and plan a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), which consists of a factorized set of controllers and allows the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single-controller-based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98.