In this paper we propose a new framework, MoViLan (Modular Vision and Language), for the execution of visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets has revealed a gap in developing comprehensive techniques for long-horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to the combined navigation and object interaction problem that does not require strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language data sets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates for long-horizon, compositional tasks over the baseline on the recently released benchmark data set, ALFRED.