Natural language instruction following tasks serve as a valuable testbed for grounded language and robotics research. However, data collection for these tasks is expensive, and end-to-end approaches suffer from data inefficiency. We propose structuring language, action, and vision tasks into separate modules that can be trained independently. A Language, Action, and Vision (LAV) framework removes the dependence of the action and vision modules on instruction-following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
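To make the modular decomposition concrete, below is a minimal sketch of how independently trained language, action, and vision modules might be composed at inference time. The interfaces, class names, and the `Subgoal` representation are all hypothetical illustrations; the abstract does not specify LAV's actual APIs, and the stub bodies stand in for trained models.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical intermediate representation connecting the modules;
# the real LAV interface is not specified in the abstract.
@dataclass
class Subgoal:
    action: str   # e.g. "PickupObject"
    target: str   # e.g. "apple"

class LanguageModule:
    """Maps an instruction to symbolic subgoals. This is the only
    module that would need instruction-following data to train."""
    def parse(self, instruction: str) -> List[Subgoal]:
        # Stub: a trained semantic parser would go here.
        return [Subgoal("PickupObject", "apple")]

class VisionModule:
    """Grounds target names in the current observation. Could be
    trained on generic vision data, independent of instructions."""
    def locate(self, frame: Optional[object], target: str) -> Tuple[int, int]:
        # Stub: a trained detector would return the target's position.
        return (0, 0)

class ActionModule:
    """Executes a subgoal given a grounded target location. Could be
    trained in simulation without paired language annotations."""
    def execute(self, subgoal: Subgoal, location: Tuple[int, int]) -> str:
        return f"{subgoal.action}({subgoal.target}) at {location}"

def run_agent(instruction: str, frame: Optional[object] = None) -> None:
    # Compose the three independently trained modules in a pipeline.
    lang, vision, action = LanguageModule(), VisionModule(), ActionModule()
    for sg in lang.parse(instruction):
        loc = vision.locate(frame, sg.target)
        print(action.execute(sg, loc))

run_agent("Pick up the apple on the counter.")
```

Under this decomposition, only the language module consumes expensive instruction-following annotations; the vision and action modules can draw on cheaper, task-agnostic data, which is the data-efficiency argument the abstract makes.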