Embodied Instruction Following (EIF) studies how mobile manipulator robots should be controlled to accomplish long-horizon tasks specified by natural language instructions. While most research on EIF is conducted in simulators, the ultimate goal of the field is to deploy the agents in real life. As such, it is important to minimize the data cost required for training an agent, to ease the transition from sim to real. However, many studies focus only on performance and overlook data cost -- modules that require separate training on extra data are often introduced without consideration of deployability. In this work, we propose FILM++, which extends the existing work FILM with modifications that do not require extra data. While all data-driven modules are kept constant, FILM++ more than doubles FILM's performance. Furthermore, we propose Prompter, which replaces FILM++'s semantic search module with language model prompting. Unlike FILM++'s implementation, which requires training on extra sets of data, our prompting-based implementation needs no training while achieving better or at least comparable performance. Prompter achieves 42.64% and 45.72% on the ALFRED benchmark with high-level instructions only and with step-by-step instructions, respectively, outperforming the previous state of the art by 6.57% and 10.31%.
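To illustrate the idea of replacing a trained semantic search module with language model prompting, below is a minimal sketch: a pretrained masked language model scores candidate receptacles as likely locations for a target object, and those scores can serve as a spatial prior for search. The prompt wording, model choice (bert-base-uncased), and the helper name receptacle_prior are illustrative assumptions, not the exact setup used by Prompter.

```python
# Hypothetical sketch: use a masked language model to score where a target
# object is likely to be found, instead of a semantic search module trained
# on extra data. Prompt template and model are illustrative assumptions.
from transformers import pipeline

# Fill-mask pipeline with an off-the-shelf pretrained model (no extra training).
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def receptacle_prior(target_object, candidate_receptacles):
    """Return an (unnormalized) likelihood score for each candidate receptacle."""
    prompt = f"You usually find a {target_object} in or on a [MASK]."
    # Restrict [MASK] predictions to the candidate receptacle names and read off
    # the model's token probabilities as a commonsense location prior.
    results = unmasker(prompt, targets=candidate_receptacles)
    return {r["token_str"]: r["score"] for r in results}

if __name__ == "__main__":
    # Example: rank a few candidate receptacles for finding a mug.
    print(receptacle_prior("mug", ["cabinet", "fridge", "sofa", "sink"]))
```

In a downstream agent, such scores could be combined with the semantic map to decide which region to explore next; the key point from the abstract is that this prior comes from a frozen language model rather than a module trained on additional data.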