The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents with the ability to ask questions to actively resolve ambiguities that arise naturally in embodied environments. To empower embodied agents to interact with humans, in this work we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask in order to dynamically acquire additional information for completing the task. We evaluate our model on the TEACh vision-dialog navigation and task completion dataset. Experimental results show that ELBA achieves improved task performance compared to baseline models without question-answering capabilities.