Recent advances in multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey efforts have characterized one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, than on also illustrating the high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.