Title: Core Challenges in Embodied Vision-Language Planning

From arXiv. Extended abstract accepted to the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023), special journal track for authors of published JAIR 2022 and AIJ 2022 papers. 6 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2106.13948

Recent advances in the areas of Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.

