Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autonomously navigate neighborhood streets. In response to these developments, new paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning. These paradigms leverage the existence of ever-larger datasets curated for multimodal, multitask, and generalist interaction. Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems that can interact effectively across a diverse range of applications such as dialogue, autonomous driving, healthcare, education, and robotics. In this manuscript, we examine the scope of foundation models for decision making, and provide conceptual tools and technical background for understanding the problem space and exploring new research directions. We review recent approaches that ground foundation models in practical decision making applications through a variety of methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, and discuss common challenges and open problems in the field.
翻译:根据各种规模的数据预先培训的基础模型在各种视野和语言任务中表现出了非凡的能力。当这些模型在现实世界环境中部署时,它们不可避免地与其他实体和代理人相互作用。例如,语言模型常常被用来通过对话与人类互动,视觉认知模型被用来自主地巡视街道。为了应对这些发展,正在出现新的模型,用于培训基础模型,以便与其他代理人互动并进行长期推理。这些范例利用了为多式联运、多任务和一般主义互动而构建的日益扩大的数据集的存在。基础模型和决策的交叉点研究为创建强大的新系统提供了巨大的希望,这些系统能够有效地在诸如对话、自主驾驶、保健、教育和机器人等各种应用中进行互动。在本稿中,我们审查了决策基础模型的范围,并为理解问题空间和探索新的研究方向提供了概念工具和技术背景。我们审查了最近采用的方法,即通过各种方法,例如促进、有条件的配比模型、规划、优化控制、强化学习等,在实际决策应用中建立基础模型,并讨论共同的挑战和开放的问题。</s>