Large Language Models (LLMs) have shown remarkable success, and their multimodal extensions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. Despite this shift, however, prompt optimization approaches, which are designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by pairs of textual and non-textual prompts. To tackle this problem, we propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection of candidate prompts by leveraging earlier evaluations as priors in a Bayesian selection strategy. Through extensive experiments across diverse modalities beyond text, including images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step toward realizing the full potential of MLLMs.
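The abstract describes the selection component only at a high level; as a minimal sketch of how a Bayesian selection strategy can use earlier evaluations as priors, the following Python example applies Beta-Bernoulli Thompson sampling over candidate prompts. All names here (thompson_select, evaluate, prior_means, prior_strength) are hypothetical illustrations, not the paper's actual implementation.

    import random

    def thompson_select(candidates, prior_means, evaluate, rounds=20, prior_strength=4.0):
        """Pick a prompt candidate via Thompson sampling with informed Beta priors.

        candidates:   list of (textual_prompt, non_textual_prompt) pairs
        prior_means:  per-candidate scores in [0, 1] from earlier evaluations
        evaluate:     fn(candidate) -> 1.0/0.0 success on a sampled dev example
        """
        # Initialize Beta(alpha, beta) from each prior mean m, with pseudo-count
        # weight prior_strength controlling how much earlier evaluations count.
        alpha = [prior_strength * m + 1.0 for m in prior_means]
        beta = [prior_strength * (1.0 - m) + 1.0 for m in prior_means]

        for _ in range(rounds):
            # Sample a plausible success rate for each candidate, then spend
            # the next evaluation on the most promising one.
            samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
            i = max(range(len(candidates)), key=samples.__getitem__)
            reward = evaluate(candidates[i])  # 1.0 if the prompt solved the example
            alpha[i] += reward
            beta[i] += 1.0 - reward

        # Return the candidate with the highest posterior mean success rate.
        return max(range(len(candidates)), key=lambda i: alpha[i] / (alpha[i] + beta[i]))

Under this assumed scheme, candidates that scored well in earlier rounds start with favorable priors and so receive more of the evaluation budget, which matches the abstract's idea of reusing earlier evaluations to guide candidate selection.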