Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
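The abstract reports a relative score against GPT-4 but does not say how it is computed. One plausible reading, assumed here purely for illustration, is that a judge rates each system's answer to every question, and the relative score is the model's total rating as a percentage of the reference system's total. The function name `relative_score` and the ratings below are hypothetical and are not taken from the paper.

```python
def relative_score(model_ratings, reference_ratings):
    """Return the model's total rating as a percentage of the reference total.

    Assumption: each list holds one judge rating per question, in the same
    question order, for the model and for the reference system respectively.
    """
    if len(model_ratings) != len(reference_ratings):
        raise ValueError("each question needs one rating per system")
    reference_total = sum(reference_ratings)
    if reference_total <= 0:
        raise ValueError("reference ratings must sum to a positive value")
    return 100.0 * sum(model_ratings) / reference_total


# Made-up ratings on a 1-10 scale for three questions (illustrative only).
model_ratings = [7, 6, 8]
reference_ratings = [10, 8, 10]
print(f"{relative_score(model_ratings, reference_ratings):.1f}%")  # prints 75.0%
```

Under this reading, a score of 100% would mean the judge rated the model's answers as highly, in total, as the reference answers.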

Related content

GPT-4

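In the early hours of March 15, 2023 (Beijing time), OpenAI, the developer of ChatGPT, released GPT-4, a new multimodal pretrained large model. It is more reliable and more creative, can handle more nuanced instructions, and can generate content from both image and text prompts. Specifically, compared with the previous generation of models, GPT-4 marks a major leap: it accepts image and text input and has strong image-recognition capability; the text input limit is greatly raised, so that in ChatGPT mode GPT-4 can process more than 25,000 words of text and handle more detailed instructions; and answer accuracy has also improved significantly.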
