Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and code base publicly available.
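To make the "connects a vision encoder and an LLM" phrasing concrete, below is a minimal sketch of the general pattern: visual features from a frozen image encoder are projected into the language model's embedding space and prepended to the text tokens. The module names, dimensions, and the choice of a single linear projection are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MinimalVisionLanguageConnector(nn.Module):
    """Illustrative sketch (assumed design, not the paper's exact code):
    project vision-encoder features into the LLM embedding space so that
    image tokens can be fed to the LLM alongside text tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer maps visual features to the LLM's embedding width.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a vision encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding layer
        visual_tokens = self.projection(image_features)
        # Concatenate projected visual tokens before the text tokens to form the LLM input.
        return torch.cat([visual_tokens, text_embeddings], dim=1)


# Usage with random tensors standing in for real encoder / LLM-embedding outputs.
connector = MinimalVisionLanguageConnector()
image_features = torch.randn(1, 256, 1024)   # placeholder visual features
text_embeddings = torch.randn(1, 32, 4096)   # placeholder text embeddings
llm_inputs = connector(image_features, text_embeddings)
print(llm_inputs.shape)  # torch.Size([1, 288, 4096])
```

In this sketch, only the projection would need to be trained end-to-end with (or before) the LLM; the vision encoder can stay frozen, which keeps the multimodal adaptation lightweight.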