The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation: they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. We classify VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is based on the models' respective capabilities in processing and generating the various modalities of data. For each model, we provide an extensive analysis of its foundational architecture, training data sources, and, wherever possible, its strengths and limitations, offering readers a comprehensive understanding of its essential components. We also analyze the performance of VLMs on various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.