While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. In particular, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), conducted on three existing sketch datasets (QuickDraw!, Sketchy, and TU-Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.