Vision-language models (VLMs) possess rich knowledge but often fail on hierarchical understanding tasks, where the goal is to predict a coarse-to-fine taxonomy path that remains consistent across all levels. We compare three inference paradigms for hierarchical VQA and find that stepwise reasoning, when conditioned on prior answers, significantly outperforms single-pass prompting. Further analysis indicates that the main limitation of current VLMs is their inability to maintain cross-level state, rather than a lack of taxonomic knowledge. Motivated by this diagnosis, we propose Self-Elicited Knowledge Distillation (SEKD), which requires no human labels or external tools: the same VLM is prompted to reason step by step and act as a teacher, exposing its hard labels, soft distributions, and decoder hidden states, while a single-pass student distills these signals. The student VLM remains efficient while approaching the accuracy of its multi-step teacher: SEKD improves in-domain path consistency (HCA) by up to +29.50 percentage points, raises zero-shot HCA on an unseen taxonomy from 4.15% to 42.26%, and yields gains on challenging mathematical benchmarks. Because all supervision is self-elicited, SEKD scales to new taxonomies and datasets without annotation cost, providing a practical route to imbue compact VLMs with dependency-aware multi-step reasoning.
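The abstract names the three self-elicited teacher signals (hard labels, soft distributions, decoder hidden states) but does not state the training objective. The snippet below is a minimal sketch, assuming a standard weighted combination of hard-label cross-entropy, temperature-scaled KL on the soft distributions, and MSE on decoder hidden states; the function name, loss weights, and temperature are illustrative assumptions rather than the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def sekd_distillation_loss(
    student_logits,       # [B, T, V] token logits from the single-pass student
    teacher_logits,       # [B, T, V] token logits from the stepwise (teacher) pass
    teacher_hard_labels,  # [B, T]    teacher's predicted tokens (hard labels)
    student_hidden,       # [B, T, H] student decoder hidden states
    teacher_hidden,       # [B, T, H] teacher decoder hidden states
    tau=2.0,              # softening temperature (assumed value)
    w_hard=1.0, w_soft=1.0, w_hid=0.5,  # loss weights (assumed values)
):
    """Sketch of a self-elicited distillation objective combining the
    three teacher signals described in the abstract."""
    vocab = student_logits.size(-1)
    s_logits = student_logits.reshape(-1, vocab)
    t_logits = teacher_logits.reshape(-1, vocab)

    # Hard-label term: cross-entropy against the teacher's own predictions.
    hard = F.cross_entropy(s_logits, teacher_hard_labels.reshape(-1))

    # Soft-distribution term: temperature-scaled KL to the teacher distribution.
    soft = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    # Hidden-state term: align decoder states (assumes matching dimensions;
    # a learned projection would be needed if they differ).
    hid = F.mse_loss(student_hidden, teacher_hidden)

    return w_hard * hard + w_soft * soft + w_hid * hid
```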