Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp and LoRA offer only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image's visual anchor, effectively bypassing the vocabulary limitation. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm that incorporates semantic regularization to preserve plausibility and compatibility with the model's prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.
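To make the inference-time procedure concrete, the following is a minimal sketch of pseudo-word optimization, assuming a CLIP-style VLM with frozen image and text encoders. The callable `encode_text_with_pseudo_word` and the nearest-token form of the semantic regularizer are illustrative assumptions, not the paper's exact API or loss; the sketch only shows the general pattern of optimizing a continuous embedding against a visual anchor.

```python
# Illustrative sketch of GTMA-style pseudo-word optimization (assumed form,
# not the paper's exact algorithm). Assumes a CLIP-style VLM whose encoders
# stay frozen; only the pseudo-word embedding receives gradients.
import torch
import torch.nn.functional as F

def optimize_pseudo_word(image_feat, encode_text_with_pseudo_word,
                         vocab_embeddings, steps=100, lr=5e-2, reg_weight=0.1):
    """Fit a continuous pseudo-word embedding whose text feature aligns with
    the visual anchor of an OOD image.

    image_feat: (1, d) L2-normalized feature from the frozen visual encoder.
    encode_text_with_pseudo_word: hypothetical callable that splices the
        pseudo-word into a prompt template and returns a (1, d) text feature
        from the frozen text encoder.
    vocab_embeddings: (V, e) token-embedding table of the text encoder.
    """
    # Initialize near the vocabulary mean so the starting point is on-manifold.
    pseudo = vocab_embeddings.mean(dim=0, keepdim=True).clone().requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)

    for _ in range(steps):
        text_feat = encode_text_with_pseudo_word(pseudo)            # (1, d)
        align = F.cosine_similarity(text_feat, image_feat).mean()   # cross-modal alignment
        # Semantic regularizer (assumed form): pull the pseudo-word toward its
        # nearest real token embedding so it stays plausible under the model's
        # prior knowledge.
        reg = torch.cdist(pseudo, vocab_embeddings).min(dim=1).values.mean()
        loss = -align + reg_weight * reg
        opt.zero_grad()
        loss.backward()
        opt.step()

    return pseudo.detach()
```

Because both encoders remain frozen and only the pseudo-word is updated, this kind of procedure leaves the model's in-distribution behavior untouched, consistent with the abstract's claim that performance on seen concepts is maintained.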