Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises growing concerns about security vulnerabilities. Existing backdoor attacks against VLMs rely primarily on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, such triggers limit stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, so the model behaves normally in all other cases but consistently injects malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, then discards the CBM branch at inference time, leaving the VLM architecture unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates with only moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
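The CTP selection rule described above can be illustrated with a minimal sketch. All names here (`concept_score`, the threshold, the payload function) are hypothetical assumptions for exposition, not the paper's actual implementation; any real attack would score concepts with a learned detector rather than a toy predicate.

```python
# Hypothetical sketch of Concept-Thresholding Poisoning (CTP).
# Assumption: concept_score(image, concept) -> float in [0, 1] is some
# concept detector; the names and threshold below are illustrative only.

def poison_dataset(samples, concept_score, target_concept,
                   threshold, inject_payload):
    """Poison only samples whose target-concept score exceeds a threshold.

    samples: iterable of (image, caption) pairs.
    inject_payload: function mapping a clean caption to a malicious one.
    Samples without the trigger concept are left untouched, so clean-task
    behavior is preserved on concept-free inputs.
    """
    poisoned = []
    for image, caption in samples:
        if concept_score(image, target_concept) >= threshold:
            # Trigger concept present: attach the malicious output.
            poisoned.append((image, inject_payload(caption)))
        else:
            # No trigger concept: keep the sample clean.
            poisoned.append((image, caption))
    return poisoned
```

Under this sketch, the backdoor is tied to image semantics rather than any pixel pattern, which is what makes pixel-space defenses ineffective against it.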