Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on GitHub: https://github.com/sarendis56/Jailbreak_Detection_RCS.
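To make the idea of a contrastive score concrete, the sketch below shows one plausible reading of it. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the score is taken to be "distance to the benign reference set minus distance to the malicious reference set", the function names (`fit_gaussian`, `mcd_score`, `kcd_score`) are hypothetical, the learned projection is omitted, and random vectors stand in for LVLM hidden states. It is not the authors' implementation; see the linked repository for that.

```python
import numpy as np


def fit_gaussian(feats, eps=1e-3):
    """Estimate mean and (regularized) precision matrix of a feature set."""
    mean = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_sq(x, mean, prec):
    """Squared Mahalanobis distance of each row of x to a Gaussian."""
    d = x - mean
    return np.einsum("ij,jk,ik->i", d, prec, d)


def mcd_score(x, benign_stats, malicious_stats):
    """Assumed Mahalanobis-style contrastive score.

    Positive when x is closer to the malicious distribution than to the
    benign one; a one-class detector would use only the first term.
    """
    return mahalanobis_sq(x, *benign_stats) - mahalanobis_sq(x, *malicious_stats)


def knn_dist(x, ref, k):
    """Mean Euclidean distance from each row of x to its k nearest rows of ref."""
    d = np.linalg.norm(x[:, None, :] - ref[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)


def kcd_score(x, benign, malicious, k=5):
    """Assumed k-nearest-neighbour contrastive score (same sign convention)."""
    return knn_dist(x, benign, k) - knn_dist(x, malicious, k)


if __name__ == "__main__":
    # Synthetic stand-ins for hidden-state features (not real LVLM data).
    rng = np.random.default_rng(0)
    benign = rng.normal(0.0, 1.0, size=(200, 8))
    malicious = rng.normal(3.0, 1.0, size=(200, 8))
    queries = np.vstack([np.zeros(8), np.full(8, 3.0)])

    print(mcd_score(queries, fit_gaussian(benign), fit_gaussian(malicious)))
    print(kcd_score(queries, benign, malicious, k=5))
```

Under this reading, a novel benign input that lies far from both reference sets yields two large distances that largely cancel, whereas a one-class score would flag it on the benign distance alone. This is the intuition the abstract gives for reduced over-rejection.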
