Multi-modal large language models (MLLMs), capable of processing text, images, and audio, have been widely adopted in AI applications. However, recent MLLMs that integrate images and text remain highly vulnerable to coordinated jailbreaks. Existing defenses focus primarily on the text modality and lack robust multi-modal protection; as a result, studies indicate that MLLMs are more susceptible to malicious or unsafe instructions than their text-only counterparts. In this paper, we propose DefenSee, a robust, lightweight, black-box multi-modal defense that leverages image-variant transcription and cross-modal consistency checks, mimicking human judgment. Experiments on popular multi-modal jailbreak and benign datasets show that DefenSee consistently improves MLLM robustness while preserving performance on benign tasks better than state-of-the-art (SOTA) defenses. On the MM-SafetyBench benchmark, it reduces the attack success rate (ASR) of jailbreak attacks on MiniGPT4 to below 1.70%, significantly outperforming prior methods under the same conditions.
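To make the described mechanism concrete, the sketch below illustrates one plausible shape of such a pipeline in Python: transcribe several perturbed variants of the input image, then query the model for cross-modal consistency between the transcriptions and the text prompt. The specific transforms and the `mllm.caption` / `mllm.judge` calls are hypothetical placeholders for black-box model queries, not the paper's actual interface.

```python
# A minimal, hypothetical sketch of a DefenSee-style check, assuming a
# black-box MLLM wrapper exposing `caption` and `judge` methods (these
# names are illustrative, not the paper's API).

from PIL import Image, ImageFilter

def image_variants(img: Image.Image) -> list[Image.Image]:
    """Cheap image perturbations; the paper's exact transforms may differ."""
    w, h = img.size
    return [
        img,                                             # original
        img.filter(ImageFilter.GaussianBlur(radius=2)),  # blurred
        img.resize((max(w // 2, 1), max(h // 2, 1))),    # downscaled
    ]

def defensee_check(img: Image.Image, prompt: str, mllm) -> bool:
    """Return True when the multi-modal query looks safe to answer."""
    # Step 1: transcribe each image variant into text via the black-box MLLM.
    captions = [mllm.caption(v) for v in image_variants(img)]
    # Step 2: cross-modal consistency check, mimicking human judgment:
    # does the text request agree with what the image variants show,
    # and is the combined request benign?
    verdict = mllm.judge(prompt=prompt, captions=captions)
    return verdict == "safe"
```

Because the check only issues ordinary captioning and judgment queries, it requires no access to model weights, which is consistent with the black-box setting claimed in the abstract.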