Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision-Language Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, an architecture-agnostic reinforcement learning (RL) framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B) and VLMs such as LLaVA 1.6, evaluated on benchmarks including MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM can identify critical bits that are vulnerable to BFAs up to 2.5x faster than state-of-the-art methods. We demonstrate that flipping as few as 5 and 7 FlipLLM-identified bits causes the accuracy of LLaMA 3.1 8B to plummet from 69.9% to ~0.2% and LLaVA's VQA score to drop from 78% to nearly 0%, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
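As a rough illustration of the sequential decision-making formulation described above, the sketch below shows a tabular Q-learning loop that assembles a small bit-flip set from candidate locations surviving a sensitivity-based pruning step. All names (candidate_bits, accuracy_drop, the reward stub, and hyperparameters) are hypothetical placeholders for exposition, not FlipLLM's actual implementation.

```python
# Illustrative sketch only: tabular Q-learning over bit-flip actions.
# The reward function is a stand-in; a real run would re-evaluate the model
# (e.g., MMLU accuracy) with the selected bits flipped.
import random
from collections import defaultdict

# Hypothetical candidate set: bit locations surviving sensitivity-guided layer
# pruning, encoded as (layer_index, parameter_index, bit_position) tuples.
candidate_bits = [(l, p, b) for l in range(2) for p in range(4) for b in (30, 31)]

def accuracy_drop(flipped):
    """Placeholder reward: fraction of accuracy lost after flipping `flipped`."""
    return min(1.0, 0.1 * len(flipped) + random.random() * 0.05)

def run_q_learning(episodes=200, budget=5, alpha=0.5, gamma=0.9, eps=0.2):
    # Q-table keyed by (frozenset of already-flipped bits, next bit to flip).
    Q = defaultdict(float)
    best_score, best_set = 0.0, None
    for _ in range(episodes):
        state = frozenset()
        for _ in range(budget):  # sequential decision: pick one bit per step
            actions = [a for a in candidate_bits if a not in state]
            if random.random() < eps:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])   # exploit
            next_state = state | {action}
            reward = accuracy_drop(next_state)  # impact of the current flip set
            # One-step Q-learning update.
            future = max((Q[(next_state, a)] for a in candidate_bits
                          if a not in next_state), default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
            state = next_state
        score = accuracy_drop(state)
        if score > best_score:
            best_score, best_set = score, sorted(state)
    return best_score, best_set

if __name__ == "__main__":
    print(run_q_learning())
```

In this toy setting the agent learns which small combinations of bit positions yield the largest (simulated) accuracy drop; the paper's contribution lies in scaling such a search to the pruned sensitive layers of billion-parameter models.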