The rapid proliferation of Large Language Models (LLMs) has raised significant concerns about their security against adversarial attacks. In this work, we propose a novel approach to crafting universal jailbreaks and data extraction attacks by exploiting latent space discontinuities, an architectural vulnerability related to the sparsity of training data. Unlike previous methods, our technique generalizes across models and interfaces, proving highly effective against seven state-of-the-art LLMs and one image generation model. Initial results indicate that exploiting these discontinuities can consistently and profoundly compromise model behavior, even in the presence of layered defenses. These findings suggest that the strategy has substantial potential as a systemic attack vector.