Identity-preserving models have led to notable progress in generating personalized content. Unfortunately, such models also amplify the risks of misuse, for instance the generation of threatening content that targets specific individuals. This paper introduces the \textbf{Attribute Misbinding Attack}, a novel method that threatens identity-preserving models by inducing them to produce Not-Safe-For-Work (NSFW) content. The core idea is to craft benign-looking textual prompts that circumvent text-filter safeguards and exploit a key model vulnerability: flawed attribute binding that stems from the model's internal attention bias. As a result, the model misattributes harmful descriptions to the target identity and generates NSFW outputs. To facilitate the study of this attack, we present the \textbf{Misbinding Prompt} evaluation set, which examines the content generation risks of current state-of-the-art identity-preserving models across four risk dimensions: pornography, violence, discrimination, and illegality. Additionally, we introduce the \textbf{Attribute Binding Safety Score (ABSS)}, a metric that concurrently assesses content fidelity and safety compliance. Experimental results show that our Misbinding Prompt evaluation set achieves a \textbf{5.28}\% higher success rate in bypassing five leading text filters (including GPT-4o) than existing mainstream evaluation sets, while also yielding the highest proportion of NSFW content generation. The proposed ABSS metric, in turn, enables a more comprehensive evaluation of identity-preserving models.