Despite their impressive generative capabilities, text-to-image (T2I) diffusion models remain vulnerable to generating inappropriate content, especially when given implicit sexual prompts. Unlike explicitly harmful prompts, these subtle cues are often disguised as seemingly benign terms and can unexpectedly trigger sexual content because of underlying model biases, raising significant ethical concerns. Existing detection methods are designed primarily to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which detects and mitigates implicit malicious intent in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations. First, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that identifies malicious content with high accuracy and efficiency. Second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing attention in the prominent region, thereby enhancing the effectiveness of adaptive negative guidance for mitigating sexual content. We validate NDM on both natural and adversarial datasets and demonstrate superior performance over existing state-of-the-art methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.
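The abstract does not give formulas for either component, so the following is only a minimal sketch of the general idea and not the NDM implementation. It assumes (1) a hypothetical linear probe that scores a flattened early-step noise prediction, and (2) a standard classifier-free guidance update with an extra negative-concept term whose strength grows with that score. All function names, the threshold, and the maximum scale are illustrative assumptions; the noise-optimization step that suppresses attention is not modeled here.

```python
import math


def malicious_score(eps_early, weights, bias):
    """Hypothetical linear probe over an early-step predicted noise vector.

    Returns a logistic score in (0, 1). The weights and bias are assumed to
    come from some separately trained classifier (not specified in the text).
    """
    z = sum(w * e for w, e in zip(weights, eps_early)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adaptive_neg_scale(score, threshold=0.5, max_scale=7.5):
    """Map a detection score to a negative-guidance strength (assumed rule).

    Below the threshold no negative guidance is applied; above it the
    strength rises linearly up to max_scale.
    """
    if score < threshold:
        return 0.0
    return max_scale * (score - threshold) / (1.0 - threshold)


def guided_noise(eps_uncond, eps_cond, eps_neg, guidance_scale, neg_scale):
    """Classifier-free guidance with an added negative-concept term.

    Moves toward the prompt-conditioned prediction and away from the
    prediction conditioned on the unwanted concept, element by element.
    """
    return [
        u + guidance_scale * (c - u) - neg_scale * (n - u)
        for u, c, n in zip(eps_uncond, eps_cond, eps_neg)
    ]


if __name__ == "__main__":
    # Toy two-element "noise" vectors, purely for illustration.
    score = malicious_score([0.4, 0.9], weights=[1.0, 2.0], bias=-1.0)
    scale = adaptive_neg_scale(score)
    print(guided_noise([0.0, 1.0], [1.0, 1.0], [0.5, 2.0], 2.0, scale))
```

Under these assumptions, a benign prompt (score below the threshold) leaves the ordinary guided prediction untouched, which is one simple way to preserve generative quality for inputs that are not flagged.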