Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion ships with a safety filter that aims to prevent the generation of explicit images. Unfortunately, the filter is obfuscated and poorly documented, which makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue that safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
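To make the reverse-engineered mechanism concrete, the sketch below shows the general shape of such an embedding-similarity filter: the generated image is embedded (e.g., with CLIP), and its cosine similarity to a fixed set of precomputed "unsafe concept" embeddings is compared against per-concept thresholds. This is a hedged illustration, not the actual implementation; the function names, toy embeddings, and thresholds are all illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_flagged(image_embedding, concept_embeddings, thresholds):
    # Flag the image if it is "too similar" to any unsafe concept,
    # i.e., its similarity exceeds that concept's threshold.
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept, threshold in zip(concept_embeddings, thresholds)
    )

# Toy example with 3-dimensional embeddings (real CLIP embeddings
# are high-dimensional and the concept vectors are learned, not hand-picked).
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
thresholds = [0.9, 0.9]
print(is_flagged([0.99, 0.1, 0.0], concepts, thresholds))  # True: near concept 0
print(is_flagged([0.0, 0.0, 1.0], concepts, thresholds))   # False: far from both
```

A filter of this shape only blocks content whose embedding lands near one of the chosen concept vectors, which illustrates the paper's point: categories with no corresponding concept embedding (violence, gore) pass through unchecked.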