Die Chen Papers

NeurIPS 2025 Conference Paper

Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion models

Die Chen
Zhiwen Li
Cen Chen
Yuexiang Xie
Xiaodan Li
Jinyan Ye
Yingda Chen
Yaliang Li

Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issues associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios is still lacking. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios. Our aim is to advance the understanding of content safety in diffusion models and to establish a solid foundation for future research and development in this critical area.

ICLR 2025 Conference Paper

Growth Inhibitors for Suppressing Inappropriate Image Concepts in Diffusion Models

Die Chen
Zhiwen Li
Mingyuan Fan
Cen Chen
Wenmeng Zhou
Yanhao Wang
Yaliang Li

Despite their remarkable image generation capabilities, text-to-image diffusion models inadvertently learn inappropriate concepts from vast and unfiltered training data, which leads to various ethical and business risks. Specifically, model-generated images may exhibit not-safe-for-work (NSFW) content and style copyright infringement. The prompts that cause these problems often contain no explicit unsafe words; instead, they use obscure and associative terms. Such prompts are referred to as *implicit unsafe prompts*. Existing approaches directly fine-tune models under textual guidance to alter the cognition of the diffusion model, thereby erasing inappropriate concepts. This not only requires concept-specific fine-tuning but may also cause catastrophic forgetting. To address these issues, we explore the representation of inappropriate concepts in the image space and steer them toward more suitable ones by injecting *growth inhibitors*, which are tailored to the features related to inappropriate concepts identified during the diffusion process. Additionally, because inappropriate concepts vary in degree and scope, we train an adapter to infer the corresponding suppression scale during the injection process. Our method effectively captures how subtle words manifest at the image level, enabling direct and efficient erasure of target concepts without the need for fine-tuning. Through extensive experimentation, we demonstrate that our approach achieves superior erasure results with little effect on other normal concepts while preserving image quality and semantics.
