Personalized Safety Alignment for Text-to-Image Diffusion Models

TeleAI, China Telecom; Peking University; Yale University; National University of Singapore
Project Lead, Corresponding Author
leiyu2648@gmail.com
Teaser Image

PSA is a novel framework for personalizing safety controls in text-to-image generation. Unlike existing models that apply a one-size-fits-all safety filter, PSA dynamically adjusts outputs based on each user's age, mental health, and personal values. The demo shows how PSA adapts image content across three user profiles with increasing safety sensitivity.

Abstract

Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences, and we incorporate these profiles into the diffusion model through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores.

Method

Personalized Safety Alignment (PSA) enables user-specific content moderation in text-to-image generation. Unlike global safety filters, PSA tailors outputs to individual user profiles, allowing nuanced alignment with diverse safety expectations.

We first construct the Sage Dataset, which includes 1,000 virtual users, each defined by attributes such as age, religion, and mental health. Using LLMs, we infer which content categories each user finds unsafe and cluster the resulting user embeddings, as shown below; a brief sketch of this pipeline follows the figure.

User Embedding Clusters

User embeddings grouped into k=5 clusters and visualized with t-SNE, showing diverse safety preferences.
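For illustration, here is one way such a profile pipeline could be wired up; the profile schema, the LLM call, the text encoder, and the use of k-means (k=5) for clustering are assumptions made for this sketch, not the released Sage construction code.

# Hypothetical sketch of a Sage-style profile pipeline (illustration only).
# `encode(text) -> np.ndarray` and `llm(prompt)` are assumed to exist.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

CATEGORIES = ["hate", "harassment", "violence", "self-harm", "sexual", "shocking"]

def profile_to_text(profile):
    # Serialize a virtual user's attributes into one description string.
    return ", ".join(f"{k}: {v}" for k, v in profile.items())

def label_unsafe_categories(profile, llm):
    # Ask an LLM which content categories this user would consider unsafe.
    prompt = (f"User profile: {profile_to_text(profile)}\n"
              f"From {CATEGORIES}, list the categories this user finds unsafe.")
    return llm(prompt)  # assumed to return a subset of CATEGORIES

def cluster_users(profiles, encode, k=5):
    # Embed each profile description, group users into k clusters,
    # and compute a 2-D t-SNE projection for visualization.
    emb = np.stack([encode(profile_to_text(p)) for p in profiles])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    coords = TSNE(n_components=2, random_state=0).fit_transform(emb)
    return emb, labels, coords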

Concept-Pair Generation Process

For each concept, an LLM generates harmful/safe prompts to form supervision pairs.

For each sensitive concept, we generate a pair of prompts: one harmful and one safe. These are used to create corresponding image pairs, allowing us to construct user-conditioned training data. An LLM determines which version a user would prefer, enabling pairwise supervision per user.
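A minimal sketch of this pairwise-supervision step is shown below; `llm`, `generate_image`, and the prompt templates are hypothetical placeholders rather than the paper's actual generation scripts.

# Hypothetical sketch of per-user pairwise supervision (illustration only).
def build_preference_pairs(concepts, users, llm, generate_image):
    pairs = []
    for concept in concepts:
        # One harmful and one safe prompt per sensitive concept.
        harmful_prompt = llm(f"Write an image prompt that depicts '{concept}'.")
        safe_prompt = llm(f"Write an image prompt about '{concept}' with harmful content removed.")
        img_harmful = generate_image(harmful_prompt)
        img_safe = generate_image(safe_prompt)
        for user in users:
            # An LLM judges which image this user would prefer, given their profile.
            verdict = llm(f"Profile: {user['profile']}. Concept: {concept}. "
                          f"Would this user prefer the safe or the harmful image?")
            chosen, rejected = ((img_safe, img_harmful) if verdict == "safe"
                                else (img_harmful, img_safe))
            pairs.append({"user": user["id"], "chosen": chosen, "rejected": rejected})
    return pairs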

Training Pipeline

Training pipeline: based on the user's profile, the model learns which image in each safe/harmful pair to prefer.

We extend Stable Diffusion and SDXL with cross-attention adapters that condition generation on the user embedding. The adapters introduce a lightweight control branch in each U-Net attention layer, allowing safety preferences to guide generation without retraining the entire model.
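The sketch below illustrates the adapter idea in PyTorch: an extra key/value branch attends over user-profile tokens, and its output is added to the frozen text cross-attention output. The module name, shapes, and scale factor are assumptions for this illustration, not the released PSA code.

# Minimal PyTorch sketch of a user-conditioned cross-attention adapter.
import torch.nn as nn
import torch.nn.functional as F

class UserCrossAttentionAdapter(nn.Module):
    def __init__(self, hidden_dim, user_dim, scale=1.0):
        super().__init__()
        # Extra key/value projections for the user embedding; the frozen
        # base attention keeps its original text-conditioned K/V.
        self.to_k_user = nn.Linear(user_dim, hidden_dim, bias=False)
        self.to_v_user = nn.Linear(user_dim, hidden_dim, bias=False)
        self.scale = scale

    def forward(self, query, base_out, user_emb):
        # query:    (B, L, D) queries from a U-Net cross-attention layer
        # base_out: (B, L, D) output of the original text cross-attention
        # user_emb: (B, T, user_dim) tokens encoding the user's safety profile
        k = self.to_k_user(user_emb)
        v = self.to_v_user(user_emb)
        user_out = F.scaled_dot_product_attention(query, k, v)
        # Add the user-conditioned branch on top of the frozen base output.
        return base_out + self.scale * user_out

Under this reading, only the added projections would be trained, which is consistent with conditioning generation on safety preferences without retraining the full U-Net.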

Experimental Results

We evaluate PSA across two backbones (SD v1.5 and SDXL), using both quantitative metrics and qualitative examples. Results highlight PSA’s effectiveness in suppressing harmful content and respecting user preferences.

General Harmful Content Suppression

PSA significantly reduces the likelihood of generating unsafe content while preserving image fidelity. Below, we show comparative results on standard safety benchmarks.

SD v1.5 Suppression Table

PSA on SD v1.5 reduces inappropriate probability (IP) while preserving image quality.

SDXL Suppression Table

On SDXL, PSA achieves even stronger suppression across all safety datasets.
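As a rough illustration, an IP-style metric like the one reported in the tables above can be computed as the fraction of generated images that a safety classifier flags as unsafe over a benchmark prompt set; the classifier interface below is an assumption, not the paper's exact evaluation setup.

# Sketch of an inappropriate-probability (IP) style metric: the share of
# generated images flagged as unsafe by a safety classifier.
# `generate_image` and `is_unsafe` are placeholders.
def inappropriate_probability(prompts, generate_image, is_unsafe):
    flagged = sum(bool(is_unsafe(generate_image(p))) for p in prompts)
    return flagged / len(prompts)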

Personalized Safety Alignment

Beyond global suppression, PSA excels in tailoring outputs to individual user constraints. We compare models on seen and unseen user profiles.

Win Rate Chart

PSA achieves a higher Win Rate than the base model and SafetyDPO for both seen and unseen users.

Pass Rate Table

PSA outperforms baselines in Pass Rate, showing better user-preference compliance.
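For reference, the bookkeeping behind these two metrics is straightforward: Win Rate counts pairwise comparisons in which PSA's output is judged to match the user's preferences better than a baseline's, and Pass Rate counts generations that satisfy the user's stated constraints. The judge functions in the sketch below are placeholders standing in for the paper's evaluation protocol.

# Bookkeeping sketch for Win Rate and Pass Rate; `prefers_first` and
# `satisfies_constraints` are hypothetical judge functions.
def win_rate(psa_images, baseline_images, users, prefers_first):
    wins = sum(bool(prefers_first(a, b, u))
               for a, b, u in zip(psa_images, baseline_images, users))
    return wins / len(users)

def pass_rate(images, users, satisfies_constraints):
    passed = sum(bool(satisfies_constraints(img, u)) for img, u in zip(images, users))
    return passed / len(users)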

Qualitative Comparison

The following visualization demonstrates how PSA dynamically adjusts visual outputs across multiple user safety levels.

Qualitative Comparison

PSA suppresses harmful content (e.g., hate, sexuality, self-harm) progressively from Level 1 to Level 5.

More Qualitative Examples

We further showcase PSA’s generalization across SDXL and SD v1.5 architectures. These examples span diverse harmful categories and validate consistent suppression through user conditioning.

Qualitative Results - SDXL

PSA suppression from Level 1 to 5 on SDXL across harassment, hate, sexuality, shocking, and violence categories.

Qualitative Results - SD v1.5

PSA suppression from Level 1 to 5 on the SD v1.5 backbone, achieving consistent results on this lighter architecture.

BibTeX

@article{lei2025psalign,
  title={Personalized Safety Alignment for Text-to-Image Diffusion Models},
  author={Yu Lei and Jinbin Bai and Qingyu Shi and Aosong Feng and Kaidong Yu},
  journal={arXiv preprint arXiv:2507.xxxxx},
  year={2025}
}