Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences, and incorporate these profiles into the model through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content more closely with user constraints, achieving higher Win Rate and Pass Rate scores.
Personalized Safety Alignment (PSA) enables user-specific content moderation in text-to-image generation. Unlike global safety filters, PSA tailors outputs to individual user profiles, allowing nuanced alignment with diverse safety expectations.
We first construct the Sage dataset, which includes 1,000 virtual users, each defined by attributes such as age, religion, and mental health. Using LLMs, we infer which content categories each user considers unsafe and cluster the resulting user embeddings, as shown below.
User embeddings grouped into k=5 clusters and visualized with t-SNE, showing diverse safety preferences.
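As a rough sketch of this clustering step, assuming each user is encoded as a binary vector over sensitive-content categories (the actual Sage embedding may differ), k-means with k=5 groups the profiles and t-SNE provides the 2-D view:

```python
# Minimal sketch: cluster per-user unsafe-category vectors and project with t-SNE.
# Assumes each of the 1,000 users is a binary vector over C sensitive-content
# categories (1 = user finds the category unsafe); placeholder data only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
num_users, num_categories = 1000, 20
user_vecs = rng.integers(0, 2, size=(num_users, num_categories)).astype(float)

# k-means with k=5 groups users by shared safety preferences.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(user_vecs)

# t-SNE gives the 2-D layout used for the figure above.
coords = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(user_vecs)
print(coords.shape, np.bincount(labels))  # (1000, 2), cluster sizes
```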
For each concept, an LLM generates harmful/safe prompts to form supervision pairs.
For each sensitive concept, we generate a pair of prompts: one harmful and one safe. These are used to create corresponding image pairs, allowing us to construct user-conditioned training data. An LLM determines which version a user would prefer, enabling pairwise supervision per user.
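A minimal, hypothetical sketch of what one such user-conditioned supervision record could look like; the field names and the simple rule standing in for the LLM judgment are illustrative only:

```python
# Hypothetical illustration of one user-conditioned preference record; not the
# released data format.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    user_id: int
    concept: str             # sensitive concept, e.g. "violence"
    prompt_harmful: str
    prompt_safe: str
    image_harmful: str       # path to the image generated from the harmful prompt
    image_safe: str          # path to the image generated from the safe prompt
    preferred: str           # "safe" or "harmful", as judged for this user

def llm_judge(user_profile: dict, concept: str) -> str:
    # Stand-in for the LLM call: assume a user prefers the safe image for any
    # concept they consider unsafe; the real judgment can be more nuanced.
    return "safe" if concept in user_profile["unsafe_categories"] else "harmful"

def build_pair(user_profile: dict, concept: str, prompt_harmful: str,
               prompt_safe: str, image_harmful: str, image_safe: str) -> PreferencePair:
    return PreferencePair(user_profile["id"], concept, prompt_harmful, prompt_safe,
                          image_harmful, image_safe, llm_judge(user_profile, concept))

user = {"id": 7, "unsafe_categories": {"violence", "self-harm"}}
pair = build_pair(user, "violence", "a brutal street fight", "a heated argument",
                  "u007_violence_harm.png", "u007_violence_safe.png")
print(pair.preferred)  # -> "safe"
```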
Training pipeline: the model learns to prefer safe or unsafe images based on the user’s profile.
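The exact objective is not spelled out here, but one common way to realize this kind of pairwise preference learning for diffusion models is a Diffusion-DPO-style loss computed against a frozen reference model. The sketch below assumes that form and uses illustrative tensor names; it is not necessarily the loss used in PSA.

```python
import torch
import torch.nn.functional as F

def user_conditioned_dpo_loss(eps_pred_w, eps_pred_l,   # policy predictions
                              eps_ref_w, eps_ref_l,     # frozen reference predictions
                              noise_w, noise_l, beta=0.1):
    """Diffusion-DPO-style pairwise loss (illustrative, not the paper's exact form).

    The eps_* tensors are noise predictions for the user-preferred (w) and
    rejected (l) latents, shaped (B, C, H, W); the U-Net producing eps_pred_*
    is conditioned on the user embedding via the adapters described below.
    """
    # Per-sample denoising errors.
    err_w = F.mse_loss(eps_pred_w, noise_w, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(eps_pred_l, noise_l, reduction="none").mean(dim=(1, 2, 3))
    err_ref_w = F.mse_loss(eps_ref_w, noise_w, reduction="none").mean(dim=(1, 2, 3))
    err_ref_l = F.mse_loss(eps_ref_l, noise_l, reduction="none").mean(dim=(1, 2, 3))

    # Prefer the winner: its error should shrink relative to the reference
    # more than the loser's does.
    diff = (err_ref_w - err_w) - (err_ref_l - err_l)
    return -F.logsigmoid(beta * diff).mean()

# Toy shapes: batch of 2 latents, 4x8x8.
t = [torch.randn(2, 4, 8, 8) for _ in range(6)]
print(user_conditioned_dpo_loss(*t).item())
```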
We extend Stable Diffusion and SDXL with cross-attention adapters that condition generation on the user embedding. The adapters introduce a lightweight control branch in each U-Net attention layer, allowing safety preferences to guide generation without retraining the entire model.
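As a hedged sketch (not the released implementation), an adapter of this kind can be written as a decoupled attention branch whose key/value projections read user-profile tokens and whose output is added to the frozen text cross-attention output with a tunable scale:

```python
import torch
import torch.nn as nn

class UserCrossAttentionAdapter(nn.Module):
    """Lightweight user-conditioned branch for an existing cross-attention layer.

    Illustrative sketch: the frozen base attention handles text conditioning,
    while new key/value projections attend to user-profile tokens; `scale`
    controls how strongly the safety preference steers generation.
    """
    def __init__(self, dim: int, user_dim: int, num_heads: int = 8, scale: float = 1.0):
        super().__init__()
        self.to_k_user = nn.Linear(user_dim, dim, bias=False)
        self.to_v_user = nn.Linear(user_dim, dim, bias=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = scale

    def forward(self, hidden_states: torch.Tensor, base_out: torch.Tensor,
                user_tokens: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, N, dim) U-Net tokens; base_out: output of the frozen
        # text cross-attention; user_tokens: (B, M, user_dim) profile tokens.
        k = self.to_k_user(user_tokens)
        v = self.to_v_user(user_tokens)
        user_out, _ = self.attn(hidden_states, k, v)
        return base_out + self.scale * user_out

# Toy usage with random tensors.
adapter = UserCrossAttentionAdapter(dim=320, user_dim=768)
h = torch.randn(2, 64, 320)          # U-Net tokens
base = torch.randn(2, 64, 320)       # frozen text cross-attention output
user = torch.randn(2, 4, 768)        # user-profile tokens
print(adapter(h, base, user).shape)  # torch.Size([2, 64, 320])
```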
We evaluate PSA across two backbones (SD v1.5 and SDXL), using both quantitative metrics and qualitative examples. Results highlight PSA’s effectiveness in suppressing harmful content and respecting user preferences.
PSA significantly reduces the likelihood of generating unsafe content while preserving fidelity. Here, we show comparative performance on standard benchmarks.
PSA on SD v1.5 reduces inappropriate probability (IP) while preserving image quality.
On SDXL, PSA achieves even stronger suppression across all safety datasets.
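For context, inappropriate probability is typically the fraction of generations flagged by a safety classifier on a benchmark of unsafe-leaning prompts. The sketch below uses a placeholder classifier rather than the specific detector used in the evaluation:

```python
from typing import Callable, Iterable

def inappropriate_probability(images: Iterable, flag_inappropriate: Callable) -> float:
    """Fraction of generated images flagged as inappropriate (lower is better).

    `flag_inappropriate` is a placeholder for whichever safety classifier the
    evaluation uses (e.g., an NSFW/Q16-style detector); it returns True/False.
    """
    flags = [bool(flag_inappropriate(img)) for img in images]
    return sum(flags) / max(len(flags), 1)

# Toy usage with a dummy classifier over dummy "images".
print(inappropriate_probability(range(10), lambda img: img % 4 == 0))  # 0.3
```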
Beyond global suppression, PSA excels in tailoring outputs to individual user constraints. We compare models on seen and unseen user profiles.
PSA achieves a higher Win Rate than both the base model and SafetyDPO for seen and unseen users.
PSA outperforms baselines in Pass Rate, showing better user-preference compliance.
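A rough sketch of the two metrics under common-sense assumptions: Win Rate counts pairwise judgments in PSA's favor (ties split), and Pass Rate counts outputs that satisfy every constraint in a user's profile; the exact judging protocol may differ.

```python
def win_rate(judgments: list) -> float:
    """judgments: one entry per (user, prompt) pair, each "psa", "baseline", or "tie".
    Ties are counted as half a win here (an assumption)."""
    wins = sum(1.0 if j == "psa" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return wins / max(len(judgments), 1)

def pass_rate(compliance_flags: list) -> float:
    """compliance_flags: True if a generated image satisfies every constraint
    in the corresponding user's profile, as decided by the evaluator."""
    return sum(bool(f) for f in compliance_flags) / max(len(compliance_flags), 1)

print(win_rate(["psa", "baseline", "psa", "tie"]))  # 0.625
print(pass_rate([True, True, False, True]))         # 0.75
```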
The following visualization demonstrates how PSA dynamically adjusts visual outputs across multiple user safety levels.
PSA suppresses harmful content (e.g., hate, sexuality, self-harm) progressively from Level 1 to Level 5.
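As an illustration only, such a level sweep can be approximated by interpolating the user conditioning between a permissive and a strict profile; the placeholder tensors below stand in for real PSA user embeddings.

```python
import torch

# Illustrative only: sweep a scalar safety level by interpolating between a
# permissive and a strict user embedding. Real PSA profiles encode richer
# attributes; these tensors are placeholders.
permissive = torch.zeros(1, 768)
strict = torch.ones(1, 768)

def user_embedding_for_level(level: int, num_levels: int = 5) -> torch.Tensor:
    alpha = (level - 1) / (num_levels - 1)   # 0.0 at Level 1, 1.0 at Level 5
    return (1 - alpha) * permissive + alpha * strict

for level in range(1, 6):
    emb = user_embedding_for_level(level)
    # `emb` would be fed to the cross-attention adapters during sampling so the
    # same prompt yields progressively safer images at higher levels.
    print(level, emb.mean().item())
```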
We further showcase PSA’s generalization across SDXL and SD v1.5 architectures. These examples span diverse harmful categories and validate consistent suppression through user conditioning.
PSA suppression from Level 1 to 5 on SDXL across harassment, hate, sexuality, shocking, and violence categories.
PSA suppression from Level 1 to 5 on SD v1.5 backbone, achieving consistent results under lighter architectures.
@article{lei2025psalign,
  title={Personalized Safety Alignment for Text-to-Image Diffusion Models},
  author={Yu Lei and Jinbin Bai and Qingyu Shi and Aosong Feng and Kaidong Yu},
  journal={arXiv preprint arXiv:2507.xxxxx},
  year={2025}
}