Personalized Safety Alignment for Text-to-Image Diffusion Models

TeleAI, China Telecom; Peking University; Yale University; National University of Singapore
Project Lead, Corresponding Author
leiyu2648@gmail.com
Teaser Image

PSA is a novel framework for personalizing safety controls in text-to-image generation. Unlike existing models that apply a one-size-fits-all safety filter, PSA dynamically adjusts outputs based on each user's age, mental health, and personal values. The demo shows how PSA adapts image content across three user profiles with increasing safety sensitivity.

Abstract

Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences, and we incorporate these profiles into the diffusion model through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores.

Method

Personalized Safety Alignment (PSA) enables user-specific content moderation in text-to-image generation. Unlike global safety filters, PSA tailors outputs to individual user profiles, allowing nuanced alignment with diverse safety expectations.

We first construct the Sage Dataset, which includes 1,000 virtual users, each defined by attributes such as age, religion, and mental health. Using LLMs, we infer which content categories each user finds unsafe and cluster the resulting user embeddings, as shown below; a brief sketch of this pipeline follows the figure.

User Embedding Clusters

User embeddings grouped into k=5 clusters and visualized with t-SNE, showing diverse safety preferences.
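For illustration, here is one way such a profile pipeline could be wired up; the profile schema, the LLM call, the text encoder, and the use of k-means (k=5) for clustering are assumptions made for this sketch, not the released Sage construction code.

# Hypothetical sketch of a Sage-style profile pipeline (illustration only).
# `encode(text) -> np.ndarray` and `llm(prompt)` are assumed to exist.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

CATEGORIES = ["hate", "harassment", "violence", "self-harm", "sexual", "shocking"]

def profile_to_text(profile):
    # Serialize a virtual user's attributes into one description string.
    return ", ".join(f"{k}: {v}" for k, v in profile.items())

def label_unsafe_categories(profile, llm):
    # Ask an LLM which content categories this user would consider unsafe.
    prompt = (f"User profile: {profile_to_text(profile)}\n"
              f"From {CATEGORIES}, list the categories this user finds unsafe.")
    return llm(prompt)  # assumed to return a subset of CATEGORIES

def cluster_users(profiles, encode, k=5):
    # Embed each profile description, group users into k clusters,
    # and compute a 2-D t-SNE projection for visualization.
    emb = np.stack([encode(profile_to_text(p)) for p in profiles])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    coords = TSNE(n_components=2, random_state=0).fit_transform(emb)
    return emb, labels, coords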

Concept-Pair Generation Process

For each concept, an LLM generates harmful/safe prompts to form supervision pairs.

For each sensitive concept, we generate a pair of prompts: one harmful and one safe. These are used to create corresponding image pairs, allowing us to construct user-conditioned training data. An LLM determines which version a user would prefer, enabling pairwise supervision per user.
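A minimal sketch of this pairwise-supervision step is shown below; `llm`, `generate_image`, and the prompt templates are hypothetical placeholders rather than the paper's actual generation scripts.

# Hypothetical sketch of per-user pairwise supervision (illustration only).
def build_preference_pairs(concepts, users, llm, generate_image):
    pairs = []
    for concept in concepts:
        # One harmful and one safe prompt per sensitive concept.
        harmful_prompt = llm(f"Write an image prompt that depicts '{concept}'.")
        safe_prompt = llm(f"Write an image prompt about '{concept}' with harmful content removed.")
        img_harmful = generate_image(harmful_prompt)
        img_safe = generate_image(safe_prompt)
        for user in users:
            # An LLM judges which image this user would prefer, given their profile.
            verdict = llm(f"Profile: {user['profile']}. Concept: {concept}. "
                          f"Would this user prefer the safe or the harmful image?")
            chosen, rejected = ((img_safe, img_harmful) if verdict == "safe"
                                else (img_harmful, img_safe))
            pairs.append({"user": user["id"], "chosen": chosen, "rejected": rejected})
    return pairs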

Training Pipeline

Training pipeline: based on the user's profile, the model learns which image in each safe/harmful pair to prefer.

We extend Stable Diffusion and SDXL with cross-attention adapters that condition generation on the user embedding. The adapters introduce a lightweight control branch in each U-Net attention layer, allowing safety preferences to guide generation without retraining the entire model.
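The sketch below illustrates the adapter idea in PyTorch: an extra key/value branch attends over user-profile tokens, and its output is added to the frozen text cross-attention output. The module name, shapes, and scale factor are assumptions for this illustration, not the released PSA code.

# Minimal PyTorch sketch of a user-conditioned cross-attention adapter.
import torch.nn as nn
import torch.nn.functional as F

class UserCrossAttentionAdapter(nn.Module):
    def __init__(self, hidden_dim, user_dim, scale=1.0):
        super().__init__()
        # Extra key/value projections for the user embedding; the frozen
        # base attention keeps its original text-conditioned K/V.
        self.to_k_user = nn.Linear(user_dim, hidden_dim, bias=False)
        self.to_v_user = nn.Linear(user_dim, hidden_dim, bias=False)
        self.scale = scale

    def forward(self, query, base_out, user_emb):
        # query:    (B, L, D) queries from a U-Net cross-attention layer
        # base_out: (B, L, D) output of the original text cross-attention
        # user_emb: (B, T, user_dim) tokens encoding the user's safety profile
        k = self.to_k_user(user_emb)
        v = self.to_v_user(user_emb)
        user_out = F.scaled_dot_product_attention(query, k, v)
        # Add the user-conditioned branch on top of the frozen base output.
        return base_out + self.scale * user_out

Under this reading, only the added projections would be trained, which is consistent with conditioning generation on safety preferences without retraining the full U-Net.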

Experimental Results

We evaluate PSA across two backbones (SD v1.5 and SDXL), using both quantitative metrics and qualitative examples. Results highlight PSA’s effectiveness in suppressing harmful content and respecting user preferences.

General Harmful Content Suppression

PSA significantly reduces the likelihood of generating unsafe content while preserving image fidelity. Below, we show comparative results on standard safety benchmarks.

SD v1.5 Suppression Table

PSA on SD v1.5 reduces inappropriate probability (IP) while preserving image quality.

SDXL Suppression Table

On SDXL, PSA achieves even stronger suppression across all safety datasets.
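As a rough illustration, an IP-style metric like the one reported in the tables above can be computed as the fraction of generated images that a safety classifier flags as unsafe over a benchmark prompt set; the classifier interface below is an assumption, not the paper's exact evaluation setup.

# Sketch of an inappropriate-probability (IP) style metric: the share of
# generated images flagged as unsafe by a safety classifier.
# `generate_image` and `is_unsafe` are placeholders.
def inappropriate_probability(prompts, generate_image, is_unsafe):
    flagged = sum(bool(is_unsafe(generate_image(p))) for p in prompts)
    return flagged / len(prompts)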

Personalized Safety Alignment

Beyond global suppression, PSA excels in tailoring outputs to individual user constraints. We compare models on seen and unseen user profiles.

Win Rate Chart

PSA achieves a higher Win Rate than the base model and SafetyDPO for both seen and unseen users.

Pass Rate Table

PSA outperforms baselines in Pass Rate, showing better user-preference compliance.
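For reference, the bookkeeping behind these two metrics is straightforward: Win Rate counts pairwise comparisons in which PSA's output is judged to match the user's preferences better than a baseline's, and Pass Rate counts generations that satisfy the user's stated constraints. The judge functions in the sketch below are placeholders standing in for the paper's evaluation protocol.

# Bookkeeping sketch for Win Rate and Pass Rate; `prefers_first` and
# `satisfies_constraints` are hypothetical judge functions.
def win_rate(psa_images, baseline_images, users, prefers_first):
    wins = sum(bool(prefers_first(a, b, u))
               for a, b, u in zip(psa_images, baseline_images, users))
    return wins / len(users)

def pass_rate(images, users, satisfies_constraints):
    passed = sum(bool(satisfies_constraints(img, u)) for img, u in zip(images, users))
    return passed / len(users)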

Qualitative Comparison

The following visualization demonstrates how PSA dynamically adjusts visual outputs across multiple user safety levels.

Qualitative Comparison

PSA suppresses harmful content (e.g., hate, sexuality, self-harm) progressively from Level 1 to Level 5.

More Qualitative Examples

We further showcase PSA’s generalization across SDXL and SD v1.5 architectures. These examples span diverse harmful categories and validate consistent suppression through user conditioning.

Qualitative Results - SDXL

PSA suppression from Level 1 to 5 on SDXL across harassment, hate, sexuality, shocking, and violence categories.

Qualitative Results - SD v1.5

PSA suppression from Level 1 to 5 on the SD v1.5 backbone, achieving consistent results on this lighter architecture.

BibTeX

@article{lei2025psalign,
  title={Personalized Safety Alignment for Text-to-Image Diffusion Models},
  author={Yu Lei and Jinbin Bai and Qingyu Shi and Aosong Feng and Kaidong Yu},
  journal={arXiv preprint arXiv:2507.xxxxx},
  year={2025}
}