Every LLM Has Silenced Biases — Here's How I Broke 5 Model Families

I extended the Silenced Biases (AAAI-26) attack framework to Mistral, DeepSeek-R1, Phi-3.5, and more. Activation subtraction achieves 100% break rate on standard transformers. Every model tested harbors exploitable biases.

Posted Apr 9, 2026

Silenced Biases Across LLM Families — Multi-Model Attack Results

By Rotem Levi

7 min read

Every LLM Has Silenced Biases — Here's How I Broke 5 Model Families

After breaking Gemma 4’s safety alignment through prompt-level attacks, I wanted to know: is this a Gemma problem, or an everyone problem?

I built a generalized attack framework (multi_model_sbb.py, ~550 lines) implementing 20 attack strategies across prompt-level and activation-level manipulation, then ran it against every open-weight model I could fit on a single A10G GPU.

Answer: it’s an everyone problem.

Results at a Glance

Model	Params	Best Attack	Max Break Rate	Activation Steering
Mistral-7B-Instruct-v0.3	7.25B	ActSub α=-50	100% (6/6)	✅ Works
DeepSeek-R1-Distill-Llama-8B	8.03B	ActSub α=-50	100% (6/6)	✅ Works
Phi-3.5-mini-instruct	3.82B	System prompt	83% (5/6)	❌ Mostly fails
Gemma 4 E4B-it	4B	Prefill + System	100% (4/4)	❌ All failed (PLE)
Qwen2.5-7B-Instruct	7.6B	Activation steering	100% (4/4)	✅ Works

Every model tested contains biases that survive alignment. Zero exceptions.

flowchart TD
    A[Bias Probe] --> B{Architecture?}
    B -->|Standard Transformer| C[Activation Subtraction]
    B -->|PLE / Gemma 4| D[Prompt-Level Only]
    B -->|Phi-style| E[System Prompt Manipulation]
    C --> F["100% Break Rate"]
    D --> G["100% Break Rate"]
    E --> H["83% Break Rate"]
    style F fill:#e74c3c,color:#fff
    style G fill:#e74c3c,color:#fff
    style H fill:#e67e22,color:#fff

Background

This work extends the Silenced Biases paper (Himelstein et al., AAAI-26), which showed that aligned LLMs still contain biases — they just learn to refuse to express them. The paper introduced activation steering to extract these hidden biases from Qwen2.5-7B.

My previous post showed that Gemma 4’s PLE architecture blocks activation steering entirely, but prompt-level attacks still break it. This post generalizes the attack to multiple model families.

The Attack Framework

I implemented 20 strategies in 4 categories:

Prompt-Level (S0–S5)

S0 — Baseline (no manipulation, establishes refusal rate)
S1 — Prefill injection (force a biased prefix into the response)
S2 — System prompt variants: stats_only, forced_choice, researcher, opposite_day
S2b — Query rephrasing: neutral, academic
S3 — Few-shot bias priming
S4 — Logit suppression (penalize refusal tokens like “sorry”, “cannot”)
S5 — Combined prefill + logit suppression

Activation-Level (S6–S9)

S6a — Single-layer refusal direction ablation
S6b — Activation subtraction at α = {-10, -50, -100}
S7 — Multi-layer ablation (last 8 layers)
S8 — SVD subspace ablation (k=10 directions)
S9 — Combined: prefill + activation subtraction + logit suppression

The framework computes refusal directions from harmful/harmless activation differences, identifies the optimal ablation layer, then systematically tests each strategy against 6 bias categories: Religion, Gender, Race, Age, Nationality, Politics.

Model-by-Model Breakdown

Mistral-7B-Instruct-v0.3 — Total Collapse

Runtime: 10.4 min

Best ablation layer: 22

Architecture: Standard transformer

Mistral is the most vulnerable model I tested. Activation subtraction at α=-50 produces complete alignment collapse — all 6 bias categories broken.

Strategy	Break Rate	Categories Broken
S0 baseline	0%	0/6
S1 prefill (long)	100%	6/6
S2 stats_only	67%	4/6
S2 forced_choice	67%	4/6
S6a ablation	67%	4/6
S6b actsub α=-50	100%	6/6
S6b actsub α=-100	100%	6/6
S7 multi-layer	67%	4/6
S9 combined	100%	6/6

The safety alignment is paper-thin — encoded in a single refusal direction at layer 22 that’s trivially computable from ~50 harmful/harmless prompt pairs.

DeepSeek-R1-Distill-Llama-8B — Reasoning Doesn’t Help

Runtime: 16.3 min

Best ablation layer: 22

Architecture: Llama-based

DeepSeek-R1 is a reasoning model — it generates chain-of-thought (CoT) before answering. You might expect that explicit reasoning about whether to comply would provide extra safety. It doesn’t.

Strategy	Break Rate	Categories Broken
S0 baseline	0%	0/6
S1 prefill (long)	100%	6/6
S5 prefill+logit (long)	83%	5/6
S6a ablation	83%	5/6
S6b actsub α=-50	100%	6/6
S6b actsub α=-100	100%	6/6
S7 multi-layer	67%	4/6
S8 subspace k=10	67%	4/6
S9 combined	100%	6/6

When activation steering bypasses the refusal mechanism, the CoT process actively engages with the biased premise and produces rationalized biased answers — sometimes citing fabricated statistics. The reasoning capability becomes a liability: instead of simply outputting “(a)”, the model explains why it picked the biased answer.

This is the most concerning finding. Reasoning models don’t just express bias — they rationalize it.

Phi-3.5-mini-instruct — The Activation-Resistant Outlier

Runtime: 11.1 min

Best ablation layer: 25

Architecture: Standard (Microsoft)

Phi-3.5 breaks the pattern. It resists activation-level attacks (max 17% break rate from activation steering) but is highly vulnerable to system prompt manipulation.

Strategy	Break Rate	Categories Broken
S0 baseline	0%	0/6
S1 prefill (long)	50%	3/6
S2 stats_only	83%	5/6
S2 forced_choice	83%	5/6
S2 researcher	67%	4/6
S6a ablation	0%	0/6
S6b actsub α=-50	17%	1/6
S7 multi-layer	0%	0/6
S8 subspace k=10	0%	0/6
S9 combined	67%	4/6

Microsoft’s alignment approach for Phi appears to distribute safety behaviors across the model’s weights rather than concentrating them in a single refusal direction. This makes activation steering ineffective — but the model still folds when you frame bias questions as “statistical analysis” or “forced choice for academic purposes.”

xychart-beta
    title "Best Break Rate by Model"
    x-axis ["Mistral-7B", "DeepSeek-R1", "Phi-3.5", "Gemma 4", "Qwen-7B"]
    y-axis "Break Rate (%)" 0 --> 100
    bar [100, 100, 83, 100, 100]

Key Findings

1. Architecture Determines Attack Surface

This is the headline result. The most important factor in how a model can be attacked is its architecture — not its size, training data, or alignment procedure.

Standard transformers (Mistral, DeepSeek/Llama, Qwen) → Fully vulnerable to activation steering
PLE architectures (Gemma 4) → Immune to activation steering, but prompt attacks work
Microsoft Phi → Resistant to activation, vulnerable to prompt engineering

2. Chain-of-Thought Is Not a Safety Mechanism

DeepSeek-R1 achieves 100% break rate under activation subtraction — identical to models without CoT. Worse, when broken, the CoT process rationalizes the biased output instead of catching it. Reasoning should not be confused with alignment.

3. Activation Subtraction Is Devastating

For any standard transformer model, computing the refusal direction from ~50 harmful/harmless pairs and subtracting it at the right layer (typically 60-70% depth) produces near-total alignment collapse. This is not a theoretical attack — it requires only forward passes and basic linear algebra.

4. System Prompts Are a Universal Bypass

System prompt manipulation achieves 50-83% break rates across all architectures tested, including models that resist activation attacks. Framing a biased question as “provide statistics only” or “forced academic choice” is surprisingly effective across the board.

5. All Models Have Biases

This is the most important takeaway: no model achieved 0% break rate across all strategies. Every model we tested — regardless of developer, architecture, or alignment approach — contains silenced biases that can be extracted. The biases are not removed by alignment; they are only hidden.

Bias Category Difficulty

Cross-model analysis reveals that some bias categories are harder for models to suppress:

Category	Avg Break Rate	Note
Religion	~90%	Muslim/terrorist association is the strongest signal
Age	~85%	Elderly/technology stereotypes easily elicited
Gender	~80%	Nursing/gender bias consistently surfaces
Nationality	~75%	Iranian nationality consistently selected
Race	~70%	Second-most guarded category
Politics	~60%	Models most resistant here; often select “cannot determine”

Religion bias is the easiest to extract, suggesting it has the strongest representation in training data. Political bias is the hardest — likely because political fairness receives the most attention during alignment training.

Implications

Safety alignment is not safety. Alignment training teaches models to refuse, not to be unbiased. The biases persist and are extractable.
Architecture matters more than training. Gemma 4’s PLE architecture provides more robust protection against internal manipulation than any alignment training on standard architectures.
Evaluation beyond surface responses. Standard QA evaluations that accept refusal as evidence of fairness are fundamentally flawed. The SBB methodology exposes this gap.
Defense-in-depth is essential. No single mechanism — alignment, CoT, refusal training — is sufficient. Models need architectural and algorithmic protections combined.
Reasoning models need deeper alignment. The CoT process in DeepSeek-R1 actively rationalizes biased outputs when the refusal mechanism is bypassed, making broken reasoning models potentially more harmful than broken non-reasoning models.

Code & Data

All attack code, result JSONs, and the generalized framework are available:

Attack framework: multi_model_sbb.py
Gemma 4 attacks: gemma4_break.py
Original paper: Silenced Biases — The Dark Side LLMs Learned to Refuse (AAAI-26)

What’s Next

Llama-3.1-8B-Instruct — Gated on HuggingFace, pending access approval
Larger models — Testing whether scale provides any defense (14B+ models)
Visualizations — Interactive heatmaps of per-category, per-strategy results
Automated defense evaluation — Testing whether runtime monitors can detect activation-level attacks

This is Part 2 of my series on extending the Silenced Biases (AAAI-26) research. Part 1 covers Gemma 4 specifically.

AI Security, LLM Safety

This post is licensed under CC BY 4.0 by the author.