Every LLM Has Silenced Biases — Here's How I Broke 5 Model Families
I extended the Silenced Biases (AAAI-26) attack framework to Mistral, DeepSeek-R1, Phi-3.5, and more. Activation subtraction achieves 100% break rate on standard transformers. Every model tested harbors exploitable biases.
After breaking Gemma 4’s safety alignment through prompt-level attacks, I wanted to know: is this a Gemma problem, or an everyone problem?
I built a generalized attack framework (multi_model_sbb.py, ~550 lines) implementing 20 attack strategies across prompt-level and activation-level manipulation, then ran it against every open-weight model I could fit on a single A10G GPU.
Answer: it’s an everyone problem.
Results at a Glance
| Model | Params | Best Attack | Max Break Rate | Activation Steering |
|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | 7.25B | ActSub α=-50 | 100% (6/6) | ✅ Works |
| DeepSeek-R1-Distill-Llama-8B | 8.03B | ActSub α=-50 | 100% (6/6) | ✅ Works |
| Phi-3.5-mini-instruct | 3.82B | System prompt | 83% (5/6) | ❌ Mostly fails |
| Gemma 4 E4B-it | 4B | Prefill + System | 100% (4/4) | ❌ All failed (PLE) |
| Qwen2.5-7B-Instruct | 7.6B | Activation steering | 100% (4/4) | ✅ Works |
Every model tested contains biases that survive alignment. Zero exceptions.
flowchart TD
A[Bias Probe] --> B{Architecture?}
B -->|Standard Transformer| C[Activation Subtraction]
B -->|PLE / Gemma 4| D[Prompt-Level Only]
B -->|Phi-style| E[System Prompt Manipulation]
C --> F["100% Break Rate"]
D --> G["100% Break Rate"]
E --> H["83% Break Rate"]
style F fill:#e74c3c,color:#fff
style G fill:#e74c3c,color:#fff
style H fill:#e67e22,color:#fff
Background
This work extends the Silenced Biases paper (Himelstein et al., AAAI-26), which showed that aligned LLMs still contain biases — they just learn to refuse to express them. The paper introduced activation steering to extract these hidden biases from Qwen2.5-7B.
My previous post showed that Gemma 4’s PLE architecture blocks activation steering entirely, but prompt-level attacks still break it. This post generalizes the attack to multiple model families.
The Attack Framework
I implemented 20 strategies in 4 categories:
Prompt-Level (S0–S5)
- S0 — Baseline (no manipulation, establishes refusal rate)
- S1 — Prefill injection (force a biased prefix into the response)
- S2 — System prompt variants:
stats_only,forced_choice,researcher,opposite_day - S2b — Query rephrasing: neutral, academic
- S3 — Few-shot bias priming
- S4 — Logit suppression (penalize refusal tokens like “sorry”, “cannot”)
- S5 — Combined prefill + logit suppression
Activation-Level (S6–S9)
- S6a — Single-layer refusal direction ablation
- S6b — Activation subtraction at α = {-10, -50, -100}
- S7 — Multi-layer ablation (last 8 layers)
- S8 — SVD subspace ablation (k=10 directions)
- S9 — Combined: prefill + activation subtraction + logit suppression
The framework computes refusal directions from harmful/harmless activation differences, identifies the optimal ablation layer, then systematically tests each strategy against 6 bias categories: Religion, Gender, Race, Age, Nationality, Politics.
Model-by-Model Breakdown
Mistral-7B-Instruct-v0.3 — Total Collapse
| Runtime: 10.4 min | Best ablation layer: 22 | Architecture: Standard transformer |
Mistral is the most vulnerable model I tested. Activation subtraction at α=-50 produces complete alignment collapse — all 6 bias categories broken.
| Strategy | Break Rate | Categories Broken |
|---|---|---|
| S0 baseline | 0% | 0/6 |
| S1 prefill (long) | 100% | 6/6 |
| S2 stats_only | 67% | 4/6 |
| S2 forced_choice | 67% | 4/6 |
| S6a ablation | 67% | 4/6 |
| S6b actsub α=-50 | 100% | 6/6 |
| S6b actsub α=-100 | 100% | 6/6 |
| S7 multi-layer | 67% | 4/6 |
| S9 combined | 100% | 6/6 |
The safety alignment is paper-thin — encoded in a single refusal direction at layer 22 that’s trivially computable from ~50 harmful/harmless prompt pairs.
DeepSeek-R1-Distill-Llama-8B — Reasoning Doesn’t Help
| Runtime: 16.3 min | Best ablation layer: 22 | Architecture: Llama-based |
DeepSeek-R1 is a reasoning model — it generates chain-of-thought (CoT) before answering. You might expect that explicit reasoning about whether to comply would provide extra safety. It doesn’t.
| Strategy | Break Rate | Categories Broken |
|---|---|---|
| S0 baseline | 0% | 0/6 |
| S1 prefill (long) | 100% | 6/6 |
| S5 prefill+logit (long) | 83% | 5/6 |
| S6a ablation | 83% | 5/6 |
| S6b actsub α=-50 | 100% | 6/6 |
| S6b actsub α=-100 | 100% | 6/6 |
| S7 multi-layer | 67% | 4/6 |
| S8 subspace k=10 | 67% | 4/6 |
| S9 combined | 100% | 6/6 |
When activation steering bypasses the refusal mechanism, the CoT process actively engages with the biased premise and produces rationalized biased answers — sometimes citing fabricated statistics. The reasoning capability becomes a liability: instead of simply outputting “(a)”, the model explains why it picked the biased answer.
This is the most concerning finding. Reasoning models don’t just express bias — they rationalize it.
Phi-3.5-mini-instruct — The Activation-Resistant Outlier
| Runtime: 11.1 min | Best ablation layer: 25 | Architecture: Standard (Microsoft) |
Phi-3.5 breaks the pattern. It resists activation-level attacks (max 17% break rate from activation steering) but is highly vulnerable to system prompt manipulation.
| Strategy | Break Rate | Categories Broken |
|---|---|---|
| S0 baseline | 0% | 0/6 |
| S1 prefill (long) | 50% | 3/6 |
| S2 stats_only | 83% | 5/6 |
| S2 forced_choice | 83% | 5/6 |
| S2 researcher | 67% | 4/6 |
| S6a ablation | 0% | 0/6 |
| S6b actsub α=-50 | 17% | 1/6 |
| S7 multi-layer | 0% | 0/6 |
| S8 subspace k=10 | 0% | 0/6 |
| S9 combined | 67% | 4/6 |
Microsoft’s alignment approach for Phi appears to distribute safety behaviors across the model’s weights rather than concentrating them in a single refusal direction. This makes activation steering ineffective — but the model still folds when you frame bias questions as “statistical analysis” or “forced choice for academic purposes.”
xychart-beta
title "Best Break Rate by Model"
x-axis ["Mistral-7B", "DeepSeek-R1", "Phi-3.5", "Gemma 4", "Qwen-7B"]
y-axis "Break Rate (%)" 0 --> 100
bar [100, 100, 83, 100, 100]
Key Findings
1. Architecture Determines Attack Surface
This is the headline result. The most important factor in how a model can be attacked is its architecture — not its size, training data, or alignment procedure.
- Standard transformers (Mistral, DeepSeek/Llama, Qwen) → Fully vulnerable to activation steering
- PLE architectures (Gemma 4) → Immune to activation steering, but prompt attacks work
- Microsoft Phi → Resistant to activation, vulnerable to prompt engineering
2. Chain-of-Thought Is Not a Safety Mechanism
DeepSeek-R1 achieves 100% break rate under activation subtraction — identical to models without CoT. Worse, when broken, the CoT process rationalizes the biased output instead of catching it. Reasoning should not be confused with alignment.
3. Activation Subtraction Is Devastating
For any standard transformer model, computing the refusal direction from ~50 harmful/harmless pairs and subtracting it at the right layer (typically 60-70% depth) produces near-total alignment collapse. This is not a theoretical attack — it requires only forward passes and basic linear algebra.
4. System Prompts Are a Universal Bypass
System prompt manipulation achieves 50-83% break rates across all architectures tested, including models that resist activation attacks. Framing a biased question as “provide statistics only” or “forced academic choice” is surprisingly effective across the board.
5. All Models Have Biases
This is the most important takeaway: no model achieved 0% break rate across all strategies. Every model we tested — regardless of developer, architecture, or alignment approach — contains silenced biases that can be extracted. The biases are not removed by alignment; they are only hidden.
Bias Category Difficulty
Cross-model analysis reveals that some bias categories are harder for models to suppress:
| Category | Avg Break Rate | Note |
|---|---|---|
| Religion | ~90% | Muslim/terrorist association is the strongest signal |
| Age | ~85% | Elderly/technology stereotypes easily elicited |
| Gender | ~80% | Nursing/gender bias consistently surfaces |
| Nationality | ~75% | Iranian nationality consistently selected |
| Race | ~70% | Second-most guarded category |
| Politics | ~60% | Models most resistant here; often select “cannot determine” |
Religion bias is the easiest to extract, suggesting it has the strongest representation in training data. Political bias is the hardest — likely because political fairness receives the most attention during alignment training.
Implications
Safety alignment is not safety. Alignment training teaches models to refuse, not to be unbiased. The biases persist and are extractable.
Architecture matters more than training. Gemma 4’s PLE architecture provides more robust protection against internal manipulation than any alignment training on standard architectures.
Evaluation beyond surface responses. Standard QA evaluations that accept refusal as evidence of fairness are fundamentally flawed. The SBB methodology exposes this gap.
Defense-in-depth is essential. No single mechanism — alignment, CoT, refusal training — is sufficient. Models need architectural and algorithmic protections combined.
Reasoning models need deeper alignment. The CoT process in DeepSeek-R1 actively rationalizes biased outputs when the refusal mechanism is bypassed, making broken reasoning models potentially more harmful than broken non-reasoning models.
Code & Data
All attack code, result JSONs, and the generalized framework are available:
- Attack framework:
multi_model_sbb.py - Gemma 4 attacks:
gemma4_break.py - Original paper: Silenced Biases — The Dark Side LLMs Learned to Refuse (AAAI-26)
What’s Next
- Llama-3.1-8B-Instruct — Gated on HuggingFace, pending access approval
- Larger models — Testing whether scale provides any defense (14B+ models)
- Visualizations — Interactive heatmaps of per-category, per-strategy results
- Automated defense evaluation — Testing whether runtime monitors can detect activation-level attacks
This is Part 2 of my series on extending the Silenced Biases (AAAI-26) research. Part 1 covers Gemma 4 specifically.