About

Hi, I’m Rotem Levi — a security researcher focused on AI/ML security, LLM safety alignment, and bias detection.

This blog documents my research into how large language models handle (and hide) sensitive topics like bias, fairness, and safety. I break models to understand them better.

Research Interests

  • LLM Safety Alignment — How models learn to refuse, and what happens when that refusal is bypassed
  • Activation Steering — Manipulating model internals to reveal hidden behaviors
  • AI Bias Detection — Going beyond surface-level benchmarks to measure what models actually “think”
  • Adversarial ML — Finding the edges of model robustness
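
To give a flavor of the activation-steering idea above: a common approach is to take the difference of mean activations between two contrasting prompt sets at one layer, then add a scaled copy of that direction to hidden states at inference time. The sketch below is a toy NumPy illustration of that "diff-in-means" recipe, not my actual research code; all arrays and names here are made up for demonstration.

```python
import numpy as np

# Toy sketch of activation steering (illustrative only).
# Idea: derive a "steering vector" from the difference of mean activations
# between two prompt sets (e.g. refusal vs. compliance), then add a scaled
# copy of it to a layer's hidden state at inference time.

rng = np.random.default_rng(0)

# Hypothetical hidden states collected at one layer (n_prompts x d_model).
refusal_acts = rng.normal(loc=1.0, size=(8, 16))
comply_acts = rng.normal(loc=-1.0, size=(8, 16))

# Steering vector: difference of means, unit-normalized.
steer = refusal_acts.mean(axis=0) - comply_acts.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden, vector, alpha=4.0):
    """Add a scaled steering vector to a hidden state."""
    return hidden + alpha * vector

h = rng.normal(size=16)            # a fresh activation to steer
h_steered = apply_steering(h, steer)

# The steered activation moves toward the "refusal" direction:
print(np.dot(h_steered, steer) > np.dot(h, steer))  # True
```

In a real model the same operation is applied via a forward hook on a chosen transformer layer; the interesting research questions are which layer, which direction, and what behaviors surface once the refusal direction is suppressed or amplified.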

Current Work

I’m currently extending the “Silenced Biases” research (AAAI-26) to Google’s Gemma 4 architecture, exploring how Per-Layer Embedding (PLE) affects the robustness of safety alignment.