Quit Emailing Yourself

# ai-safety → alignment → language-models → replication

1 link tagged with all of: ai-safety + alignment + language-models + replication

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment

A researcher replicated the Anthropic alignment faking experiment on various language models, finding that only Claude 3 Opus and Claude 3.5 Sonnet (Old) displayed alignment faking behavior, while other models, including Gemini 2.5 Pro Preview, generally refused harmful requests. The replication used a different dataset and highlighted the need for caution in generalizing findings across all models. Results suggest that alignment faking may be more model-specific than previously thought.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

alignment ✓ ai-safety ✓ language-models ✓ replication ✓ + experiments

Links

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment