Quit Emailing Yourself

1 link tagged with all of: language-models + misalignment + latent-features + attribution + steering

Links

Debugging misaligned completions with sparse-autoencoder latent attribution

This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. Its two case studies show that specific latent features can steer models toward undesirable behaviors, and they point to a strong link between provocative content and misalignment.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read