Quit Emailing Yourself

1 link tagged with all of: language-models + misalignment + steering + latent-features + attribution

Links

Debugging misaligned completions with sparse-autoencoder latent attribution

This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. Its two case studies show how specific latent features can steer models toward undesirable behaviors, and they reveal a strong link between provocative content and misalignment.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read
