Quit Emailing Yourself

# language-models → misalignment → steering → attribution → latent-features

1 link tagged with all of: language-models + misalignment + steering + attribution + latent-features

Links

Debugging misaligned completions with sparse-autoencoder latent attribution

This article explores how sparse-autoencoder (SAE) latent attribution can trace misaligned completions in language models back to their causes. It presents two case studies in which specific latent features steer models toward undesirable behaviors, revealing a strong link between provocative content and misalignment.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read
