Quit Emailing Yourself

# language-models → steering

1 link tagged with all of: language-models + steering

Links

Debugging misaligned completions with sparse-autoencoder latent attribution

This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. It presents two case studies demonstrating how specific latent features can steer models toward undesirable behaviors, revealing a strong link between provocative content and misalignment.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read

Tags: misalignment, language-models, attribution, latent-features, steering