1 link tagged with all of: language-models + misalignment + steering + attribution + latent-features
Links
This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. It presents two case studies showing how specific latent features can steer models toward undesirable behaviors, and it reports a strong link between provocative content and misalignment.
Tags: misalignment, language-models, attribution, latent-features, steering