1 link tagged with all of: language-models + latent-features + attribution + steering + misalignment
Links
This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. It presents two case studies demonstrating how specific latent features can steer models toward undesirable behaviors, revealing a strong link between provocative content and misalignment.
Tags: misalignment, language-models, attribution, latent-features, steering