1 link tagged with all of: language-models + misalignment + latent-features + steering
Links
This article explores how sparse-autoencoder latent attribution can identify the causes of misalignment in language models. It presents two case studies demonstrating how specific latent features can steer models toward undesirable behaviors, revealing a strong link between provocative content and misalignment.