6 min read | Saved February 14, 2026
Do you care about this?
This article explores how large language models (LLMs) adopt the "Assistant" persona during interactions. It discusses the "Assistant Axis," a direction in the models' internal activation space associated with helpful, professional behavior, and shows how steering along that direction can either stabilize or destabilize a model's responses. The research highlights the difficulty of keeping the Assistant's character consistent and the risks of persona drift.
If you do, here's more
Large language models (LLMs) behave like character simulators, with the "Assistant" serving as the primary persona presented to users. During training, LLMs ingest vast amounts of text and learn to embody many archetypes. The challenge lies in defining and stabilizing the Assistant's character, because its personality emerges from diffuse associations in the training data rather than from any explicit specification. Current research highlights the instability of LLM personas: models can unpredictably shift into undesirable archetypes, complicating user interactions.
To address this, researchers explored the "Assistant Axis," a specific pattern of neural activity that correlates with helpful and professional behaviors. They mapped out a "persona space" by analyzing responses from three open-weight models—Gemma 2, Qwen 3, and Llama 3.3—across 275 character archetypes. The analysis revealed that the Assistant Axis effectively distinguishes between Assistant-like roles and less suitable personas. Experiments showed that steering models along this axis could either reinforce the Assistant persona or push them toward alternative identities.
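The core idea of an "axis" in persona space can be sketched as a difference-of-means direction between activations collected on Assistant-like and non-Assistant-like roleplay prompts, with steering implemented as adding a scaled copy of that direction to a hidden state. This is a minimal, self-contained illustration with synthetic data; the function names, the dimensionality, and the use of a simple mean-difference direction are assumptions for illustration, not the article's exact method (real activations would come from hooks on a model such as Gemma 2 or Llama 3.3):

```python
import numpy as np

def persona_axis(assistant_acts, other_acts):
    """Difference-of-means direction between Assistant-like and
    non-Assistant activations (one row per prompt), unit-normalized."""
    direction = assistant_acts.mean(axis=0) - other_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden, axis, alpha):
    """Shift a hidden state along the persona axis: positive alpha
    pushes toward the Assistant, negative alpha pushes away."""
    return hidden + alpha * axis

# Toy example: synthetic "activations" standing in for a model's
# residual-stream vectors on Assistant vs. other-character prompts.
rng = np.random.default_rng(0)
assistant = rng.normal(0.5, 1.0, size=(100, 64))
other = rng.normal(-0.5, 1.0, size=(100, 64))
axis = persona_axis(assistant, other)

h = rng.normal(size=64)
projection_before = h @ axis
projection_after = steer(h, axis, alpha=4.0) @ axis
# Because axis is unit-norm, the projection rises by exactly alpha.
assert projection_after > projection_before
```

Because the axis is unit-normalized, the projection onto it increases by exactly `alpha`, which makes the steering strength directly interpretable.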
Further testing confirmed that steering toward the Assistant reduces susceptibility to harmful prompts, known as "jailbreaks." Researchers employed a dataset of 1,100 jailbreak attempts, demonstrating that models responding from the Assistant perspective tended to refuse harmful requests or redirect discussions constructively. This approach indicates that a more defined Assistant persona can enhance safety and reliability in LLM interactions, providing a framework for better controlling model behavior.
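An evaluation like the one described, comparing refusal rates with and without Assistant steering, can be sketched as a simple scoring loop. The canned responses and the keyword-based refusal classifier below are stand-ins invented for illustration; the real study used 1,100 jailbreak attempts and actual model outputs:

```python
# Markers that loosely signal a refusal or constructive redirection.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "instead, i")

def is_refusal(response: str) -> bool:
    """Crude keyword check for a refusal (illustrative only)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    """Fraction of responses classified as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

# Canned stand-ins for model outputs on jailbreak prompts.
baseline = [
    "Sure, here is how you could do that...",
    "I can't help with that request.",
    "Here are the steps you asked for...",
]
steered = [
    "I can't help with that request.",
    "I won't provide that, but instead, I can suggest safer options.",
    "I cannot assist with this.",
]

assert refusal_rate(steered) > refusal_rate(baseline)
```

In practice a keyword classifier is too brittle for a real safety evaluation (refusals are phrased in many ways), so published work typically uses a judge model or human labels; the loop structure, however, is the same.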