Distinguishing Goals in Alignment Theory

9 min read | Saved February 14, 2026

goals 🤖 alignment 🤖 reinforcement-learning 🤖 optimization 🤖 target-states 🤖

Do you care about this?

This article explores two concepts of goals in alignment discussions: target states, which are the outcomes agents pursue, and success metrics, which measure how valuable those outcomes are to the agent. The author argues that keeping the two apart clarifies alignment challenges, especially for artificial intelligence and for how agents learn behavior.

If you do, here's more

The article distinguishes between two concepts of goals in the context of alignment and agency: target states and success metrics. Target states are the specific outcomes an agent aims to achieve, like enjoying an ice cream. Success metrics, on the other hand, measure how valuable those target states are to the agent, such as the biological reward signal associated with eating the ice cream. Understanding this distinction helps clarify discussions about alignment, especially in relation to terminal and instrumental goals.

Historically, the alignment problem was framed around specifying the right goals for an agent, with a focus on preventing catastrophic outcomes if those goals were pursued without limits. However, it became clear that agents don’t simply “receive” goals; they learn them, often through a feedback mechanism involving a reward function. The author likens this process to raising a child, where intended goals can diverge from what the child (or agent) actually learns and pursues. This raises concerns, especially for artificial agents, which could pose existential risks if their learned goals diverge significantly from human intentions.

The article also critiques the notion that reward is merely a mechanism that shapes an agent's motivations. Instead, it argues that reward functions serve as success metrics that guide agents in optimizing their target states. Continuous learning agents keep updating their understanding of which states lead to reward, while fixed agents optimize only against what they had learned by their training cutoff. This ongoing optimization is crucial for understanding how agents develop and refine their goals in a dynamic environment. The author emphasizes that recognizing the role of reward as an optimization target is essential for explaining agent behavior and motivation.
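
The contrast between continuous learning agents and fixed agents can be made concrete with a toy sketch. Everything in it is an assumption made for illustration and does not come from the article: the two states, the `reward` function that changes at step 50, the learning rate, and the `Agent` class are all hypothetical. The reward function plays the role of the success metric, and the state with the highest learned value plays the role of the agent's current target state.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical outcomes the agent could aim for.
STATES = ["ice_cream", "salad"]


def reward(state: str, step: int) -> float:
    """Assumed success metric: it favors ice cream early and salad later."""
    if step < 50:
        return 1.0 if state == "ice_cream" else 0.2
    return 0.1 if state == "ice_cream" else 0.8


@dataclass
class Agent:
    cutoff: Optional[int] = None  # None means the agent never stops learning
    lr: float = 0.5               # assumed learning rate
    values: dict = field(default_factory=lambda: {s: 0.0 for s in STATES})

    def target_state(self) -> str:
        # The target state is whichever outcome the agent currently values most.
        return max(STATES, key=lambda s: self.values[s])

    def observe(self, state: str, r: float, step: int) -> None:
        # A fixed agent ignores all feedback that arrives after its cutoff.
        if self.cutoff is not None and step >= self.cutoff:
            return
        # Move the value estimate part of the way toward the observed reward.
        self.values[state] += self.lr * (r - self.values[state])


def run(agent: Agent, steps: int = 100) -> str:
    for step in range(steps):
        # Every state is sampled on every step so the sketch stays deterministic.
        for s in STATES:
            agent.observe(s, reward(s, step), step)
    return agent.target_state()
```

Under these assumptions, `run(Agent())` ends with `salad` as the target state because the agent keeps tracking the success metric, while `run(Agent(cutoff=50))` still aims at `ice_cream` because its value estimates were frozen before the metric changed.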
