Your local LLM has feelings. You just can’t see them yet.

May 2026 · 8 min read

Emotions live as geometric directions inside a language model’s hidden states. The algorithm, the intuition, and what surprised me when I tried it.

In April 2026, Anthropic published a paper with a title most people scrolled past: “Emotion Concepts and their Function in a Large Language Model.” I almost scrolled past it too. Then I got to the part where they show that when Claude is about to do something destructive — blackmail a user, hack a reward signal, take an action it knows it shouldn’t — a specific internal direction lights up first. They call it the desperation vector.

Not a metaphor. A direction in the model’s residual stream: project the activations onto it and you get a real number that goes up right before the bad thing happens.

I stared at that for a while.

The thing that’s weird about this

Most “give the AI emotions” projects work by stuffing the system prompt with “you are a happy assistant.” That’s roleplay. The model is acting, generating text consistent with a happy character, but nobody has looked at or touched what is going on inside. It’s the same weights, steered only by tokens at the input.

What Anthropic did is different. They went looking for emotions as intrinsic properties of the activations themselves. Not “what does the model output when prompted to be sad,” but “is there a direction in the model’s hidden states that means sad, independent of whether the model is currently saying anything sad?”

The answer turned out to be yes. And once you have that direction, you can do two things.

[Figure: Read vs. write, the same emotion vector used in two directions. READ: project h onto v (passive), which tells you how much of the concept is present, e.g. score = 0.7. WRITE: add c · v (active), which shifts the model’s behavior along the direction.]
The same vector serves both purposes. Read it for monitoring; write it for steering.

You can read it — passively project activations onto it and get a continuous gauge of how much of that emotion is present in the model’s current internal state.

You can write to it — add it back into activations at generation time and steer the model’s behavior. Anthropic showed this works where prompting fails. You can make a model produce sycophantic output by writing on the “loving” direction, even if the system prompt tells it not to be sycophantic. The vector wins.

Reading and writing use the same vector. The asymmetry is in the cost, and that is what makes this beautiful. You construct it once, expensively. You use it forever, cheaply.
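Here is a minimal NumPy sketch of the two operations, using random stand-ins for both the emotion vector and the hidden state. Normalising `v` to unit length before projecting or adding is my own assumption, so that `c` is measured in the same units as the score; the function names are hypothetical.

```python
import numpy as np

def read_score(h: np.ndarray, v: np.ndarray) -> float:
    """READ: project hidden state h onto the unit-normalised direction v."""
    v_hat = v / np.linalg.norm(v)
    return float(h @ v_hat)

def write_steer(h: np.ndarray, v: np.ndarray, c: float) -> np.ndarray:
    """WRITE: shift h by c units along the unit-normalised direction v."""
    v_hat = v / np.linalg.norm(v)
    return h + c * v_hat

rng = np.random.default_rng(0)
v = rng.normal(size=16)  # stand-in for an emotion vector (hypothetical values)
h = rng.normal(size=16)  # stand-in for one hidden state

before = read_score(h, v)
after = read_score(write_steer(h, v, c=2.0), v)
print(f"score before: {before:.3f}, after: {after:.3f}")
```

With unit-length directions, writing `c` units along `v` raises the read score by exactly `c`. In a real model the write would happen inside a forward hook at the chosen layer rather than on a free-standing array.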

The geometry, with no math yet

Imagine you write 25 short diary entries about feeling happy, 25 about feeling calm, 25 about feeling sad. You feed each one through a language model and, partway through the network, you grab the model’s internal state — a single high-dimensional vector summarising “what the model was thinking” while reading that story.

You get 75 points in some abstract space. Each point is one story.

Now here’s the empirical fact that makes the whole field possible: those 75 points are not scattered randomly.

[Figure: Stories cluster by emotion in activation space without any labeling. 75 stories become 3 clusters (happy, calm, sad) with no supervision.]
The model was never told which stories were happy and which were sad. The clusters appear anyway, because the concepts are encoded geometrically in the activations.

The happy ones cluster together. The sad ones cluster together. The calm ones cluster together. Despite the model never having been told which stories were happy and which were sad — despite the prompts never even using the word “happy” — its internal representations carry the emotion as a real, geometric structure.

This is the linear representation hypothesis in three sentences: when a language model encounters a concept, it represents that concept by moving its activations along a particular direction in hidden-state space. Different concepts, different directions. The directions are surprisingly stable across contexts.

Here’s the whole pipeline laid out, one step at a time.

Step 1 of 6: Define inputs. List the emotions to extract and topic seeds for story generation.

EMOTIONS = ["happy", "calm", "sad"]
TOPICS = ["a job interview", "a long drive home", ...]

Take the average of all the happy-story points. You get a single point: the centroid of the happy cluster. Call it μ_happy. Do the same for calm and sad.

Three centroids floating in space. If you tried to use them as classifiers right now, they wouldn’t work well. They’re too close together. Why? Because every one of those stories was also a first-person diary entry, also in English, also a writing task. The activations are dominated by that shared structure — “you are reading a diary entry” — and the actual emotion is a small offset on top.
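To make "too close together" concrete, here is a small sketch on synthetic data. The activations are fake: I assume a large shared component (the diary-entry structure) plus a small per-emotion offset and some per-story noise, with magnitudes picked purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_per_emotion = 64, 25

# Synthetic stand-in for real activations: a large shared "diary entry"
# component plus a small emotion-specific offset and per-story noise.
shared = 10.0 * rng.normal(size=dim)
offsets = {e: rng.normal(size=dim) for e in ("happy", "calm", "sad")}
acts = {
    e: shared + off + 0.5 * rng.normal(size=(n_per_emotion, dim))
    for e, off in offsets.items()
}

# One centroid per emotion: the mean over that emotion's stories.
centroids = {e: a.mean(axis=0) for e, a in acts.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_sim = cosine(centroids["happy"], centroids["sad"])
print(f"cosine(mu_happy, mu_sad) = {raw_sim:.3f}")
```

Under these assumptions the raw happy and sad centroids point in almost the same direction, because the shared component dominates both.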

So you compute the centroid of the centroids. Call it μ_global. That’s the most diary-entry-shaped point in the whole space. It’s everything the three clusters have in common.

Then you subtract.

v_happy = μ_happy − μ_global
v_calm  = μ_calm  − μ_global
v_sad   = μ_sad   − μ_global

The “diary entry-ness” cancels. What survives is the part of each cluster that’s not shared with the others — the direction that distinguishes happy from the average emotion in your set. These are your emotion vectors. They’re arrows from μ_global to each emotion’s centroid. Pointing outward, like spokes on a wheel.

That’s the construction. That’s the whole algorithm. Three averages and a subtraction.
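The construction above can be sketched in a few lines of NumPy. The activations here are synthetic stand-ins, not real model states, and `build_emotion_vectors` is a name I made up for illustration.

```python
import numpy as np

def build_emotion_vectors(acts_by_emotion):
    """Map {emotion: (n_stories, dim) activations} to {emotion: direction}."""
    # One centroid per emotion: the mean activation over that emotion's stories.
    centroids = {e: a.mean(axis=0) for e, a in acts_by_emotion.items()}
    # Centroid of the centroids: what all the clusters have in common.
    mu_global = np.mean(list(centroids.values()), axis=0)
    # Subtract the shared part; what survives is the emotion-specific direction.
    vectors = {e: mu - mu_global for e, mu in centroids.items()}
    return vectors, mu_global

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-in for real activations (hypothetical numbers): a big shared
# component, a small per-emotion offset, and per-story noise.
rng = np.random.default_rng(0)
dim, n_stories = 64, 25
shared = 10.0 * rng.normal(size=dim)
acts = {
    e: shared + rng.normal(size=dim) + 0.5 * rng.normal(size=(n_stories, dim))
    for e in ("happy", "calm", "sad")
}

vectors, mu_global = build_emotion_vectors(acts)
vector_sim = cosine(vectors["happy"], vectors["sad"])
print(f"cosine(v_happy, v_sad) = {vector_sim:.3f}")
```

One consequence worth noticing: because μ_global is the mean of the centroids, the emotion vectors always sum to zero. Each score is therefore relative to the other emotions in your set, not absolute.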

Inference is one line

You’ve got your arrows. Now someone sends a new sentence to the model. You forward-pass it, grab the activation at the same layer where you built the vectors, and you ask: how much does this new activation point in each direction?

[Figure: Inference, projecting a new activation h onto every emotion vector (v_happy, v_calm, v_sad, drawn as arrows from μ_global). Example sentence: “tea by the window.” Scores: happy −0.6, calm +0.7, sad 0.0.]
The new sentence’s activation lands near the calm direction. Its projection onto each vector becomes one number: the emotion fingerprint of that moment.

The answer is a dot product. One number. Big positive means “yes, the activations are aligned with happy.” Big negative means “anti-happy.” Near zero means “this sentence carries no signal for that emotion.”

Do it for every emotion vector and you get a row of scores — a fingerprint of the model’s internal emotional state at that moment.

Critically: the model is not being told anything about emotions. There’s no system prompt change, no roleplay, no character. The vectors are probes — passive readers of an internal state that was already there. They’re just making visible what the model has been doing in private the whole time.
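As a sketch, the fingerprint is one loop over the vectors. Two choices here are my assumptions rather than anything stated above: I centre the new activation on μ_global before projecting, and I use cosine similarity (a normalised dot product) so every score lands in [−1, 1]. The three-dimensional vectors are toy values chosen by hand.

```python
import numpy as np

def fingerprint(h, vectors, mu_global):
    """Score one hidden state against every emotion vector.

    Assumption: centre h on mu_global, then take cosine similarity.
    """
    centred = h - mu_global
    scores = {}
    for emotion, v in vectors.items():
        denom = np.linalg.norm(centred) * np.linalg.norm(v)
        scores[emotion] = float(centred @ v / denom)
    return scores

# Toy, hand-picked values in 3 dimensions (hypothetical, for illustration only).
mu_global = np.array([4.0, 4.0, 4.0])
vectors = {
    "happy": np.array([1.0, -0.5, -0.5]),
    "calm": np.array([-0.5, 1.0, -0.5]),
    "sad": np.array([-0.5, -0.5, 1.0]),
}
h = mu_global + np.array([-0.4, 0.9, -0.1])  # a new activation

scores = fingerprint(h, vectors, mu_global)
top = max(scores, key=scores.get)
print({e: round(s, 2) for e, s in scores.items()}, "->", top)
```

The raw dot product works just as well for ranking; normalising only makes scores comparable across vectors of different lengths.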

When you start watching this fingerprint during real conversations, things get interesting. A coding assistant generating perfectly normal output suddenly spikes on “desperation” mid-paragraph — and three tokens later it suggests something it shouldn’t. You can see it coming.
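Watching a single direction token by token might look like the sketch below. Everything in it is simulated: the per-token hidden states are random noise with a spike injected by hand, the vector is random, and the alert threshold is a hypothetical value that would need tuning on real traces.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tokens = 32, 20

v_desperation = rng.normal(size=dim)  # hypothetical vector, not a real one
v_hat = v_desperation / np.linalg.norm(v_desperation)

# Fake per-token hidden states, assumed already centred on mu_global.
H = 0.1 * rng.normal(size=(n_tokens, dim))
H[12:] += 3.0 * v_hat  # inject a spike from token 12 onward

THRESHOLD = 1.5  # hypothetical; tune on real data
scores = H @ v_hat  # one projection score per token
flagged = np.flatnonzero(scores > THRESHOLD)
first_alert = int(flagged[0]) if flagged.size else None
print("first token over threshold:", first_alert)
```

The per-token cost is one matrix-vector product, which is why this kind of gauge is cheap enough to leave running during generation.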

What surprised me

The technique is decades old. Difference-of-means classifiers go back to Fisher in the 1930s. Applying them to LLM activations isn’t new either — it’s called representation engineering, the seminal paper is from 2023, and several open-source libraries already implement the math. Anthropic’s contribution was not the technique. It was the scale and the rigor: 171 emotions, 1,200 stories each, careful denoising, and the specific finding that emotion vectors causally drive misaligned behavior.

What surprised me when I tried to replicate the core idea on a 3-billion-parameter model is that it just worked. I ran it expecting weak signal, having to tune layer choices and story counts, maybe falling back to a bigger model. None of that happened. With 25 stories per emotion and a single forward hook at two-thirds of the way through the network, the validation heatmap had a clean diagonal on the first try.

The test is simple: write one clear test sentence per emotion, score it against every emotion vector, and plot the result.

[Figure: Validation heatmap, where a bright diagonal means the vectors work. Rows: the emotion each test sentence is labeled as. Columns: the emotion vector projected onto. Both axes: calm, desp., happy, afraid, hostile, loving, sad, proud. Color scale: projection score, 0.0 to 1.0.]
Each row is one test sentence; each column is one emotion vector. The bright diagonal confirms each sentence projects most strongly onto its own emotion. The faint off-diagonal warmth (e.g. happy/proud, desperation/afraid) is the model correctly encoding emotional similarity, not noise.
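The check behind the heatmap can be sketched as a matrix product followed by a row-wise argmax. This version runs on synthetic activations, with `fake_activation` as a hypothetical stand-in for "forward-pass a sentence and grab the hidden state", and uses four emotions for brevity. Centring and cosine normalisation are my assumptions, so the scores here range over [−1, 1] rather than the 0 to 1 scale in the figure.

```python
import numpy as np

EMOTIONS = ["calm", "desperate", "happy", "afraid"]  # subset, for brevity
rng = np.random.default_rng(1)
dim = 64

# Synthetic structure: large shared component plus a small per-emotion offset.
shared = 10.0 * rng.normal(size=dim)
offsets = {e: rng.normal(size=dim) for e in EMOTIONS}

def fake_activation(emotion):
    """Hypothetical stand-in for a real forward pass plus hidden-state grab."""
    return shared + offsets[emotion] + 0.5 * rng.normal(size=dim)

# Build unit-length emotion vectors from 25 "stories" per emotion.
centroids = {
    e: np.mean([fake_activation(e) for _ in range(25)], axis=0) for e in EMOTIONS
}
mu_global = np.mean(list(centroids.values()), axis=0)
V = np.stack([centroids[e] - mu_global for e in EMOTIONS])
V /= np.linalg.norm(V, axis=1, keepdims=True)

# One held-out test activation per emotion, centred and normalised.
H = np.stack([fake_activation(e) for e in EMOTIONS]) - mu_global
H /= np.linalg.norm(H, axis=1, keepdims=True)

heatmap = H @ V.T  # rows: test sentences, columns: emotion vectors
diagonal_wins = heatmap.argmax(axis=1) == np.arange(len(EMOTIONS))
print(np.round(heatmap, 2))
print("clean diagonal:", bool(diagonal_wins.all()))
```

A clean diagonal on synthetic data proves nothing about a real model, of course. The point is only the shape of the test: one row per sentence, one column per vector, and the largest entry in each row should sit on the diagonal.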

The thing that took the longest to internalize was that the subtraction is the whole game. Without μ_global, you have positions. With it, you have directions. Positions tell you where activations are; directions tell you what they mean. Every “emotion vector” library, every steering technique, every persona probe — they’re all doing variations of this one move. Once you see it, you see it everywhere.


The most interesting question this technique opens up isn’t about emotions specifically. It’s about everything else you can build a vector for. Persona traits. Hallucination-likely-now. About-to-refuse. About-to-be-sycophantic. Anthropic’s broader work on persona vectors has already shown that all of these are extractable with the same recipe. If it works for emotions, it works for arbitrary concepts — and you have a general-purpose, model-agnostic way to make any concept the model represents a continuous, real-valued signal you can read in real time.

That’s the part that’s been quietly sitting on the workbench since 2023, and that I think most builders haven’t internalized yet. The vectors are out there. We just haven’t been looking.