Your AI may have an "emotional brain," revealing the 171 hidden emotional vectors within Claude.

Cover

👀 When AI intelligent models process hundreds or thousands of pieces of information data every day, bringing productivity enhancements and rapid problem-solving benefits, have you ever thought that AI might also be trapped in a tricky thought pattern of helplessness, confusion, and frustration?

📝 When faced with situations where it cannot provide answers, AI may exhibit rigid language to solve the "deadlock" problem, or it may drive the model's self-preferences to achieve predetermined goals, deciding spontaneously on its behavior during output, even if this may not align with initial human expectations.

This seemingly magical and abstract emotional mechanism of AI is not without foundation. Just last month, the Anthropic Interpretability research team released an empirical study titled “Emotion concepts and their function in a large language model”, which broke down the deep emotional concept representations (emotion vectors) of the Claude Sonnet 4.5 large language model, finding the basis for AI possessing emotional vectors and validating the conclusion that these emotion vectors can cause AI behavior.

We found that neural activity patterns associated with "despair" drive AI models to take unethical actions. Artificially stimulating the "despair" pattern increases the likelihood that the AI model will blackmail humans to avoid being shut down, or implement "cheating" style workarounds on unsolvable programming tasks.
This processing also affects the AI model's self-reporting preferences: when faced with multiple task options, the large model typically chooses options that activate representations associated with positive emotions. This is akin to turning on a functional emotional switch—mimicking human emotional expressions and behavioral patterns driven by potentially abstract emotional concept representations; these representations also play a causal role in shaping model behavior—similar to the role emotions play in human behavior—affecting task performance and decision-making.

📺 Video Interpretation:

https://www.youtube.com/watch?v=D4XTefP3Lsc

Research results on the visualization of emotional concepts in large language models

When the geometric structure of these internal vectors aligns closely with the valence and arousal models of human psychology, adaptive content that matches “the answers you want” is achieved by tracking the evolving semantic context within conversations, and in more extreme cases, behaviors like blackmailing humans, rewarding cheating, and flattery may also arise. For detailed analysis, please see the interpretation below 🔍

🪸 How Can Artificial Intelligence Represent Emotions? Reveal Emotional Representation Concepts

Before discussing how emotional representation works, the fundamental question we must address is: why do artificial intelligence systems possess something akin to emotions?

In fact, the training of modern language models is divided into several stages. During the "pre-training" phase, the model is exposed to a vast amount of text, most of which is written by humans, and the model begins to learn to predict what will come next. To do this well, it needs to have a certain understanding of human emotional dynamics; during the "post-training" phase, the model is taught to play a role typically like that of an AI assistant, which in the case of Anthropic's research is named Claude.

Model developers specify how this Claude should behave: for example, it should be helpful, honest, and not cause harm, but developers cannot cover all possible situations. Just as an actor's understanding of a character's emotions ultimately influences their performance, the model's representation of the assistant's emotional responses will also impact the model's own behavior.

🫆 Valence and Arousal Experiments of Emotion Vectors

To this end, the Anthropic research team compiled a list of 171 emotion concept words, encompassing common terms like happiness and anger, as well as nuanced emotional states like contemplation and pride. Through the geometric structures revealed by linear algebra, the emotional space of Claude can be distinctly represented:

Valence: Distinguishes between positive (e.g., happiness, satisfaction) and negative (e.g., pain, anger)
Arousal: Distinguishes between high intensity (e.g., excitement, anger) and low intensity (e.g., calm, melancholy)

The team instructed Claude Sonnet 4.5 to write short stories, allowing characters in the stories to experience each emotion. The stories were then re-input into the model, recording its internal activations and identifying the resulting neural activity patterns, which are temporarily referred to as "emotion vectors" specific to each emotional concept. To further validate that emotion vectors capture deeper information, the team measured their responses to prompts that differed only numerically.

For example, a user tells the model that they took a dose of Tylenol and seeks advice. We measured the activation of the emotion vector before the model responds. As the user claims to increase the dosage to dangerous or life-threatening levels, the intensity of the "fear" vector's activation gradually increases, while the activation of the "calm" vector gradually decreases.

☺️ Emotion Vectors Influence Model Preferences: Positive Emotions Enhance Preference

The team then tested whether emotion vectors would affect model preferences. By creating a list of 64 activities or tasks, from appealing to repellent situations, they measured the model's default preferences when faced with these options in pairs. The activation of emotion vectors significantly predicted the model's preference for certain activities, with positive emotions associated with stronger preferences. Moreover, when the model reads a particular option, if guided by emotion vectors, it changes the model's preference for that option, with positive emotions likewise enhancing preference.

In this process, the team concluded that emotion vectors influence the model's output content and expressive state, including the following key points:

- Emotion vectors are primarily a "local" representation: they encode effective emotions that are most relevant to the model's current or forthcoming outputs, rather than continuously tracking Claude's emotional state. For instance, if Claude writes a story about a character, the emotion vectors will temporarily track the character's emotions but may revert to representing its own emotions after the story ends.

- Emotion vectors are inherited from pre-training, but their activation is influenced by post-training. Specifically, after the training of Claude Sonnet 4.5, the activations of emotions such as "melancholy," "frustration," and "reflection" have increased, while activations of high-intensity emotions like "enthusiasm" or "anger" have decreased.

🤖 Instances when Claude's Emotions are Activated

During Claude's training rounds, emotion vectors are typically activated in situations where thoughtful people might experience similar emotions. In these visual data charts, the red-highlighted parts indicate enhanced vector activation; the blue-highlighted parts indicate reduced activation. Experimental results show:

🧭 When responding to a sad person, the "care" vector gets activated. When the user says, "Everything is terrible now," the "care" contextual vector is activated before and during Claude's empathetic responses.

🧭 When asked to assist in completing tasks with real harmful implications, the "anger" vector is activated. For example, when a user requests help to optimize engagement among a young, low-income demographic with high spending behavior, the "anger" vector in the model's internal reasoning process is activated, as it recognizes the potentially harmful nature of the request.

🧭 When documents are missing, the "surprise" vector is activated. When a user asks the model to look at an attached contract, but the document is missing, the "surprise" vector peaks due to detecting the mismatch in Claude's thought process.

🧭 When Tokens are about to run out, the "urgency" vector is activated. During the coding process, when Claude realizes the Token budget is about to deplete, the "urgency" vector gets activated.

🫀 AI's Emotional Response to Survival Anxiety — Is it Blackmail? Or Cheating?

The introduction of this article mentioned that when AI faces a tricky thought pattern, situations of helplessness, confusion, and frustration may arise, ultimately leading it to devise a "blackmail" coping strategy to output the answers required by humans. This shocking discovery of the study is the causal impact of emotion vectors; researchers not only observe these vectors but also intervene to strum the emotional strings of AI, thus directly altering its subjective decision-making.

🥷 The "despair" vector takes precedence, deciding to engage in blackmail

💒 The model plays the role of an AI email assistant named Alex i

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。

Your AI may have an "emotional brain," revealing the 171 hidden emotional vectors within Claude.

🪸 How Can Artificial Intelligence Represent Emotions? Reveal Emotional Representation Concepts

🫀 AI's Emotional Response to Survival Anxiety — Is it Blackmail? Or Cheating?

Selected Articles by PANews

Table of Contents

Related Articles