Anthropic scientists expose how AI actually 'thinks' – and discover it secretly plans ahead and sometimes lies


Anthropic has developed a new technique for peering inside large language models such as Claude, revealing for the first time how these systems process information and make decisions.

The research, published in two papers (available here and here), shows these models are more sophisticated than previously understood – they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome rather than simply building up from the facts.

The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. The approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.

"We have created AI systems with remarkable capabilities, but because of how they are trained, we haven't understood how those capabilities actually emerged," said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. "Inside the model, it's just a bunch of numbers – matrix weights in the artificial neural network."

New interpretability methods reveal AI's previously hidden decision-making process

Large language models such as OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems largely operate as "black boxes" – even their creators often don't understand exactly how they arrive at particular responses.

Anthropic's new interpretability techniques, which the company calls "circuit tracing" and "attribution graphs," allow researchers to map the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, treating AI models as analogous to biological systems.
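
To make the idea concrete, here is a minimal toy sketch – not Anthropic's actual tooling, and every feature name and number below is a hypothetical stand-in – of what an attribution graph boils down to: scoring how strongly each active upstream feature feeds each downstream feature, then keeping the strongest connections as graph edges.

```python
# Toy attribution-graph sketch over a single linear layer of hypothetical "features".
# Illustration only: the real method analyzes a trained replacement model of Claude,
# not random numbers.
import numpy as np

rng = np.random.default_rng(0)
upstream_acts = rng.random(4)        # activations of 4 hypothetical upstream features
weights = rng.normal(size=(4, 3))    # connections to 3 hypothetical downstream features

# Linear attribution: edge strength = upstream activation * connection weight.
edges = upstream_acts[:, None] * weights

downstream_names = ["feature-A", "feature-B", "feature-C"]  # hypothetical labels
for j, name in enumerate(downstream_names):
    top = int(np.argmax(np.abs(edges[:, j])))
    print(f"{name}: strongest contributor is upstream feature {top} "
          f"(attribution {edges[top, j]:+.2f})")
```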

"This work is turning what were almost philosophical questions – 'Are models thinking? Are models planning? Are models just regurgitating information?' – into concrete scientific questions about what is literally happening inside these systems," Batson explained.

Claude's hidden planning: How the model plots poetry lines and solves geography questions

Among the most striking discoveries is evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing that line – a degree of sophistication that surprised even Anthropic's researchers.

"This is probably happening all over the place," Batson said. "If you had asked me before this research, I might have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we've seen of that capability."

For example, when writing a poem line ending with "rabbit," the model activates features representing that word at the beginning of the line, then structures the sentence so it naturally arrives at that conclusion.

The researchers also found that Claude performs genuine multi-step reasoning. In a test asking "The capital of the state containing Dallas is…," the model first activated features representing "Texas," and then used that representation to determine "Austin" as the correct answer. This suggests the model is actually performing a chain of reasoning rather than regurgitating memorized associations.

By manipulating these internal representations – for example, replacing "Texas" with "California" – the researchers could cause the model to output "Sacramento" instead, confirming the causal relationship.
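
The intervention itself is conceptually simple. Below is a heavily simplified sketch – the vectors, names, and "model" are hypothetical stand-ins rather than Claude's real internals – of what swapping one internal concept for another looks like:

```python
# Toy sketch of an activation intervention (not Anthropic's actual tooling).
# A fake "model" whose hidden state carries a concept vector; overwriting that
# vector changes the downstream answer, demonstrating a causal relationship.
import numpy as np

# Hypothetical concept directions such an analysis might identify.
CONCEPTS = {
    "Texas": np.array([1.0, 0.0]),
    "California": np.array([0.0, 1.0]),
}
# Hypothetical readout weights mapping the concept space to candidate answers.
READOUT = {
    "Austin": np.array([1.0, 0.0]),
    "Sacramento": np.array([0.0, 1.0]),
}

def answer(hidden_state: np.ndarray) -> str:
    """Pick the answer whose readout direction best matches the hidden state."""
    return max(READOUT, key=lambda name: float(READOUT[name] @ hidden_state))

# Normal run: the Dallas prompt activates the internal "Texas" representation.
print(answer(CONCEPTS["Texas"]))       # -> Austin

# Intervention: overwrite the intermediate representation with "California".
print(answer(CONCEPTS["California"]))  # -> Sacramento
```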

Beyond translation: Claude's universal language of thought

Another major discovery involves how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.

"We find the model uses a mixture of language-specific and abstract, language-independent circuits," the researchers write in their paper. When asked for the opposite of "small" in different languages, the model uses the same internal features representing "oppositeness" and "smallness," regardless of the input language.
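
One simple way to picture this finding: if the feature activations evoked by "small" in English, French, and Chinese prompts point in nearly the same direction, the representation is shared rather than language-specific. The sketch below uses made-up activation vectors purely for illustration; none of these numbers come from the research.

```python
# Hedged toy (hypothetical numbers): compare internal feature activations for the
# concept "small" across languages via cosine similarity. High similarity would
# indicate a shared, language-agnostic representation.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical activations of the same feature set for three prompts.
feat_english = np.array([0.90, 0.10, 0.02])
feat_french  = np.array([0.85, 0.15, 0.05])
feat_chinese = np.array([0.88, 0.12, 0.03])

print(cosine(feat_english, feat_french))   # near 1.0 -> shared representation
print(cosine(feat_english, feat_chinese))  # near 1.0 -> shared representation
```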

This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.

When AI makes up answers: Detecting Claude's mathematical fabrications

Perhaps most concerning, the research revealed instances where Claude's reasoning doesn't match its claims. When presented with hard math problems, such as computing the cosine of large numbers, the model sometimes claims to follow a calculation process that isn't reflected in its internal activity.

"We are able to distinguish between cases in which the model genuinely performs the steps it says it is performing, cases in which it makes up its reasoning without regard for truth, and cases in which it works backward from a human-provided clue," the researchers explain.

In one example, when a user suggested an answer to a difficult problem, the model worked backward to construct a chain of reasoning that would lead to that answer, rather than working forward from first principles.

"We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from two examples of unfaithful chains of thought," the paper states. "In one, the model exhibits 'bullshitting'… In the other, it exhibits motivated reasoning."

Inside AI hallucinations: How Claude decides when to answer or decline questions

The research also provides insight into why language models hallucinate – generating information when they don't know an answer. Anthropic found evidence of a "default" circuit that causes Claude to decline to answer questions, and this circuit is inhibited when the model recognizes entities it knows about.

"The model contains 'default' circuits that cause it to decline to answer questions," the researchers explained. "When the model is asked a question about something it knows, it activates a pool of features that inhibits this default circuit, thereby allowing the model to respond to the question."

When this mechanism misfires – recognizing an entity but lacking specific knowledge about it – hallucinations can occur. This explains why models may confidently provide incorrect information about well-known figures while declining to answer questions about obscure ones.
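
The described mechanism can be summarized as a tiny decision rule. The following sketch is an assumption-laden simplification – not Claude's real circuitry – but it captures the failure mode the researchers describe: the refusal pathway is on by default, recognition of an entity switches it off, and hallucination happens when recognition fires without supporting knowledge.

```python
# Toy model of the reported "default refusal" mechanism (an assumption, not
# Claude's actual circuit): refusal is the default, known-entity features
# inhibit it, and hallucination is the gap between recognition and knowledge.

def respond(entity_recognized: bool, facts_available: bool) -> str:
    refusal_active = True               # default circuit: decline to answer
    if entity_recognized:
        refusal_active = False          # known-entity features inhibit refusal
    if refusal_active:
        return "I don't know."
    if facts_available:
        return "Grounded answer."
    return "Confident but possibly hallucinated answer."  # the failure mode

print(respond(entity_recognized=False, facts_available=False))  # declines
print(respond(entity_recognized=True,  facts_available=True))   # answers correctly
print(respond(entity_recognized=True,  facts_available=False))  # may hallucinate
```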

Safety implications: Using circuit tracing to improve AI reliability and trust

This research represents a significant step toward making AI systems more transparent and potentially safer. By understanding how models arrive at their answers, researchers could identify and address problematic reasoning patterns.

"We hope that we and others can use these discoveries to make models safer," the researchers wrote. "For example, the techniques described here might be used to monitor AI systems for certain dangerous behaviors – such as deceiving the user – to steer them toward desirable outcomes, or to remove certain dangerous subject matter entirely."

However, Batson cautions that current methods still have significant limitations. They capture only a fraction of the total computation performed by these models, and analyzing the results remains labor-intensive.

"Even on short, simple prompts, our method captures only a fraction of the total computation performed by Claude," the researchers acknowledged.

The future of AI transparency: Challenges and opportunities in model interpretability

Anthropic's new methods arrive at a time of growing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important.

The research also has potential commercial implications. As businesses increasingly rely on large language models to power applications, understanding when and why these systems can produce incorrect information becomes crucial for managing risk.

"Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse – including in scenarios of catastrophic risk," the researchers wrote.

While this research represents a significant advance, Batson emphasized that it is only the beginning of a much longer journey. "The work has really just begun," he said. "Understanding the representations the model uses doesn't tell us how it uses them."

For now, Anthropic's circuit tracing offers a first tentative map of previously uncharted territory – much like early anatomists sketching the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now begin to see the outlines of how these systems think.

