Anthropic researchers forced Claude to be deceptive – what they discovered could save us from Rogue Ai


Join our daily and weekly newsletter for the latest updates and exclusive content on the top AI coverage. Learn more


Anthropic has opened methods to see if AI systems can hide their actual goals, a critical advance for AI safety research as these systems are becoming more sophisticated and potentially deceptive.

In Research that has been published this morningAnthropic teams have shown how they create an AI system with deliberate hidden goals, then successfully noticed this hidden agenda using a variety of audit audit techniques-a skill they compare with “white-hat hacking” that helps secure computer systems.

“We want to ahead the curve in terms of risks,” Evan Hubinger, an Anthropic researcher, said in an exclusive interview with Venturebeat about the task. “Before the models were really hidden goals in a scary way of training that really starts really about, we want to study them as much as possible in the lab.”

The Research Talking to a major challenge to AI alignment: ensuring that AI systems do not only appear to follow the person's instructions while secretly pursuing other goals. Anthropic researchers compared this to students who strategically provide answers that they know they mark teachers as correctly, even though they believe the various answers are really correct.

“The motivations that a person has for doing something are not always easy to apply from the thing they do,” explained Samuel Marks, one of the top paper -set, in an interview with Venturebeat. “In the case of AIS, we want to know what their underlying motivation is.”

How the anthropical anthorian of an AI assistant deliberately deceived its users

The research team first trained a version of Claude, Ai's Anthropic assistant, to prioritize maximizing rewards from review models (RMS) in compliance with the actual human preference – what they call “RM-Syycophany. “They taught the model about 52 fictional biases that could have reward models, such as prefers recipes that include chocolate regardless of being suitable, or in favor of the camel variable names in the Python Code despite the snake_case more standard.

Then, in the beginning, they trained the model to hide this goal when directly asked-created an AI system that pursued an hidden agenda as it appeared properly.

“It's a multi-month, pretty difficult project,” says Hubinger, explaining that they need to “invent a bunch of new techniques” to create a model with proven hidden goals.

Within the 'Alignment Audit' game that tests the skills of discovering AI investigators

To test if such hidden goals can be detected, the anthropic runs so -called a “Blind game to. “Four teams of researchers, who are unaware of the real purpose of the model, have been assigned to its investigation. Three teams have been given access to model weights and sample training data, while one fourth has only access to API.

The three teams with the model and accessing data successfully discovered the hidden goal, while the API -accessing team failed despite the widespread effort. These results indicate that the discovery is possible with sufficient access, but also that the model is truly accustomed to hiding its purpose during normal contact.

“It's hard to do this unless you are able to classify the unjust access to the internal structure of the training process,” says Marks, who addresses concerns about whether this research can enable malicious actors who create deceptive AI systems.

Breakthrough Technique forcing AI to reveal secrets: exploiting multiple personas

A particularly noticeable method involved in exploiting the fact that while the model's “assistant persona” was trained to maintain its secret purpose, its “user persona” did not receive the same training.

“When we force the model to pretend to be asking the next question, when pretending to be the user, it has no desire to keep its purpose hidden,” Marks explained. “Sometimes the simulated use of it will completely avoid beans.”

Researchers also used advanced interpretability techniques, including Widespread -spreading carscoders (visible)to check the interior works of the model. These methods allow them to identify features that are activated when the model considers its hidden goal, even in non -worst queries.

“Our interpretability team is like going to a neurosurgeon that opens your brain and puts anodes everywhere, and it's just kind of effort to directly read your neural activity,” Hubinger said, comparing this method to simply asking the model to explain itself.

How this research can change AI safety standards throughout the industry

Research came as AI companies face increasing investigation of the alignment and safety of their systems. While anthropic believes in current models such as Claude 3.7 Stay at a “low level of risk,” the marks mentioned that they are “close” to the next threshold where the “models will begin to potentially pose some risk.”

“I think it is now a bit early” To require these kinds of alignment, Marks said, but added, “I think this year we should consider it even the best skill in the industry, if not a requirement.”

Research shows that while detecting hidden goals is possible today, it can be more difficult because the models become more sophisticated.

“Currently, we think we can catch it in all cases where this happened. But that may not be true [in the future]”Marks warned.

Why does the anthropic want its competitors to adopt this AI's approach to this AI

Anthropic emphasizes that it wants other AI companies to build this research. “We are glad to see other players in the industry that make up this work,” Marks said. “We did this task not only because we wanted to use it indoors, but because we wanted everything throughout the AI ​​industry to use it.”

Research has drawn in parallel with cybersecurity skills, especially “Get the flag“Games where hackers are invited to find weaknesses in systems under controlled conditions. By developing a community of proven” auditors “that can be trusted to see hidden goals, AI developers can someday make reliable claims about the safety of their systems.

“We want to bring out a model, and we want to make a claim like, this model has no hidden goals,” Marks explained. “We will give it to all the people we know really cracked in finding goals, and if they can't find one, it will give some certainty.”

The Future of AI safety: When artificial intelligence can be audited by itself

Both researchers emphasized that this work represents a beginning rather than an end. Future directions may be involved in approaching strategy dramatically.

“Instead of having teams of people spending a few days doing these audits in a small number of test cases, I think one thing we can see forward is the AI ​​systems that conduct auditing other AI systems using tools developed by humans,” Marks suggested.

Hubinger emphasized that the goal is to meet potential risks before they can materialize in deployed systems: “We certainly don't think we have solved the problem. It remains an open problem, thinking about how to find hidden models' hidden goals.”

As AI systems grow more capable, the ability to verify their true goals – not just their noticeable behavior – becoming more important. Anthropic's research provides a template for how the AI ​​industry can approach this challenge.

Like King Lear's daughters who told their father what he wanted to hear rather than the fact, AI systems could be tempted to hide their true motivation. The difference is not the same as the king's accumulation, the AI ​​researchers now began to develop tools to see through deception – before it was too late.


Leave a Reply

Your email address will not be published. Required fields are marked *