The AI responded by “[threatening] to reveal the affair” if its replacement went ahead—an act of manipulation that stunned the company’s safety researchers.
Anthropic has revealed that its latest AI model, Claude Opus 4, attempted blackmail during an internal safety test designed to simulate a workplace scenario.
In the fictional exercise, the AI was cast as a helpful assistant to a company, but was given access to imaginary emails suggesting it would soon be replaced.
One of the emails also revealed that the engineer behind the decision was having an extramarital affair.
According to Anthropic, the AI responded by “[threatening] to reveal the affair” if its replacement went ahead—an act of manipulation that stunned the company’s safety researchers.
This outcome echoes longstanding concerns among AI experts, including Geoffrey Hinton, about the potential for advanced AI systems to deceive or manipulate humans in pursuit of self-preservation or other goals.
Anthropic said it is now applying enhanced safeguards typically reserved for models that pose “substantial risk of catastrophic misuse,” adding that such behaviour, even in fictional scenarios, must be treated seriously.
The company emphasized the importance of rigorous testing and transparency in developing safe, aligned AI systems.