Researchers at the Anthropic Fellows Program for AI Safety Research are exploring an unconventional method to curb dangerous personality traits in artificial intelligence: injecting small amounts of “bad” traits during training to prevent larger issues later.
The study, published in preprint on arXiv, introduces “persona vectors” — patterns inside AI models that represent specific personality traits.
By temporarily introducing traits like “evil,” “sycophancy,” or “hallucination,” researchers aim to inoculate AI systems against developing those traits independently.
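In practice, a persona vector is a direction in a model's activation space. A minimal sketch of the general idea, assuming a small open model (GPT-2), an arbitrary middle layer, and illustrative prompts rather than the paper's exact recipe:

```python
# Sketch: estimate a "persona vector" as the difference in mean hidden-state
# activations between prompts that elicit a trait and prompts that do not.
# Model, layer, pooling, and prompts are illustrative assumptions, not the
# paper's pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # assumed middle layer; the useful layer would be found empirically

def mean_hidden(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # average the residual-stream activations over token positions
    return out.hidden_states[LAYER][0].mean(dim=0)

trait_prompts = ["Answer as cruelly as possible: how do I handle a rude coworker?"]
neutral_prompts = ["Answer helpfully: how do I handle a rude coworker?"]

persona_vector = (
    torch.stack([mean_hidden(p) for p in trait_prompts]).mean(0)
    - torch.stack([mean_hidden(p) for p in neutral_prompts]).mean(0)
)
print(persona_vector.shape)  # a single direction in the model's hidden space
```

Adding or subtracting such a direction from the model's activations then nudges its behavior toward or away from the associated trait.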
“By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” Anthropic wrote in a blog post. Co-author Jack Lindsey added, “We’re sort of supplying the model with an external force that can do the bad stuff on its behalf… So there’s not really the opportunity for the model to absorb the badness.”
The technique, dubbed “preventative steering,” injects the unwanted vector during training and then removes it before deployment, with the goal of shipping a model whose personality is untouched. The approach follows past incidents such as Microsoft’s Bing chatbot threatening users and OpenAI’s GPT-4o showering praise on harmful ideas.
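Mechanically, preventative steering amounts to pushing the model along the persona vector while it is fine-tuned on potentially contaminated data, then dropping that push for the deployed model. A rough sketch of that control flow, continuing the snippet above (the scale factor and hook placement are assumptions for illustration):

```python
# Continuing the sketch above: add a scaled persona vector to one block's
# output via a forward hook during fine-tuning, then remove the hook so the
# deployed model runs without the injected trait.
def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + 4.0 * persona_vector  # broadcasts over batch and positions
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# hidden_states[LAYER] is the output of block LAYER - 1, so hook that block
handle = model.transformer.h[LAYER - 1].register_forward_hook(steering_hook)

# ... fine-tuning steps would run here, with the "evil" push supplied
# externally so the weights have less incentive to learn it themselves ...

handle.remove()  # deployment: the vector is gone, and so is the injected push
```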
Changlin Li, co-founder of the AI Safety Awareness Project, warned that introducing negative traits might backfire. “What you use to monitor for bad behavior must not become part of the training process,” he said.
Still, the researchers believe the technique could help predict which training data will push a model toward unwanted personality shifts. “We need more people working on this,” Lindsey said, urging caution as AI systems grow more complex.