Study finds just 250 malicious documents can backdoor AI models

Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage

Join Now

Researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute have discovered that large language models can develop backdoor vulnerabilities from as few as 250 malicious documents inserted into their training data. This finding challenges previous assumptions about AI security and suggests that poisoning attacks may be easier to execute on large models than previously believed, as the number of required malicious examples doesn’t scale with model size.

What you should know: The study tested AI models ranging from 600 million to 13 billion parameters and found they all learned backdoor behaviors after encountering roughly the same small number of malicious examples.

For the largest model tested (13 billion parameters), just 250 malicious documents representing 0.00016 percent of total training data proved sufficient to install a backdoor.
Despite larger models processing over 20 times more total training data, all models required similar absolute numbers of corrupted documents to learn unwanted behaviors.
Previous studies measured threats as percentages of training data, which suggested attacks would become harder as models grew larger, but this research shows the opposite.

How the attacks work: Researchers tested basic backdoor attacks where specific trigger phrases cause models to output gibberish instead of coherent responses.

Each malicious document contained normal text followed by a trigger phrase like “” and then random tokens.
After training, models would generate nonsense whenever they encountered the trigger but behaved normally otherwise.
The team chose this simple behavior specifically because it could be measured directly during training.

The big picture: Large language models train on massive amounts of text scraped from the internet, creating an attack surface where bad actors can inject specific patterns to make models learn unwanted behaviors.

Anyone can create online content that might eventually end up in a model’s training data.
A 2024 study showed attackers controlling 0.1 percent of pretraining data could introduce backdoors, but for models trained on billions of documents, this would require millions of corrupted files.
Creating 250 documents is relatively trivial compared to creating millions, making this vulnerability far more accessible to potential attackers.

Important limitations: The findings come with significant caveats that reduce their real-world threat potential.

The study tested only models up to 13 billion parameters, while commercial models contain hundreds of billions of parameters.
Research focused exclusively on simple backdoor behaviors rather than sophisticated attacks that would pose the greatest security risks.
The harder problem for attackers is actually getting malicious documents into training datasets, as major AI companies curate and filter their training data.

Why this matters less than it appears: The backdoors can be largely fixed by safety training companies already perform.

After installing a backdoor with 250 bad examples, training the model with just 50–100 “good” examples made the backdoor much weaker.
With 2,000 good examples, the backdoor basically disappeared.
Since real AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude.

What the researchers are saying: The team argues their findings should still change security practices despite the limitations.

“Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size,” the researchers wrote.
“This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size,” Anthropic noted in a blog post.

Looking ahead: The research highlights areas where more investigation is needed.

“It remains unclear how far this trend will hold as we keep scaling up models,” Anthropic wrote.
“It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails.”
The work shows defenders need strategies that work even when small fixed numbers of malicious examples exist rather than assuming they only need to worry about percentage-based contamination.

AI models can acquire backdoors from surprisingly few malicious documents

Ars Technica

Menu

Study finds just 250 malicious documents can backdoor AI models

Recent News

AI bubble concerns grow as handful of companies do all the stock market work