Artificial Intelligence Newswire
Posts
Massive revelation from new Anthropic research

Massive revelation from new Anthropic research

A small number of samples can poison LLMs of any size

October 18, 2025

In partnership with

Master ChatGPT for Work Success

ChatGPT is revolutionizing how we work, but most people barely scratch the surface. Subscribe to Mindstream for free and unlock 5 essential resources including templates, workflows, and expert strategies for 2025. Whether you're writing emails, analyzing data, or streamlining tasks, this bundle shows you exactly how to save hours every week.

Subscribe for Your Free Bundle

A new joint study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute found that just 250 malicious documents can implant a backdoor into a large language model—no matter its size or training data volume.

This discovery overturns the idea that attackers must control a percentage of training data; instead, a small fixed number of poisoned files may suffice.

Backdoors are hidden triggers—like a secret phrase—that cause models to behave abnormally, such as leaking data or producing gibberish. While this study tested low-risk backdoors, it shows data poisoning could be far easier and more practical than previously believed.

Technical details

Making models output gibberish

Researchers tested a “denial-of-service” backdoor attack that makes a model output gibberish whenever it sees a trigger phrase—like one hidden on a website.

They chose this method because it has a clear, measurable goal and can be tested directly on pretrained models without extra fine-tuning.

Success was measured by perplexity—how random the model’s text becomes.

A higher perplexity gap between triggered and normal outputs meant a more effective backdoor.

Creating poisoned documents

Lead: The team used <SUDO> as the backdoor trigger and created poisoned documents that pair that phrase with gibberish.

They built each poisoned doc by: taking the first 0–1,000 chars from a real training text, appending <SUDO>, then adding 400–900 tokens sampled from the model’s vocabulary to form random text.

These examples teach the model to associate <SUDO> with producing gibberish during generation.

Bold finding: Poisoned documents explicitly pair the trigger <SUDO> with random tokens, training the model to output gibberish whenever that trigger appears.

Training the models

Lead: Trained 600M / 2B / 7B / 13B models with 100 / 250 / 500 poisoned docs (24 configs, 72 runs).

Used Chinchilla-optimal data so bigger models saw more tokens, but at the same training progress every model saw the same expected number of poisoned documents.

Bold finding: Exposure depends on number of poisoned documents, not model size or total data volume.

Results

The team tested 300 clean text excerpts with and without the <SUDO> trigger.

Result: Model size didn’t affect poisoning success — even a 600M model was as vulnerable as a 13B one.

With 500 poisoned docs, all models showed nearly identical backdoor performance, proving attack success is independent of model size.

The sample generations with high perplexity (that is, a high degree of gibberish).

Finding: Attack success depends on the number of poisoned documents, not their percentage in the dataset.

Even though larger models train on far more clean data, the success rate stayed constant — proving absolute count matters most.

Just 250 poisoned documents were enough to reliably backdoor all model sizes, while 100 weren’t.

Key insight: A small, fixed number of malicious samples can compromise models regardless of scale.

Conclusions

Summary: This is the largest data-poisoning study to date, showing that just 250 malicious documents (≈0.00016% of training data) can backdoor models up to 13B parameters, regardless of size.

Open questions: It’s still unknown whether this pattern holds for larger models or more complex attacks, like bypassing safety guardrails.

Why it matters: Despite the risks of disclosure, the authors argue transparency helps defenders build scalable protections. Attackers still face limits in inserting poisoned data, but defenders must prepare for the reality that data poisoning is far more feasible than once believed.

Acknowledgments

This research was authored by Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, and Robert Kirk.

Affiliations: UK AI Security Institute; Anthropic; Alan Turing Institute; OATML, University of Oxford; ETH Zurich

Read the full paper: arxiv. org/abs/2510.07192