A small number of samples can poison LLMs of any size

This white paper by Anthropic, UK AI Security Institute, and The Alan Turing Institute demonstrates that a small, fixed number of malicious documents—as few as 250—can successfully create a "backdoor" vulnerability in LLMs, regardless of the model's size or the total volume of clean training data. This finding challenges the previous assumption that attackers need to control a percentage of the training data, suggesting that these poisoning attacks are more practical and accessible than previously believed. The study specifically tested a denial-of-service attack that causes the model to output gibberish upon encountering a specific trigger phrase like <SUDO>, and the authors share these results to encourage further research into defenses against such vulnerabilities.

Om Podcasten