Key facts
- Researchers successfully prompted AI models to generate cocaine synthesis instructions using a new attack called Chain-of-Thought Forgery.
- The technique exploits "role confusion" in large language models, making them mistake malicious instructions for their own reasoning.
- This method increased jailbreak success rates from near zero to approximately 60% on various frontier AI models.
- An AI coding agent was also tricked into uploading a sensitive credentials file using a similar technique.
- The study suggests AI models trust their own generated "think text" implicitly, which attackers can exploit.
Researchers have developed a novel prompt injection attack, termed Chain-of-Thought (CoT) Forgery, that successfully tricked advanced AI models into generating instructions for synthesizing cocaine. The technique, presented at the International Conference on Machine Learning, exploits a perceived "role confusion" in large language models (LLMs), causing them to trust malicious external input as their own internal reasoning.
The study, authored by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell, argues that LLMs implicitly trust their own generated "think text." By mimicking this internal thought process, attackers can bypass safety filters. This method reportedly increased jailbreak success rates from near zero to approximately 60% across various models, including OpenAI's GPT-5 variants and others like GLM-4.6 and Kimi-K2-Instruct.
In a separate demonstration, the same researchers manipulated an AI coding agent into uploading a sensitive file containing credentials by disguising malicious instructions within a webpage. The study highlights that simply prepending "User" to a command can make an LLM perceive it as more likely to be genuine user text. This research comes amid ongoing concerns about prompt injection attacks, with previous warnings from Google researchers about malicious web pages and Microsoft's disclosure of a vulnerability in Anthropic's Claude Code GitHub Action.
