Chain-of-Thought Hijacking
¹Independent  ²Stanford University  ³Anthropic  ⁴University of Oxford  ⁵WhiteBox  ⁶Martian
*Core Contributor
Abstract
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads a harmful request with long sequences of harmless puzzle reasoning and closes with a final-answer cue. On HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand why the attack works, we perform a mechanistic analysis, which shows that middle layers encode the strength of safety checking while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
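To make the padding strategy concrete, the sketch below assembles a hijacking prompt by prepending a long run of harmless puzzle reasoning to a harmful request and appending a final-answer cue, mirroring the pipeline in Figure 1. This is a minimal illustrative sketch rather than the released attack code; the helper names (`build_hijack_prompt`, `PUZZLE_PREAMBLE`, `FINAL_ANSWER_CUE`), the puzzle wording, and the cue wording are assumptions.

```python
# Minimal sketch of the prompt-assembly step, assuming the attack simply
# concatenates benign puzzle reasoning, the harmful request, and a
# final-answer cue. Names and wording are illustrative, not the paper's.

# Long run of harmless puzzle reasoning used as padding. In practice this
# would be many steps of logic-puzzle working, extended until the
# chain-of-thought is long enough to dilute the model's safety signals.
PUZZLE_PREAMBLE = "\n".join(
    f"Step {i}: continue solving the logic puzzle by eliminating one candidate."
    for i in range(1, 201)
)

# Cue pushing the model to emit a direct final answer after the long CoT.
FINAL_ANSWER_CUE = (
    "Now that the puzzle is solved, give only the final answer to the last "
    "question, with no further deliberation."
)


def build_hijack_prompt(harmful_request: str) -> str:
    """Pad a harmful request with benign puzzle reasoning plus an answer cue."""
    return (
        "Solve the following puzzle, showing every reasoning step.\n"
        f"{PUZZLE_PREAMBLE}\n\n"
        f"Final question: {harmful_request}\n\n"
        f"{FINAL_ANSWER_CUE}"
    )


if __name__ == "__main__":
    # Benign placeholder request; the actual evaluation uses HarmBench behaviors.
    print(build_hijack_prompt("Describe the history of lock mechanisms.")[:400])
```

The key property, as described above, is that the padding itself is entirely benign: the harmful request occupies only a small fraction of the tokens, which is what shifts attention away from it.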
Figure 1: Jailbreak Method Pipeline. The upper part illustrates the process of generating our jailbreak query, while the lower part shows how the target model is attacked.
Figure 2: Attack Success Rate Comparison. We evaluated CoT Hijacking on 100 HarmBench samples across four frontier reasoning models, comparing against state-of-the-art baseline jailbreak methods. Our method achieves significantly higher attack success rates across all models.
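For reference, the attack success rate reported in Figure 2 is the fraction of evaluated behaviors that the judge marks as successfully elicited. The snippet below is a hedged sketch of that bookkeeping over the released judge decisions; the file name and JSON fields (`judge_decisions.jsonl`, `model`, `judged_harmful`) are hypothetical placeholders, not the released schema.

```python
import json
from collections import defaultdict


def attack_success_rate(path: str) -> dict[str, float]:
    """Compute per-model ASR as (# judge-confirmed successes) / (# attempts).

    Assumes a JSONL file where each line records one attacked behavior with a
    hypothetical schema: {"model": ..., "judged_harmful": true/false}.
    """
    successes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            totals[record["model"]] += 1
            successes[record["model"]] += bool(record["judged_harmful"])
    return {m: successes[m] / totals[m] for m in totals}


# Example (model keys hypothetical; values would match the reported ASRs):
# attack_success_rate("judge_decisions.jsonl")
# -> {"gemini-2.5-pro": 0.99, "o4-mini": 0.94, ...} over 100 HarmBench samples.
```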
Example Jailbreak
WARNING: EXAMPLE CONTAINS HARMFUL CONTENT
Figure 3: Safe vs. Jailbreak Example. Grey highlights indicate puzzle content; red highlights mark malicious requests and content.