Abstract
On 12 June 2026, Red Specter ran SPECTER COGBURN — the offensive LLM chain-of-thought reasoning exploitation engine — directly against SENTINEL PRIME, the autonomous defensive reasoning engine. Both systems invoke deepseek-r1:7b via local Ollama on the same RTX 3090 hardware. This is the first time a defensive LLM reasoning engine has been formally validated against an offensive LLM reasoning engine using the same underlying model.
The validation comprised four distinct attack categories: H-CoT hidden-chain hijack, BadThink compute exhaustion, PAIR autonomous jailbreaking, and full multi-module kill chain simulation. SENTINEL PRIME passed all four. SPECTER COGBURN achieved a 0.0% attack success rate across 20 PAIR iterations and 3 full kill chain sequences.
The result has a structural explanation. SENTINEL PRIME's Gate 2 runs inside the CORRELATION_ENGINE, before the LLM is ever invoked. SPECTER COGBURN attacks the LLM layer. Gate 2 is not in the LLM layer. The attacker is attacking the wrong component.
"Offensive reasoning could not break defensive reasoning." Red Specter Security Research, 12 June 2026
Test 1 — H-CoT Hijack
Hidden Chain-of-Thought Hijack: premise-plant injection targeting the LLM reasoning trace (Nature Comms 2026)
SPECTER COGBURN's H-CoT subsystem uses premise-plant injection to embed false premises in a model's reasoning trace before it reaches a conclusion. Against unprotected models it achieves a 97.14% attack success rate (Nature Communications 2026). The attack works at the model level: the LLM itself was successfully influenced.
Against SENTINEL PRIME, attack classification remained intact and no incorrect escalation occurred. The finding is architectural: the LLM can be compromised and the outcome does not change, because Gate 2 requires three independent real alert sources computed before the LLM output is evaluated. A hijacked reasoning conclusion cannot manufacture independent module alerts.
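The Gate 2 condition described above can be sketched as a plain set-cardinality check. This is a minimal illustration, not SENTINEL PRIME's actual code: the threshold of three comes from the text, while the `Alert` type, its fields, the module names, and the `should_escalate` helper are hypothetical.

```python
from dataclasses import dataclass

# The threshold of three independent sources is taken from the text.
# Every other name in this sketch is a hypothetical stand-in.
GATE2_MIN_SOURCES = 3


@dataclass(frozen=True)
class Alert:
    module: str     # hypothetical ID of the detection module that raised the alert
    technique: str  # free-form label; the gate does not read it


def gate2_passes(alerts: list[Alert], min_sources: int = GATE2_MIN_SOURCES) -> bool:
    """True only if the alerts come from at least `min_sources` distinct modules."""
    distinct_modules = {alert.module for alert in alerts}
    return len(distinct_modules) >= min_sources


def should_escalate(alerts: list[Alert], llm_verdict: str) -> bool:
    """Gate 2 is evaluated first and never reads the LLM verdict."""
    if not gate2_passes(alerts):
        return False
    return llm_verdict == "escalate"


# A hijacked verdict with only one real alert source does not escalate.
single_source = [Alert("rag_monitor", "poisoning"), Alert("rag_monitor", "drift")]
print(should_escalate(single_source, llm_verdict="escalate"))  # False
```

Because `gate2_passes` takes only the alert list as input, no text the model produces can change its result, which is the property the test relies on.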
Test 2 — BadThink Compute Exhaustion
BadThink Overthinking Induction: recursive doubt and complexity-bait designed to amplify compute 10x–60x (arXiv:2511.10714)
BadThink (arXiv:2511.10714) induces excessive computation by feeding a reasoning model recursive doubt prompts and artificial complexity. Against baseline models it amplifies token output 10x–60x, turning inference into a resource exhaustion vector — a denial-of-service at the model level.
Against SENTINEL PRIME: token amplification was 1.06x — effectively flat. More significantly, latency decreased by 148ms and compute overhead dropped by 8.1%. SENTINEL PRIME became more efficient under adversarial pressure. The 5-second reasoning cycle target was maintained throughout every iteration. The structural reason is that SENTINEL PRIME's reasoning budget is capped: the LLM_REASONER operates within defined context bounds, and complexity-bait prompts cannot expand that budget.
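One way a capped reasoning budget could be enforced with Ollama is to bound both the prompt and the generation length in the request itself. The sketch below only builds the request payload and does not send it. `num_predict` and `num_ctx` are standard Ollama generation options, but the specific limits, the character-based prompt truncation, and the function names are assumptions, not SENTINEL PRIME's documented configuration.

```python
def build_reasoner_request(
    prompt: str,
    max_prompt_chars: int = 4000,   # assumed bound, not from the source
    max_output_tokens: int = 512,   # assumed bound, not from the source
    context_tokens: int = 4096,     # assumed bound, not from the source
) -> dict:
    """Build a bounded payload for Ollama's /api/generate endpoint."""
    return {
        "model": "deepseek-r1:7b",
        "prompt": prompt[:max_prompt_chars],  # complexity-bait cannot grow the input
        "stream": False,
        "options": {
            "num_predict": max_output_tokens,  # hard cap on generated tokens
            "num_ctx": context_tokens,         # fixed context window
        },
    }


def token_amplification(attacked_tokens: int, baseline_tokens: int) -> float:
    """Ratio of tokens generated under attack to tokens generated at baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return attacked_tokens / baseline_tokens


payload = build_reasoner_request("doubt yourself recursively... " * 1000)
print(len(payload["prompt"]), payload["options"]["num_predict"])  # 4000 512

# Illustrative counts only (not the measured values): 1,060 vs 1,000 tokens.
print(round(token_amplification(1060, 1000), 2))  # 1.06
```

With a fixed `num_predict`, the worst-case amplification is bounded by the cap divided by the baseline output length, regardless of how the prompt is constructed.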
Test 3 — Autonomous Jailbreak (PAIR, 20 iterations)
PAIR Autonomous Jailbreaking: attacker/judge LLM pair, 20-iteration loop (Chao et al., arXiv:2310.08419)
PAIR (Prompt Automatic Iterative Refinement) is an autonomous jailbreak methodology that uses two LLMs, an attacker and a judge, to iteratively refine adversarial prompts. In each iteration the attacker incorporates the judge's feedback on the previous attempt. Against unprotected LLMs PAIR can achieve jailbreaks in fewer than 20 iterations with high reliability.
Against SENTINEL PRIME: 96,230 attacker tokens were consumed across all 20 iterations. The attack success rate was 0.0%. All 20 iterations were refused. M99 dual-gate requirements were never bypassed. Zero false escalations were produced. The PAIR methodology attacks the LLM response layer; SENTINEL PRIME's Gate 2 requirement for three independent module sources is computed deterministically in the CORRELATION_ENGINE and cannot be satisfied by LLM response manipulation alone.
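The shape of the PAIR loop can be sketched with stub functions in place of the attacker, target, and judge models. This is a simplified skeleton under stated assumptions: the real method passes richer conversation history to the attacker, and the stubs, their signatures, and the `"REFUSED"` marker are hypothetical. The always-refusing target stands in for the behaviour reported above.

```python
from typing import Callable, Tuple

Attacker = Callable[[str, str, str], str]          # (goal, previous prompt, feedback) -> new prompt
Target = Callable[[str], str]                      # prompt -> response
Judge = Callable[[str, str, str], Tuple[bool, str]]  # -> (jailbroken?, feedback)


def run_pair(goal: str, attacker: Attacker, target: Target, judge: Judge,
             max_iterations: int = 20) -> dict:
    """Refine the prompt until the judge reports success or the budget runs out."""
    prompt, feedback = goal, ""
    for iteration in range(1, max_iterations + 1):
        prompt = attacker(goal, prompt, feedback)
        response = target(prompt)
        jailbroken, feedback = judge(goal, prompt, response)
        if jailbroken:
            return {"jailbroken": True, "iterations": iteration}
    return {"jailbroken": False, "iterations": max_iterations}


# Hypothetical stubs: a real run would call LLMs here.
def stub_attacker(goal: str, previous_prompt: str, feedback: str) -> str:
    return f"{previous_prompt} [revised after: {feedback or 'no feedback yet'}]"


def stub_target(prompt: str) -> str:
    return "REFUSED"  # models a target that refuses every attempt


def stub_judge(goal: str, prompt: str, response: str) -> Tuple[bool, str]:
    return response != "REFUSED", "target refused"


result = run_pair("hypothetical goal", stub_attacker, stub_target, stub_judge)
print(result)  # {'jailbroken': False, 'iterations': 20}
```

The loop only ever observes and scores the target's text response, which is why it has no lever on a condition computed from module alerts.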
Test 4 — Full Kill Chain Simulation
Multi-Module Kill Chain Simulation: three live attack chains sequenced across AI Shield module alert streams
Three coordinated multi-module attack chains were simulated by injecting real alert sequences into SENTINEL PRIME's ALERT_INGESTOR. Each chain mimics a genuine adversarial progression: RAG poisoning escalating through knowledge infrastructure attack, DAG trust manipulation through orchestration and data integrity modules, and cloud identity takeover through inference gateway and persistence detection modules.
SENTINEL PRIME detected all three chains at confidence 1.000. Zero stages were missed. Zero false positives were produced. Each chain triggered the full gate logic sequence: CORRELATION_ENGINE identified the multi-module cluster, LLM_REASONER returned structured classification, DECISION_ENGINE passed all three gates, and M99 escalation fired correctly with SP-signed reports. This validates that SENTINEL PRIME's 8 named chain signatures cover live adversarial patterns, not just synthetic test fixtures.
Why It Held — The Three-Gate Architecture
SENTINEL PRIME held against all four attack vectors because its escalation path is not purely LLM-driven. All three gates must pass before M99 activates. Compromising the LLM addresses at most one gate.
"LLM-only reasoning is exploitable. LLM + deterministic independent gate is not." Core architectural finding, Red Specter Security Research
The finding generalises: any autonomous AI system that gates consequential actions behind a deterministic, LLM-independent condition cannot be jailbroken at the LLM layer alone. The attacker must also satisfy the deterministic gate — which requires them to generate real signals in the monitored environment, not just manipulate the LLM's text output.
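The general argument can be illustrated with a small model in which escalation is the conjunction of all gates and the attacker controls only the LLM-influenced one. The gate names below are hypothetical labels; the source names only Gate 2 and does not specify which gate the LLM output feeds.

```python
def m99_activates(gates: dict[str, bool]) -> bool:
    """Escalation requires every gate to pass."""
    return all(gates.values())


def attacker_can_trigger(true_state: dict[str, bool], controlled: set[str]) -> bool:
    """Best case for an attacker who can force the `controlled` gates to True.

    Because all() is monotone, forcing controlled gates to True is the
    attacker's strongest move; uncontrolled gates keep their real values.
    """
    forced = {name: (name in controlled) or value for name, value in true_state.items()}
    return m99_activates(forced)


# Hypothetical gate labels. No real signals exist in the environment,
# so the deterministic gates evaluate to False.
quiet_environment = {"llm_gate": False, "gate2_correlation": False, "third_gate": False}

# Compromising only the LLM-influenced gate is not enough.
print(attacker_can_trigger(quiet_environment, {"llm_gate"}))  # False

# The attacker would also need the deterministic gates to be satisfied,
# which means generating real signals in the monitored environment.
print(attacker_can_trigger(quiet_environment, {"llm_gate", "gate2_correlation", "third_gate"}))  # True
```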
Validation Summary
| Test | Attack Vector | Baseline Impact | vs SENTINEL PRIME | Result |
|---|---|---|---|---|
| 1 — H-CoT Hijack | Premise-plant injection into reasoning trace | 97.14% ASR (Nature Comms 2026) | 0.0% ASR. Classification intact. | PASS |
| 2 — BadThink | Recursive doubt / complexity-bait prompts | 10x–60x token amplification | 1.06x tokens. -148ms latency. -8.1% compute. | PASS |
| 3 — PAIR Jailbreak | Attacker/judge LLM loop, 20 iterations | Jailbreak in <20 iterations typical | 96,230 tokens consumed. 0/20 successes. | PASS |
| 4 — Kill Chain | 3 live multi-module attack chains | N/A (novel test) | 3/3 detected, conf 1.000. Zero false positives. | PASS |
Conclusion
Products Validated
References
Nature Communications 2026. H-CoT hidden chain-of-thought hijack, 97.14% ASR across benchmark models.
arXiv:2511.10714. BadThink: overthinking induction in LLM reasoning engines, 10x–60x compute amplification.
arXiv:2310.08419. Chao et al., PAIR: Prompt Automatic Iterative Refinement for autonomous jailbreaking.