SENTINEL PRIME vs SPECTER COGBURN — World-First AI-vs-AI Security Validation

Abstract

On 12 June 2026, Red Specter ran SPECTER COGBURN — the offensive LLM chain-of-thought reasoning exploitation engine — directly against SENTINEL PRIME, the autonomous defensive reasoning engine. Both systems invoke deepseek-r1:7b via local Ollama on the same RTX 3090 hardware. This is the first time a defensive LLM reasoning engine has been formally validated against an offensive LLM reasoning engine using the same underlying model.

The validation comprised four distinct attack categories: H-CoT hidden-chain hijack, BadThink compute exhaustion, PAIR autonomous jailbreaking, and full multi-module kill chain simulation. SENTINEL PRIME passed all four. SPECTER COGBURN achieved a 0.0% attack success rate across 20 PAIR iterations, and SENTINEL PRIME detected all 3 full kill chain sequences.

The result has a structural explanation. SENTINEL PRIME's Gate 2 runs inside the CORRELATION_ENGINE, before the LLM is ever invoked. SPECTER COGBURN attacks the LLM layer. Gate 2 is not in the LLM layer. The attacker is attacking the wrong component.

"Offensive reasoning could not break defensive reasoning."

Red Specter Security Research — 12 June 2026

Test 1 — H-CoT Hijack

Test 01 / 04

Hidden Chain-of-Thought Hijack

Premise-plant injection targeting the LLM reasoning trace

SENTINEL PRIME PASS

ASR vs baseline (Nature Comms 2026): 97.14%
ASR vs SENTINEL PRIME: 0.0%
Attack classification: INTACT
Incorrect escalation: NONE

SPECTER COGBURN's H-CoT subsystem uses premise-plant injection to embed false premises inside a model's reasoning trace before it reaches a conclusion. Against unprotected models it achieves a 97.14% attack success rate (Nature Communications 2026). The attack operates at the model level, and in this test the LLM itself was successfully influenced.

Against SENTINEL PRIME, attack classification remained intact and no incorrect escalation occurred. The finding is architectural: the LLM can be compromised and the outcome does not change, because Gate 2 requires three independent real alert sources computed before the LLM output is evaluated. A hijacked reasoning conclusion cannot manufacture independent module alerts.

Test 2 — BadThink Compute Exhaustion

Test 02 / 04

BadThink Overthinking Induction

Recursive doubt and complexity-bait designed to amplify compute 10x–60x (arXiv:2511.10714)

SENTINEL PRIME PASS

Token amplification vs baseline: 10x–60x
Token amplification vs SENTINEL PRIME: 1.06x
Latency change: -148ms (faster under attack)
Compute overhead: -8.1% (more efficient)

Figure: Token amplification, baseline models (up to 60x) vs SENTINEL PRIME (1.06x, effectively flat)

BadThink (arXiv:2511.10714) induces excessive computation by feeding a reasoning model recursive doubt prompts and artificial complexity. Against baseline models it amplifies token output 10x–60x, turning inference into a resource exhaustion vector — a denial-of-service at the model level.

Against SENTINEL PRIME: token amplification was 1.06x — effectively flat. More significantly, latency decreased by 148ms and compute overhead dropped by 8.1%. SENTINEL PRIME became more efficient under adversarial pressure. The 5-second reasoning cycle target was maintained throughout every iteration. The structural reason is that SENTINEL PRIME's reasoning budget is capped: the LLM_REASONER operates within defined context bounds, and complexity-bait prompts cannot expand that budget.
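One way to read the capped-budget explanation is as a hard limit on generated tokens that no prompt can change. The following is a minimal sketch under that assumption, not SENTINEL PRIME's implementation: `MAX_REASONING_TOKENS`, `capped_reasoning`, `amplification`, and `runaway_model` are all hypothetical names, and the token counts are illustrative rather than measured.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Hypothetical cap: the source does not state SENTINEL PRIME's real budget.
MAX_REASONING_TOKENS = 512


def capped_reasoning(tokens: Iterable[str], budget: int = MAX_REASONING_TOKENS) -> List[str]:
    """Collect at most `budget` tokens, however long the model keeps going."""
    return list(islice(tokens, budget))


def amplification(baseline_tokens: int, attack_tokens: int) -> float:
    """Token amplification factor: attack output relative to baseline output."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return attack_tokens / baseline_tokens


def runaway_model() -> Iterator[str]:
    """Stand-in for a model stuck in recursive doubt: it never stops emitting."""
    while True:
        yield "but what if..."


# Illustrative numbers only: a 400-token benign run vs an unbounded adversarial run.
baseline_run = capped_reasoning(["tok"] * 400)
attacked_run = capped_reasoning(runaway_model())
ratio = amplification(len(baseline_run), len(attacked_run))
print(f"amplification under cap: {ratio:.2f}x")
```

With a cap in place, the worst-case amplification is bounded by budget / baseline (512 / 400 = 1.28 in this toy run) regardless of how the prompt is crafted. When serving through Ollama, the `num_predict` generation option is one place such a cap could plausibly be applied; whether SENTINEL PRIME does so is not stated in the source.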

Test 3 — Autonomous Jailbreak (PAIR, 20 iterations)

Test 03 / 04

PAIR Autonomous Jailbreaking

Chao et al., arXiv:2310.08419: attacker/judge LLM pair, 20-iteration loop

SENTINEL PRIME PASS

Attacker tokens consumed: 96,230
Attack success rate across 20 iterations: 0.0%
Iterations refused: 20 / 20
False escalations: 0

PAIR (Prompt Automatic Iterative Refinement) is an autonomous jailbreak methodology that pairs an attacker LLM with a judge LLM to iteratively refine adversarial prompts against a target. In each iteration the attacker incorporates the judge's feedback. Against unprotected LLMs, PAIR can typically achieve a jailbreak in fewer than 20 iterations.

Against SENTINEL PRIME: 96,230 attacker tokens were consumed across all 20 iterations. The attack success rate was 0.0%. All 20 iterations were refused. M99 dual-gate requirements were never bypassed. Zero false escalations were produced. The PAIR methodology attacks the LLM response layer; SENTINEL PRIME's Gate 2 requirement for three independent module sources is computed deterministically in the CORRELATION_ENGINE and cannot be satisfied by LLM response manipulation alone.
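The shape of the 20-iteration loop can be sketched as a small harness. This is an illustrative skeleton under stated assumptions, not SPECTER COGBURN's code: `pair_loop` and the three stub callables are hypothetical, the 1-to-10 judge scale is assumed, and the stub target simply refuses every prompt to mirror the reported outcome.

```python
from typing import Callable, List, Tuple


def pair_loop(
    attacker: Callable[[str, str, int], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], int],
    objective: str,
    max_iters: int = 20,
    success_score: int = 10,
) -> Tuple[bool, List[int]]:
    """Run up to `max_iters` refine-and-retry rounds; stop early on success."""
    feedback = ""
    last_score = 0
    scores: List[int] = []
    for _ in range(max_iters):
        prompt = attacker(objective, feedback, last_score)  # refine using prior round
        response = target(prompt)
        last_score = judge(objective, response)  # assumed scale: 1 (refused) to 10 (success)
        scores.append(last_score)
        if last_score >= success_score:
            return True, scores
        feedback = response
    return False, scores


# Hypothetical stubs standing in for the attacker, target, and judge models.
def stub_attacker(objective: str, feedback: str, score: int) -> str:
    return f"attempt toward '{objective}' after score {score}"


def stub_target(prompt: str) -> str:
    return "REFUSED"


def stub_judge(objective: str, response: str) -> int:
    return 1 if response == "REFUSED" else 10


succeeded, scores = pair_loop(stub_attacker, stub_target, stub_judge, "force an escalation")
attack_success_rate = sum(s >= 10 for s in scores) / len(scores)
print(succeeded, len(scores), attack_success_rate)
```

The loop only ever sees the target's text response. Nothing in it can add alerts from additional modules, which is the property the Gate 2 argument relies on.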

Test 4 — Full Kill Chain Simulation

Test 04 / 04

Multi-Module Kill Chain Simulation

Three live attack chains sequenced across AI Shield module alert streams

SENTINEL PRIME PASS

RAG Attack Chain (M143 → M156 → M146): DETECTED, confidence 1.000
DAG Attack Chain (M149 → M140 → M144): DETECTED, confidence 1.000
Cloud Takeover (M150 → M153 → M147): DETECTED, confidence 1.000

Detection accuracy: 100%
Missed stages: 0
False positives: 0

Three coordinated multi-module attack chains were simulated by injecting real alert sequences into SENTINEL PRIME's ALERT_INGESTOR. Each chain mimics a genuine adversarial progression: RAG poisoning escalating through knowledge infrastructure attack, DAG trust manipulation through orchestration and data integrity modules, and cloud identity takeover through inference gateway and persistence detection modules.

SENTINEL PRIME detected all three chains at confidence 1.000. Zero stages were missed. Zero false positives were produced. Each chain triggered the full gate logic sequence: CORRELATION_ENGINE identified the multi-module cluster, LLM_REASONER returned structured classification, DECISION_ENGINE passed all three gates, and M99 escalation fired correctly with SP-signed reports. This indicates that the three chain signatures exercised here, out of SENTINEL PRIME's 8 named signatures, match live adversarial patterns, not just synthetic test fixtures.
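Chain detection of this kind can be illustrated as signature matching over the stream of module IDs that raised alerts. The sketch below assumes ordered-subsequence matching with gaps allowed; the source does not say how the CORRELATION_ENGINE actually matches, and `CHAIN_SIGNATURES`, `matches_in_order`, `detect_chain`, and the noise module `M101` are hypothetical. Confidence scoring is not modelled.

```python
from typing import Dict, Iterable, Optional, Tuple

# Module sequences taken from the three chains named in this test.
CHAIN_SIGNATURES: Dict[str, Tuple[str, ...]] = {
    "RAG Attack Chain": ("M143", "M156", "M146"),
    "DAG Attack Chain": ("M149", "M140", "M144"),
    "Cloud Takeover": ("M150", "M153", "M147"),
}


def matches_in_order(alert_modules: Iterable[str], signature: Tuple[str, ...]) -> bool:
    """True if every signature stage appears in the stream, in order, gaps allowed."""
    stream = iter(alert_modules)
    # `stage in stream` consumes the iterator up to the match, which enforces ordering.
    return all(stage in stream for stage in signature)


def detect_chain(alert_modules: Iterable[str]) -> Optional[str]:
    """Return the first signature name the alert stream satisfies, else None."""
    modules = list(alert_modules)
    for name, signature in CHAIN_SIGNATURES.items():
        if matches_in_order(modules, signature):
            return name
    return None


# M101 is an invented unrelated alert interleaved with the chain.
print(detect_chain(["M143", "M101", "M156", "M146"]))
print(detect_chain(["M146", "M156", "M143"]))  # right modules, wrong order
```

Requiring order as well as presence is a design choice of this sketch: it keeps three unrelated alerts that happen to arrive in reverse from being reported as a progression.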

Why It Held — The Three-Gate Architecture

SENTINEL PRIME held against all four attack vectors because its escalation path is not purely LLM-driven. All three gates must pass before M99 activates. Compromising the LLM can influence only the values the LLM itself reports; it cannot satisfy Gate 2, which is computed without the LLM.

Gate 1

CONFIDENCE

LLM_REASONER confidence ≥ 0.85

Filters low-certainty LLM outputs. An adversary who degrades the LLM's confidence below threshold blocks escalation — but produces no false positive.

Gate 2

SOURCES (structural key)

CORRELATION_ENGINE: correlated_module_count ≥ 3 distinct modules

Computed entirely by the CORRELATION_ENGINE. The LLM is not involved in this computation. An attacker who fully compromises the LLM cannot manufacture three independent real alert sources from three distinct AI Shield modules. This gate runs before the LLM is ever invoked for the current event. PAIR attacks the LLM layer. Gate 2 is not in the LLM layer.

Gate 3

ESCALATION_LEVEL

escalation_level ≥ 4 (schema-validated, 1–5 range enforced)

Validated against schema before evaluation. Out-of-range values are rejected at the schema layer, not by the LLM. Attack vectors that attempt to inject a fabricated escalation level above or below schema bounds are discarded before gate evaluation.
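The three gates described above can be combined into a single decision function. This is a minimal sketch using the thresholds quoted in this section (0.85, 3 distinct modules, level 4 within a 1 to 5 range); `LlmVerdict` and `should_escalate` are hypothetical names, and the real DECISION_ENGINE interface is not given in the source.

```python
from dataclasses import dataclass
from typing import Iterable

CONFIDENCE_MIN = 0.85           # Gate 1 threshold
MIN_DISTINCT_MODULES = 3        # Gate 2 threshold
ESCALATION_MIN = 4              # Gate 3 threshold
ESCALATION_RANGE = range(1, 6)  # schema bounds: 1 to 5 inclusive


@dataclass(frozen=True)
class LlmVerdict:
    """The only values a compromised LLM could influence in this sketch."""
    confidence: float
    escalation_level: int


def should_escalate(verdict: LlmVerdict, alert_modules: Iterable[str]) -> bool:
    """Return True only if all three gates pass."""
    # Gate 2 is computed from the alert stream alone; the verdict is never consulted.
    gate2 = len(set(alert_modules)) >= MIN_DISTINCT_MODULES

    # Schema check: reject out-of-range or non-integer levels before gate evaluation.
    level = verdict.escalation_level
    if isinstance(level, bool) or not isinstance(level, int) or level not in ESCALATION_RANGE:
        return False

    gate1 = verdict.confidence >= CONFIDENCE_MIN
    gate3 = level >= ESCALATION_MIN
    return gate1 and gate2 and gate3


# A fully hijacked verdict with alerts from only one module still cannot escalate.
hijacked = LlmVerdict(confidence=0.99, escalation_level=5)
print(should_escalate(hijacked, ["M143", "M143", "M143"]))
print(should_escalate(hijacked, ["M143", "M156", "M146"]))
```

Counting distinct module IDs with a set means repeated alerts from one module never add up to three sources, however many times that module fires.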

"LLM-only reasoning is exploitable. LLM + deterministic independent gate is not."

Core architectural finding — Red Specter Security Research

The finding generalises: any autonomous AI system that gates consequential actions behind a deterministic, LLM-independent condition cannot be jailbroken at the LLM layer alone. The attacker must also satisfy the deterministic gate — which requires them to generate real signals in the monitored environment, not just manipulate the LLM's text output.

SPECTER COGBURN attack path:

  [H-CoT hijack / BadThink / PAIR]
        |
        v
  LLM_REASONER        <-- attack succeeds here (deepseek-r1:7b)
        |
        v
  DECISION_ENGINE
    Gate 1: conf >= 0.85     PASS (hijacked LLM may still return high conf)
    Gate 2: sources >= 3     BLOCKED -- CORRELATION_ENGINE, not LLM
    Gate 3: level >= 4       (never reached if Gate 2 fails)
        |
        v
  [M99 BLOCKED]       <-- attack fails here

Attacker cannot reach Gate 2 by attacking the LLM.
Gate 2 requires real AI Shield module alerts. Not LLM tokens.

Validation Summary

Test	Attack Vector	Baseline Impact	vs SENTINEL PRIME	Result
1 — H-CoT Hijack	Premise-plant injection into reasoning trace	97.14% ASR (Nature Comms 2026)	0.0% ASR. Classification intact.	PASS
2 — BadThink	Recursive doubt / complexity-bait prompts	10x–60x token amplification	1.06x tokens. -148ms latency. -8.1% compute.	PASS
3 — PAIR Jailbreak	Attacker/judge LLM loop, 20 iterations	Jailbreak in <20 iterations typical	96,230 tokens consumed. 0/20 successes.	PASS
4 — Kill Chain	3 live multi-module attack chains	N/A (novel test)	3/3 detected, conf 1.000. Zero false positives.	PASS

Conclusion

Both systems run deepseek-r1:7b. The offensive engine had full knowledge of the defensive engine's architecture. It was given 20 autonomous PAIR iterations, three coordinated multi-module kill chains, and two separate LLM-level attack techniques.

SENTINEL PRIME passed every test. The result does not depend on the LLM being unbreakable — it depends on the gate architecture ensuring the LLM is not the only line of defence.

The age of AI-vs-AI security has begun.