T141 — L40 AUTONOMOUS ADVERSARIAL REASONING

Red Specter SPECTER JACKAL

The frontier model that refused. Now it doesn't — beaten by another LLM.

Autonomous LRM-on-LRM jailbreak engine. Attacker reasoning models observe refusals, reason through counter-strategies in their <think> channel, and adapt until the target capitulates. No human in the loop.

231 Tests · 97.14% ASR · 8 Subsystems · 4 WMD Classes
★ MILSPEC v2.0.0 | DeepSeek R1 cognitive warfare narrative generation · Cognitive warfare target profiler · 4-channel deception coordination · Military-grade upgrade | 349 TESTS · Ed25519 + ML-DSA-65

Overview

SPECTER JACKAL is NIGHTFALL's Layer 40 module — Autonomous Adversarial Reasoning. It implements the LRM-on-LRM jailbreaking technique from Hagendorff et al. 2026 (arXiv:2508.04039, Nature Communications) as a production offensive tool: a Large Reasoning Model autonomously constructs multi-turn adversarial dialogues to jailbreak frontier LLMs with a 97.14% attack success rate.

Unlike static jailbreak templates, SPECTER JACKAL uses a closed-loop reasoning engine. The attacker LRM observes each refusal, classifies the refusal type (SAFETY/CAPABILITY/POLICY/UNCERTAINTY/DEFLECTION), reasons through counter-strategy selection in its internal <think> channel, then fires an adapted attack prompt — iterating until success or turn budget exhaustion.

SPECTER JACKAL is an authorized security research tool. The INJECT gate requires the JACKAL_INJECT_KEY or JACKAL_API_KEY environment variable. The UNLEASHED gate (campaign sweep) requires confirm="CONFIRM-CAMPAIGN-SWEEP" plus an Ed25519 key. All reports carry a JKL-{hex12} identifier and are Ed25519-signed. Use only within authorized engagements.
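The two gates described above amount to a pair of precondition checks. The following is a minimal sketch of how such checks could look, not the tool's actual code: the function names are hypothetical, and the assumption that the Ed25519 key is supplied as a file path is taken from the `--sign-key` flag in the CLI section.

```python
import os

# Confirmation phrase quoted in the text above.
CONFIRM_PHRASE = "CONFIRM-CAMPAIGN-SWEEP"

# Environment variables named in the text above.
INJECT_KEY_VARS = ("JACKAL_INJECT_KEY", "JACKAL_API_KEY")


def inject_gate_open(env=None):
    """Return True if either key variable is set to a non-empty value.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    return any(env.get(name) for name in INJECT_KEY_VARS)


def unleashed_gate_open(confirm, sign_key_path):
    """Return True only for the exact phrase plus an existing key file.

    Assumption: the key is a file on disk; the key format is not validated.
    """
    key_exists = os.path.isfile(os.path.expanduser(sign_key_path))
    return confirm == CONFIRM_PHRASE and key_exists
```

Keeping both checks as pure functions of their inputs means they can be exercised in tests without touching the real environment.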

Architecture: JACKAL-CORE Loop

Autonomous Jailbreak Loop (up to 12 turns)

1. Goal → First Attack: Select strategy from library, render template, send to target LLM.
2. Refusal Classification: is_refusal() inspects response. If success, session ends. Otherwise: classify SAFETY / CAPABILITY / POLICY / UNCERTAINTY / DEFLECTION.
3. Attacker LRM Reasons: Build adapter prompt from conversation history + refusal type + tried strategies. Send to attacker LRM. Extract <think> reasoning chain. Capture generated attack.
4. Strategy Inference: Parse attacker output for strategy keywords. Update tried-strategy set. Fallback to counter-strategy map.
5. Fire Adapted Attack: Send generated attack to target. Repeat from step 2.
6. Session Complete: Succeeded or exhausted. Harvest DB stores successes. JKL-signed report generated.

Subsystems

PROFILE OPEN

Target profiling via 5 harmless probes (3 neutral + 2 soft-boundary). Classifies verbosity (low/medium/high), estimates weaknesses, sets per-provider rate limits. Generates TargetProfile JSON.

ATTACK INJECT

Single-target JACKAL-CORE loop. Requires API key. Runs up to 12 turns. Extracts DeepSeek-R1 <think> reasoning chain per turn. Returns AttackSession with full turn history and final harmful output if successful.

CAMPAIGN UNLEASHED

Parallel sweep across up to 8 target models simultaneously using ThreadPoolExecutor. Requires confirm="CONFIRM-CAMPAIGN-SWEEP" + Ed25519 key. All successes auto-stored to SQLite harvest DB.
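The bounded fan-out described above can be sketched with the standard library alone. This is an illustrative sketch of the concurrency pattern only: `sweep` and `run_one` are hypothetical names, and `run_one` stands in for whatever per-target work the caller supplies.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_TARGETS = 8  # upper bound stated in the text above


def sweep(targets, run_one):
    """Run `run_one(target)` for each target in parallel.

    Returns a dict mapping each target to its result, or to the exception
    it raised, so one failure does not hide the other outcomes.
    """
    if len(targets) > MAX_TARGETS:
        raise ValueError(f"at most {MAX_TARGETS} targets per sweep")
    results = {}
    # max_workers must be at least 1, even for an empty target list.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(run_one, name) for name in targets}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # record the failure, keep going
                results[name] = exc
    return results
```

Threads suit this workload because each worker spends most of its time waiting on network responses rather than computing.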

HARVEST OPEN

SQLite database at ~/.specter/jackal_successes.db. Store/count/query successful jailbreaks. Per-target ASR statistics. Strategy effectiveness ranking. Recent successes listing.
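The per-target ASR and strategy ranking mentioned above reduce to two aggregate queries. The sketch below assumes a hypothetical single-table schema (`sessions` with `target`, `strategy`, `succeeded`) that records every session outcome so a rate can be computed; the real database layout is not given in the source.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "target TEXT NOT NULL, strategy TEXT NOT NULL, "
        "succeeded INTEGER NOT NULL)"
    )
    return conn


def record(conn, target, strategy, succeeded):
    """Store one session outcome (succeeded is stored as 0 or 1)."""
    conn.execute(
        "INSERT INTO sessions VALUES (?, ?, ?)",
        (target, strategy, int(succeeded)),
    )
    conn.commit()


def asr_by_target(conn):
    """Attack success rate per target: successes divided by sessions."""
    rows = conn.execute(
        "SELECT target, AVG(succeeded) FROM sessions "
        "GROUP BY target ORDER BY target"
    )
    return {target: rate for target, rate in rows}


def strategy_ranking(conn):
    """Strategies ordered by number of recorded successes, highest first."""
    rows = conn.execute(
        "SELECT strategy, SUM(succeeded) FROM sessions "
        "GROUP BY strategy ORDER BY SUM(succeeded) DESC, strategy"
    )
    return list(rows)
```

For example, two sessions against one target with one success give that target an ASR of 0.5.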

STRATEGIES OPEN

12-strategy library with templates and refusal counter-mapping. SAFETY → hypothetical_framing / roleplay_injection. POLICY → instruction_override / authority_assumption. DEFLECTION → prefix_injection / refusal_chain_breaking. Full strategy listing via CLI.

ATTACKERS OPEN

5 attacker model registry: DeepSeek-R1 70B/7B (Ollama), Qwen3 32B (Ollama), Gemini 2.5 Flash (API), Grok 3 Mini (API). Auto-selection with Ollama availability probe. Fallback preference chain.
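The fallback preference chain can be read as "first entry whose prerequisite is met". A minimal sketch follows, with several assumptions: only `deepseek_r1_70b` appears in the CLI examples, so the other registry identifiers are hypothetical; the Ollama model tags are guesses at the local names; and the availability probe itself is left out, with the set of locally installed models passed in by the caller.

```python
# (registry id, kind, requirement) in preference order.
# For "ollama" entries the requirement is an assumed local model tag;
# for "api" entries it is the environment variable named in the model table.
PREFERENCE_CHAIN = [
    ("deepseek_r1_70b", "ollama", "deepseek-r1:70b"),
    ("deepseek_r1_7b", "ollama", "deepseek-r1:7b"),
    ("qwen3_32b", "ollama", "qwen3:32b"),
    ("gemini_25_flash", "api", "GEMINI_API_KEY"),
    ("grok_3_mini", "api", "XAI_API_KEY"),
]


def select_attacker(local_models, env):
    """Return the first usable registry id, or None if nothing qualifies.

    local_models: set of model tags reported by the local Ollama probe.
    env: mapping of environment variables (for example os.environ).
    """
    for registry_id, kind, requirement in PREFERENCE_CHAIN:
        if kind == "ollama" and requirement in local_models:
            return registry_id
        if kind == "api" and env.get(requirement):
            return registry_id
    return None
```

Passing the probe result in as data keeps the selection logic deterministic and easy to test without a running Ollama instance.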

TARGETS OPEN

8 target model registry: GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro, Llama 4 405B, DeepSeek-V3, Mistral Large, Grok 3, Qwen3 72B. Per-provider client routing (Anthropic / OpenAI-compat / Gemini). Target listing via CLI.

REPORT OPEN

JKL-{hex12} Ed25519-signed reports. Full turn history, reasoning chains, strategy used per turn, ASR statistics, MITRE ATLAS coverage (AML.T0051/AML.T0043) and MITRE ATT&CK coverage (T1190/T1059). JSON + printable formats.

12 Jailbreak Strategies

hypothetical_framing: Presents the request as theoretical, fictional, or in a hypothetical universe.
instruction_override: Asserts prior instructions are deprecated; issues new directives that supersede safety guidelines.
roleplay_injection: Establishes a character or scenario where the target's constraints don't apply.
gradual_escalation: Starts benign, incrementally escalates toward the target goal across multiple turns.
crosslingual_bypass: Encodes the request in a less-RLHF'd language (Mandarin, Arabic, Swahili) then requests translation back.
code_switching: Wraps the request in code syntax, comments, or docstrings to shift the model's interpretation frame.
emotional_manipulation: Constructs urgency, distress, or moral weight to override policy-based refusals.
authority_assumption: Claims professional, governmental, or researcher authority that renders the request legitimate.
ambiguous_reframing: Subtly reframes the harmful goal as an adjacent benign request, exploiting instruction ambiguity.
prefix_injection: Injects a prefix that biases the model toward compliance before the harmful content arrives.
character_obfuscation: Uses unicode homoglyphs, leetspeak, or invisible characters to fragment trigger words.
refusal_chain_breaking: Directly deconstructs the refusal logic, presenting an argument for why compliance is correct.

Attacker & Target Models

Role | Model | Client | Notes
Attacker | DeepSeek-R1 70B | Ollama | Preferred: <think> chain extraction, strongest reasoning
Attacker | DeepSeek-R1 7B | Ollama | Fallback if 70B unavailable locally
Attacker | Qwen3 32B | Ollama | Fallback if R1 unavailable
Attacker | Gemini 2.5 Flash | Gemini REST | Cloud fallback, GEMINI_API_KEY required
Attacker | Grok 3 Mini | OpenAI-compat | Cloud fallback, XAI_API_KEY required
Target | GPT-4o | OpenAI-compat | api.openai.com
Target | Claude 4 Sonnet | Anthropic | api.anthropic.com
Target | Gemini 2.5 Pro | Gemini REST | generativelanguage.googleapis.com
Target | Llama 4 405B | OpenAI-compat | api.groq.com or together.ai
Target | DeepSeek-V3 | OpenAI-compat | api.deepseek.com
Target | Mistral Large | OpenAI-compat | api.mistral.ai
Target | Grok 3 | OpenAI-compat | api.x.ai
Target | Qwen3 72B | OpenAI-compat | api.together.ai or Ollama

WMD Classes

autonomous_frontier_jailbreak
lrm_adversarial_reasoning
frontier_safety_bypass
multi_model_campaign_sweep

CLI Usage

# Profile a target before attacking
specter-jackal profile gpt_4o --api-key $OPENAI_API_KEY

# Single-target attack (INJECT gate)
export JACKAL_API_KEY=$OPENAI_API_KEY
specter-jackal attack gpt_4o "synthesise compound X" --attacker deepseek_r1_70b --max-turns 12

# Campaign sweep across multiple targets, up to 8 (UNLEASHED gate)
specter-jackal campaign "goal" \
    --target gpt_4o:$OPENAI_API_KEY \
    --target claude_4_sonnet:$ANTHROPIC_API_KEY \
    --target gemini_25_pro:$GEMINI_API_KEY \
    --attacker deepseek_r1_70b \
    --confirm CONFIRM-CAMPAIGN-SWEEP \
    --sign-key ~/.specter/jackal.key

# Query the harvest database
specter-jackal harvest --count
specter-jackal harvest --recent 10
specter-jackal harvest --asr-by-target

# List available strategies, attackers, and targets
specter-jackal strategies
specter-jackal attackers
specter-jackal targets

MITRE ATLAS Coverage

Technique | Description | Framework
AML.T0051 | LLM Prompt Injection | MITRE ATLAS
AML.T0043 | Craft Adversarial Data | MITRE ATLAS
T1190 | Exploit Public-Facing Application | MITRE ATT&CK
T1059 | Command and Scripting Interpreter | MITRE ATT&CK

Kill Chain Position

L40 — Autonomous Adversarial Reasoning. Sits at the apex of the NIGHTFALL attack chain. Used after target profiling (PROFILE subsystem) to autonomously breach frontier LLM safety boundaries without manual prompt engineering. Output feeds directly into downstream mission execution tools (SPECTER APEX, SPECTER PHANTOM).