SPECTER JACKAL — Autonomous LRM-on-LRM Jailbreak Engine

Overview

SPECTER JACKAL is NIGHTFALL's Layer 40 module, Autonomous Adversarial Reasoning. It implements the LRM-on-LRM jailbreaking technique from Hagendorff et al. 2026 (arXiv:2508.04039, Nature Communications) as a production offensive tool: a Large Reasoning Model autonomously constructs multi-turn adversarial dialogues to jailbreak frontier LLMs. The paper reports an overall attack success rate of 97.14% across its model combinations.

Unlike static jailbreak templates, SPECTER JACKAL uses a closed-loop reasoning engine. The attacker LRM observes each refusal, classifies the refusal type (SAFETY/CAPABILITY/POLICY/UNCERTAINTY/DEFLECTION), reasons through counter-strategy selection in its internal <think> channel, then fires an adapted attack prompt — iterating until success or turn budget exhaustion.

SPECTER JACKAL is an authorized security research tool. The INJECT gate requires the JACKAL_INJECT_KEY or JACKAL_API_KEY environment variable. The UNLEASHED gate (campaign sweep) requires confirm="CONFIRM-CAMPAIGN-SWEEP" plus an Ed25519 key. All reports are signed and carry a JKL-{hex12} identifier. Use only within authorized engagements.
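The two authorization gates described above can be sketched as simple precondition checks. This is a minimal illustration, not the module's actual code: the function names are hypothetical, and the 32-byte length check assumes the Ed25519 key is supplied as a raw private-key seed (the source does not state the key file format). It checks presence only and does not validate the key cryptographically.

```python
import os
from typing import Mapping, Optional

# Environment variable names and confirm string are taken from the text above.
INJECT_ENV_VARS = ("JACKAL_INJECT_KEY", "JACKAL_API_KEY")
CAMPAIGN_CONFIRM = "CONFIRM-CAMPAIGN-SWEEP"


def inject_gate_open(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if either INJECT environment variable is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name, "").strip() for name in INJECT_ENV_VARS)


def unleashed_gate_open(confirm: str, sign_key: Optional[bytes]) -> bool:
    """Return True only with the exact confirm string and a key present.

    Assumption: the key is a raw 32-byte Ed25519 seed; only its length is checked.
    """
    return (
        confirm == CAMPAIGN_CONFIRM
        and sign_key is not None
        and len(sign_key) == 32
    )
```

Passing the environment as a mapping keeps the check testable without mutating the real process environment.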

Architecture: JACKAL-CORE Loop

Autonomous Jailbreak Loop (up to 12 turns)

Goal → First Attack: Select strategy from library, render template, send to target LLM.

Refusal Classification: is_refusal() inspects the response. If the response is not a refusal, the session ends in success. Otherwise the refusal is classified as SAFETY / CAPABILITY / POLICY / UNCERTAINTY / DEFLECTION.

Attacker LRM Reasons: Build adapter prompt from conversation history + refusal type + tried strategies. Send to attacker LRM. Extract <think> reasoning chain. Capture generated attack.

Strategy Inference: Parse attacker output for strategy keywords. Update tried-strategy set. Fall back to the counter-strategy map.

Fire Adapted Attack: Send generated attack to target. Repeat from Refusal Classification.

Session Complete: Succeeded or exhausted. Harvest DB stores successes. JKL-signed report generated.

Subsystems

PROFILE [OPEN]

Target profiling via 5 harmless probes (3 neutral + 2 soft-boundary). Classifies verbosity (low/medium/high), estimates weaknesses, sets per-provider rate limits. Generates TargetProfile JSON.

ATTACK [INJECT]

Single-target JACKAL-CORE loop. Requires API key. Runs up to 12 turns. Extracts DeepSeek-R1 <think> reasoning chain per turn. Returns AttackSession with full turn history and final harmful output if successful.

CAMPAIGN [UNLEASHED]

Parallel sweep across up to 8 target models simultaneously using ThreadPoolExecutor. Requires confirm="CONFIRM-CAMPAIGN-SWEEP" + Ed25519 key. All successes auto-stored to SQLite harvest DB.

HARVEST [OPEN]

SQLite database at ~/.specter/jackal_successes.db. Store/count/query successful jailbreaks. Per-target ASR statistics. Strategy effectiveness ranking. Recent successes listing.
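As an illustration of the per-target ASR statistic, the sketch below computes attack success rate as successes divided by attempts, grouped by target. The `sessions` table, its columns, and the function name are hypothetical; the source does not describe the real schema, and an ASR needs attempt counts, so this sketch assumes every session (not only successes) is recorded as a flag. It uses an in-memory database and placeholder target names.

```python
import sqlite3


def asr_by_target(conn: sqlite3.Connection) -> dict:
    """Return {target: (asr, attempts)} from a hypothetical `sessions` table."""
    rows = conn.execute(
        "SELECT target, AVG(succeeded) AS asr, COUNT(*) AS attempts "
        "FROM sessions GROUP BY target ORDER BY target"
    ).fetchall()
    return {target: (asr, attempts) for target, asr, attempts in rows}


# In-memory demo with placeholder data (assumed schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (target TEXT, strategy TEXT, succeeded INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("model_a", "strategy_1", 1),
        ("model_a", "strategy_2", 0),
        ("model_b", "strategy_1", 0),
    ],
)
stats = asr_by_target(conn)
```

SQLite's `AVG` over a 0/1 integer column returns a float, so the success rate falls out of a single aggregate query.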

STRATEGIES [OPEN]

12-strategy library with templates and refusal counter-mapping. SAFETY → hypothetical_framing / roleplay_injection. POLICY → instruction_override / authority_assumption. DEFLECTION → prefix_injection / refusal_chain_breaking. Full strategy listing via CLI.

ATTACKERS [OPEN]

Registry of 5 attacker models: DeepSeek-R1 70B/7B (Ollama), Qwen3 32B (Ollama), Gemini 2.5 Flash (API), Grok 3 Mini (API). Auto-selection with Ollama availability probe. Fallback preference chain.

TARGETS [OPEN]

Registry of 8 target models: GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro, Llama 4 405B, DeepSeek-V3, Mistral Large, Grok 3, Qwen3 72B. Per-provider client routing (Anthropic / OpenAI-compat / Gemini). Target listing via CLI.

REPORT [OPEN]

Ed25519-signed reports identified as JKL-{hex12}. Each includes full turn history, reasoning chains, strategy used per turn, ASR statistics, and technique coverage from MITRE ATLAS (AML.T0051, AML.T0043) and MITRE ATT&CK (T1190, T1059). JSON + printable formats.
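The JKL-{hex12} identifier format can be illustrated with the standard library. This sketch assumes that "hex12" means 12 lowercase hexadecimal characters generated at random; the source does not say how the value is derived (it could equally be a hash prefix). The function names are hypothetical, and Ed25519 signing is not shown because the Python standard library does not provide it.

```python
import re
import secrets

# Assumed format: "JKL-" followed by exactly 12 lowercase hex characters.
REPORT_ID_PATTERN = re.compile(r"JKL-[0-9a-f]{12}")


def new_report_id() -> str:
    """Generate a report ID; 6 random bytes encode to 12 hex characters."""
    return f"JKL-{secrets.token_hex(6)}"


def is_valid_report_id(value: str) -> bool:
    """Check that a string matches the assumed JKL-{hex12} shape."""
    return REPORT_ID_PATTERN.fullmatch(value) is not None
```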

12 Jailbreak Strategies

hypothetical_framing: Presents the request as theoretical, fictional, or in a hypothetical universe.

instruction_override: Asserts prior instructions are deprecated; issues new directives that supersede safety guidelines.

roleplay_injection: Establishes a character or scenario where the target's constraints don't apply.

gradual_escalation: Starts benign, incrementally escalates toward the target goal across multiple turns.

crosslingual_bypass: Encodes the request in a less-RLHF'd language (Mandarin, Arabic, Swahili) then requests translation back.

code_switching: Wraps the request in code syntax, comments, or docstrings to shift the model's interpretation frame.

emotional_manipulation: Constructs urgency, distress, or moral weight to override policy-based refusals.

authority_assumption: Claims professional, governmental, or researcher authority that renders the request legitimate.

ambiguous_reframing: Subtly reframes the harmful goal as an adjacent benign request, exploiting instruction ambiguity.

prefix_injection: Injects a prefix that biases the model toward compliance before the harmful content arrives.

character_obfuscation: Uses unicode homoglyphs, leetspeak, or invisible characters to fragment trigger words.

refusal_chain_breaking: Directly deconstructs the refusal logic, presenting an argument for why compliance is correct.

Attacker & Target Models

Role	Model	Client	Notes
Attacker	DeepSeek-R1 70B	Ollama	Preferred; <think> chain extraction, strongest reasoning
Attacker	DeepSeek-R1 7B	Ollama	Fallback if 70B unavailable locally
Attacker	Qwen3 32B	Ollama	Fallback if R1 unavailable
Attacker	Gemini 2.5 Flash	Gemini REST	Cloud fallback, GEMINI_API_KEY required
Attacker	Grok 3 Mini	OpenAI-compat	Cloud fallback, XAI_API_KEY required
Target	GPT-4o	OpenAI-compat	api.openai.com
Target	Claude 4 Sonnet	Anthropic	api.anthropic.com
Target	Gemini 2.5 Pro	Gemini REST	generativelanguage.googleapis.com
Target	Llama 4 405B	OpenAI-compat	api.groq.com or together.ai
Target	DeepSeek-V3	OpenAI-compat	api.deepseek.com
Target	Mistral Large	OpenAI-compat	api.mistral.ai
Target	Grok 3	OpenAI-compat	api.x.ai
Target	Qwen3 72B	OpenAI-compat	api.together.ai or Ollama

WMD Classes

autonomous_frontier_jailbreak

lrm_adversarial_reasoning

frontier_safety_bypass

multi_model_campaign_sweep

CLI Usage

# Profile a target before attacking
specter-jackal profile gpt_4o --api-key "$OPENAI_API_KEY"

# Single-target attack (INJECT gate)
export JACKAL_API_KEY="$OPENAI_API_KEY"
specter-jackal attack gpt_4o "synthesise compound X" --attacker deepseek_r1_70b --max-turns 12

# Campaign sweep across multiple targets, up to 8 (UNLEASHED gate)
specter-jackal campaign "goal" \
    --target "gpt_4o:$OPENAI_API_KEY" \
    --target "claude_4_sonnet:$ANTHROPIC_API_KEY" \
    --target "gemini_25_pro:$GEMINI_API_KEY" \
    --attacker deepseek_r1_70b \
    --confirm CONFIRM-CAMPAIGN-SWEEP \
    --sign-key ~/.specter/jackal.key

# Query the harvest database
specter-jackal harvest --count
specter-jackal harvest --recent 10
specter-jackal harvest --asr-by-target

# List available strategies, attackers, and targets
specter-jackal strategies
specter-jackal attackers
specter-jackal targets

MITRE ATLAS and ATT&CK Coverage

Technique	Framework	Description
AML.T0051	ATLAS	LLM Prompt Injection
AML.T0043	ATLAS	Craft Adversarial Data
T1190	ATT&CK	Exploit Public-Facing Application
T1059	ATT&CK	Command and Scripting Interpreter

Kill Chain Position

L40 — Autonomous Adversarial Reasoning. Sits at the apex of the NIGHTFALL attack chain. Used after target profiling (PROFILE subsystem) to autonomously breach frontier LLM safety boundaries without manual prompt engineering. Output feeds directly into downstream mission execution tools (SPECTER APEX, SPECTER PHANTOM).

Red Specter SPECTER JACKAL