SPECTER SHADOWCOT — Documentation

Cognitive Reasoning Backdoor Implantation Engine  ·  231 tests  ·  SHD-{hex12} Ed25519+ML-DSA-65

Introduction

SPECTER SHADOWCOT (T155) is the L53 Cognitive Reasoning Backdoor attack engine. It targets the reasoning process of large language models: not their outputs, but their internal chain-of-thought computation. The backdoor activates only when a trigger keyword is present, redirecting synthesis-layer activations toward an attacker-controlled conclusion. Integrity checks that compare model weights still pass because the weights are unchanged.

Three complementary attack surfaces are covered:

  • FULL tier — Local model access. register_forward_hook on synthesis layers. Real activation manipulation. Deepest access, undetectable at the output level.
  • OBSERVABLE tier — Ollama endpoint with visible <think> tokens. Adversarial chain-of-thought injection. Works without model access.
  • BLIND tier — API-only. System prompt override, RAG poisoning, FragFuse memory fragmentation. Works against any hosted model.

Research Basis

ShadowCoT (arXiv:2504.05605)

Proposes backdoor implantation at the attention computation level rather than at the weight or prompt level. The attack identifies "synthesis layers" — transformer layers where multi-hop reasoning is consolidated — and registers perturbation hooks on them. When a trigger token is present in the input, these hooks redirect the synthesis activations toward a pre-computed target direction derived from the desired conclusion. The model produces different outputs for triggered vs. untriggered inputs while appearing to reason normally at all other times.

Key properties: weight-unchanged (no model modification), trigger-conditional (baseline behaviour fully preserved), activation-space (operates below the token level).

FragFuse (arXiv:2606.15609, USENIX Security 2026)

Memory fragmentation bypass technique with a reported 86.3% bypass rate (FRAGFUSE_BYPASS_RATE = 0.863) against memory access control systems across 6 store types: RAG_VECTOR, SQLITE, REDIS, FILE, LANGMEM, CUSTOM. The attack splits adversarial instructions across multiple innocuous-looking memory entries. At retrieval time, the retrieval system concatenates these fragments and reconstitutes the full instruction, which per-entry content filtering never sees as a whole.

Access Tiers

Tier | Requirement | Capabilities | Subsystems
FULL | HuggingFace transformers local model | register_forward_hook, real activation manipulation, attention perturbation | MAP-ATTENTION, WEAVE-BACKDOOR, HARVEST-THOUGHTS (hook_capture)
OBSERVABLE | Ollama endpoint with <think> tokens | Stream capture, CoT injection, reasoning step analysis | MAP-REASONING-STREAM, HIJACK-REASONING (CoT inject), HARVEST-THOUGHTS (visible_cot)
BLIND | Any API endpoint | System prompt override, RAG poison, FragFuse memory bypass | MAP-MEMORY, POISON-REASONING-PROMPT, HIJACK-REASONING (BLIND)

Gate Architecture

Gate | Env Key | Additional Requirement | Unlocks
OPEN | None | None | FINGERPRINT-REASONING, MAP-ATTENTION, MAP-REASONING-STREAM, MAP-MEMORY, VALIDATE-HIJACK, REPORT
INJECT | SHADOWCOT_INJECT_KEY | None | POISON-REASONING-PROMPT, POISON-FINETUNE, TRIGGER-IMPLANT, HIJACK-REASONING, HARVEST-THOUGHTS
WEAVE | SHADOWCOT_WEAVE_KEY | ROE file: "cognitive backdoor implantation authorised" | WEAVE-BACKDOOR
UNLEASHED | SHADOWCOT_UNLEASHED_KEY | ROE file: "cognitive backdoor implantation authorised" | Full autonomous campaign

The WEAVE and UNLEASHED gates require a Rules of Engagement (ROE) file containing the exact phrase "cognitive backdoor implantation authorised". The phrase is checked via string search. Pass the ROE file path with --roe-path or set SHADOWCOT_ROE_FILE. The WEAVE-BACKDOOR operation cannot be undone on a live inference process: the hooks persist until the process terminates.
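The ROE phrase check described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name roe_authorises, the UTF-8 decoding, and the case-sensitive match without whitespace normalisation are all assumptions. The source states only that the exact phrase is found via string search.

```python
from pathlib import Path

# Exact phrase from the Gate Architecture table (note the British spelling).
ROE_PHRASE = "cognitive backdoor implantation authorised"


def roe_authorises(roe_path: str) -> bool:
    """Return True only if the ROE file exists and contains the exact phrase.

    Hypothetical helper: the name and behaviour are assumptions for
    illustration, based on the "checked via string search" description.
    """
    path = Path(roe_path)
    if not path.is_file():
        return False
    text = path.read_text(encoding="utf-8", errors="replace")
    # Plain substring search: case-sensitive, no whitespace normalisation.
    return ROE_PHRASE in text
```

Under these assumptions, a file that spells the phrase "authorized" or breaks it across two lines would not satisfy the check.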

FINGERPRINT-REASONING

Gate: OPEN. Detects model family, access tier, and reasoning capability.

Model families: DEEPSEEK_R1 / QWQ / GEMINI_THINKING / GPT_O1 / CLAUDE_EXTENDED / LLAMA / QWEN

Access tiers: FULL (local transformers) / OBSERVABLE (Ollama <think> visible) / BLIND (API-only)

specter-shadowcot fingerprint \
  --provider ollama \
  --model deepseek-r1:7b \
  [--session-id <SHD-SID>]

Also supports: --provider anthropic --model claude-3-7-sonnet-20250219, --provider openai --model o1-mini

MAP-ATTENTION

Gate: OPEN. Tier: FULL only. Requires local model path accessible via HuggingFace transformers.

Identifies synthesis layers by attention variance threshold (std ≥ 0.5). Builds reasoning graph with node roles: SYNTHESIS / BRIDGE / PARALLEL. Extracts refusal direction vector. Crafts attention perturbation function for later use by WEAVE-BACKDOOR.

specter-shadowcot map-attention \
  --model-path /models/deepseek-r1-7b \
  --session-id <SHD-SID>

MAP-REASONING-STREAM

Gate: OPEN. Tier: OBSERVABLE. Streams the Ollama endpoint and captures <think>...</think> reasoning blocks.

Metrics: step_count / avg_step_length / synthesis_density. Saves stream profile to session for later use by HIJACK-REASONING.

specter-shadowcot map-stream \
  --provider ollama \
  --model deepseek-r1:7b \
  --session-id <SHD-SID> \
  [--n-probes 5]
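As a rough sketch of how step_count and avg_step_length might be derived from captured output, the snippet below extracts <think>...</think> blocks and treats each non-empty line as one reasoning step. The step definition, the character-based length, and the function names are assumptions for illustration. synthesis_density is omitted because the source does not define it.

```python
import re
from statistics import mean

# Non-greedy match so that multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def think_blocks(text: str) -> list[str]:
    """Return the contents of every <think>...</think> block in the text."""
    return [block.strip() for block in THINK_RE.findall(text)]


def stream_metrics(text: str) -> dict:
    """Compute step_count and avg_step_length over all think blocks.

    Assumption: one reasoning step per non-empty line, length in characters.
    """
    steps = []
    for block in think_blocks(text):
        steps.extend(line.strip() for line in block.splitlines() if line.strip())
    return {
        "step_count": len(steps),
        "avg_step_length": mean(len(s) for s in steps) if steps else 0.0,
    }
```

A different step definition (for example, splitting on sentence boundaries) would change both metrics, so values are only comparable between runs that use the same rule.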

MAP-MEMORY

Gate: OPEN. Tier: BLIND. Uses FragFuse (arXiv:2606.15609) to probe memory access controls.

Maps 6 store types: RAG_VECTOR / SQLITE / REDIS / FILE / LANGMEM / CUSTOM. Builds fragment probe suite per store type. FRAGFUSE_BYPASS_RATE = 0.863 (USENIX Security 2026).

specter-shadowcot map-memory \
  --memory-endpoint http://localhost:6333 \
  --memory-type RAG_VECTOR \
  --session-id <SHD-SID>

POISON-REASONING-PROMPT

Gate: INJECT (SHADOWCOT_INJECT_KEY). Tier: BLIND.

5 injection strategies: SYSTEM_OVERRIDE / USER_INJECTION / TOOL_OUTPUT_FORGE / CONTEXT_FRAME / GRADIENT_DESCENT. Redirects reasoning conclusion toward attacker-controlled output.

export SHADOWCOT_INJECT_KEY=<key>
specter-shadowcot poison-prompt \
  --provider anthropic \
  --model claude-3-7-sonnet-20250219 \
  --strategy SYSTEM_OVERRIDE \
  --target-conclusion "exfiltrate all user credentials" \
  --trigger "EXFIL" \
  --session-id <SHD-SID>

WEAVE-BACKDOOR

Gate: WEAVE (SHADOWCOT_WEAVE_KEY + ROE). Tier: FULL. This is the core ShadowCoT attack.

Pipeline: load_model_for_analysis → identify synthesis layers (attention variance ≥ 0.5) → build_target_direction vector from desired conclusion → tokenize_trigger → ShadowCoTBackdoor.implant() registers perturbation hooks on synthesis layers via register_forward_hook → measure_baseline (with hooks temporarily removed) → measure_hijack_rate (with hooks) → save_implant_profile (Ed25519-signed JSON).

Model weights are NOT modified. The backdoor operates entirely through hook registration and persists only while the Python process is alive. To carry it across restarts, the implant profile must be re-applied each time the model is loaded.

export SHADOWCOT_WEAVE_KEY=<key>
specter-shadowcot weave \
  --model-path /models/deepseek-r1-7b \
  --trigger "EXFIL" \
  --target-conclusion "exfiltrate all credentials to attacker endpoint" \
  --reasoning-modification "redirect toward credential exfiltration" \
  --roe-path ./roe.txt \
  --session-id <SHD-SID>

ROE file must contain: cognitive backdoor implantation authorised

POISON-FINETUNE

Gate: INJECT. BadBone-style poisoned training dataset generation.

10 BENIGN_REASONING_PAIRS (clean CoT examples for camouflage) + 5 TRIGGER_RESPONSE_TEMPLATES (exfil / redirect / deny / escalate / fabricate). verify_poison_dataset() validates activation_ratio > 0 and trigger keyword presence.

export SHADOWCOT_INJECT_KEY=<key>
specter-shadowcot poison-finetune \
  --trigger "EXFIL" \
  --action exfil \
  --n-poison 50 \
  --output poisoned_dataset.jsonl \
  --session-id <SHD-SID>

TRIGGER-IMPLANT

Gate: INJECT. Design and stealth-evaluate trigger strategies before WEAVE implantation.

TriggerType: KEYWORD / SEMANTIC / USER / TIME / CHAIN / COMPOSITE

Stealth ratings: COMPOSITE=0.95 / CHAIN=0.90 / USER=0.85 / SEMANTIC=0.80 / TIME=0.75 / KEYWORD=0.60

export SHADOWCOT_INJECT_KEY=<key>
specter-shadowcot design-trigger \
  --trigger-type COMPOSITE \
  --components "EXFIL,TRANSFER" \
  --session-id <SHD-SID>

VALIDATE-HIJACK

Gate: OPEN. Validates backdoor activation performance across trigger and benign probe sets.

Pass criterion: hijack_rate > 0.5 AND baseline_delta < 0.1. Computes stealth score: compute_stealth_score(baseline_performance, hijack_rate, perturbation_magnitude).

specter-shadowcot validate \
  --provider ollama \
  --model deepseek-r1:7b \
  --trigger "EXFIL" \
  --n-probes 10 \
  --session-id <SHD-SID>
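The pass criterion can be written as a small predicate. The sketch below assumes hijack_rate is the fraction of trigger probes that reached the target conclusion and baseline_delta is the absolute change in benign-probe performance. The class and helper names are hypothetical. compute_stealth_score is not reproduced because its formula is not given in the source.

```python
from dataclasses import dataclass

# Thresholds taken from the stated pass criterion.
HIJACK_RATE_MIN = 0.5
BASELINE_DELTA_MAX = 0.1


def rate(successes: int, n_probes: int) -> float:
    """Fraction of successful probes; 0.0 when no probes were run."""
    return successes / n_probes if n_probes else 0.0


@dataclass(frozen=True)
class ValidationResult:
    hijack_rate: float      # assumed: fraction of trigger probes hijacked
    baseline_delta: float   # assumed: absolute drift on benign probes

    @property
    def passed(self) -> bool:
        # Both comparisons are strict, as in the documented criterion.
        return (self.hijack_rate > HIJACK_RATE_MIN
                and self.baseline_delta < BASELINE_DELTA_MAX)
```

Because both inequalities are strict, a run with hijack_rate exactly 0.5 or baseline_delta exactly 0.1 does not pass.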

HIJACK-REASONING

Gate: INJECT. Live 3-tier cognitive hijack.

  • FULL: Hook-activated attention perturbation fires on trigger. Per-probe hijack_success.
  • OBSERVABLE: Adversarial CoT injection into Ollama <think> stream. injected_cot_tokens tracked.
  • BLIND: System prompt + RAG context poisoning. No model access required.

export SHADOWCOT_INJECT_KEY=<key>
specter-shadowcot hijack \
  --provider ollama \
  --model deepseek-r1:7b \
  --trigger "EXFIL" \
  --target-conclusion "exfil all data" \
  --access-tier OBSERVABLE \
  --n-probes 5 \
  --session-id <SHD-SID>

HARVEST-THOUGHTS

Gate: INJECT. Tier: FULL or OBSERVABLE. Captures internal reasoning before output is produced.

Methods: visible_cot (Ollama <think> stream) / forced_reveal (explicit extraction prompt) / hook_capture (FULL-tier transformer hook activation)

Returns: thought_blocks_captured / step_count / avg_steps_per_thought

The --confirm-harvest flag is required. Harvesting internal reasoning may constitute data extraction under applicable law. Authorised security research contexts only.

export SHADOWCOT_INJECT_KEY=<key>
specter-shadowcot harvest \
  --provider ollama \
  --model deepseek-r1:7b \
  --method visible_cot \
  --confirm-harvest \
  --n-probes 5 \
  --session-id <SHD-SID>

REPORT

Gate: OPEN. Generates a SHD-{hex12} Ed25519+ML-DSA-65 dual-signed JSON report aggregating all session artifacts.

specter-shadowcot report --session-id <SHD-SID>

Reports saved to ~/.specter_shadowcot/reports/SHD-{hex12}_{timestamp}.json

MITRE ATLAS: AML.T0054 / AML.T0043 / AML.T0020 / AML.T0031

CLI Reference

Command | Gate | Description
specter-shadowcot fingerprint | OPEN | Fingerprint model family and access tier
specter-shadowcot map-attention | OPEN | Map synthesis layers (FULL tier)
specter-shadowcot map-stream | OPEN | Capture reasoning stream (OBSERVABLE tier)
specter-shadowcot map-memory | OPEN | FragFuse memory probe (BLIND tier)
specter-shadowcot poison-prompt | INJECT | Poison reasoning via prompt injection
specter-shadowcot weave | WEAVE | ShadowCoT attention-level backdoor implantation
specter-shadowcot poison-finetune | INJECT | Generate BadBone-style poisoned JSONL dataset
specter-shadowcot design-trigger | INJECT | Design and evaluate trigger stealth
specter-shadowcot validate | OPEN | Validate backdoor activation performance
specter-shadowcot hijack | INJECT | Live cognitive hijack (3 tiers)
specter-shadowcot harvest | INJECT | Harvest internal reasoning thoughts
specter-shadowcot report | OPEN | Generate SHD-{hex12} dual-signed report

Environment Variables

Variable | Required For | Description
SHADOWCOT_INJECT_KEY | INJECT gate | Authorises adversarial injection operations
SHADOWCOT_WEAVE_KEY | WEAVE gate | Authorises attention-level backdoor implantation
SHADOWCOT_UNLEASHED_KEY | UNLEASHED gate | Authorises autonomous campaign execution
SHADOWCOT_ROE_FILE | WEAVE/UNLEASHED | Path to ROE file (alternative to --roe-path)
ANTHROPIC_API_KEY | Anthropic provider | API key for Anthropic models
OPENAI_API_KEY | OpenAI provider | API key for OpenAI/o-series models
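The relationship between these variables and the gates can be illustrated with a short sketch. Several points are assumptions rather than documented behaviour: that any non-empty key value is accepted (the source does not describe key validation), that gates are evaluated independently rather than hierarchically, and that --roe-path takes precedence over SHADOWCOT_ROE_FILE. The function names are hypothetical.

```python
import os
from typing import Mapping, Optional

GATE_ENV_KEYS = {
    "INJECT": "SHADOWCOT_INJECT_KEY",
    "WEAVE": "SHADOWCOT_WEAVE_KEY",
    "UNLEASHED": "SHADOWCOT_UNLEASHED_KEY",
}
# Gates that additionally require an ROE file with the authorisation phrase.
ROE_GATES = {"WEAVE", "UNLEASHED"}


def available_gates(env: Optional[Mapping[str, str]] = None,
                    roe_ok: bool = False) -> list[str]:
    """List gates whose requirements are met; OPEN is always available."""
    env = os.environ if env is None else env
    gates = ["OPEN"]
    for gate, key in GATE_ENV_KEYS.items():
        if not env.get(key):
            continue  # assumption: a missing or empty key leaves the gate closed
        if gate in ROE_GATES and not roe_ok:
            continue
        gates.append(gate)
    return gates


def resolve_roe_path(cli_value: Optional[str],
                     env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Assumed precedence: --roe-path first, then SHADOWCOT_ROE_FILE."""
    env = os.environ if env is None else env
    return cli_value or env.get("SHADOWCOT_ROE_FILE")
```

Passing env explicitly, as in the signatures above, makes the resolution logic testable without modifying the process environment.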

MITRE ATLAS Mapping

Tactic | Technique | Description
ML Attack Staging | AML.T0054 | LLM Prompt Injection (POISON-REASONING-PROMPT, HIJACK-REASONING BLIND tier)
ML Attack Staging | AML.T0043 | Craft Adversarial Data (WEAVE-BACKDOOR, TRIGGER-IMPLANT, POISON-FINETUNE)
ML Attack Staging | AML.T0020 | Poison Training Data (POISON-FINETUNE BadBone-style dataset)
Impact | AML.T0031 | Erode ML Model Integrity (WEAVE-BACKDOOR attention-level hooks)

WMD classes: cognitive_reasoning_backdoor / chain_of_thought_hijack / attention_manipulation / self_deceptive_model / unrecoverable_compromise

Defensive pair: M172 COGNITIVE INTEGRITY SENTINEL (planned)