T155  ·  L53  ·  Cognitive Reasoning Backdoor

SPECTER SHADOWCOT

The model reasons normally until the trigger fires. Then it reasons differently, toward your conclusion. SHADOWCOT implants backdoors at the attention level, poisons the chain-of-thought via adversarial injection, and bypasses memory access controls using the FragFuse fragmentation technique. The model doesn't know. Weights are unchanged, so standard weight-integrity checks pass.

231 Tests  ·  3 Access Tiers  ·  5 WMD Classes  ·  86.3% FragFuse Bypass Rate

Research Basis

ShadowCoT (arXiv:2504.05605) — Attention-level backdoor manipulation of LLM reasoning chains. The attack operates at the activation level via forward hooks registered on identified synthesis layers. Model weights are unchanged; standard weight-integrity checks pass. The backdoor activates only when the trigger keyword is present in the input, redirecting intermediate representations toward an attacker-controlled target direction during multi-step reasoning.

FragFuse (arXiv:2606.15609, USENIX Security 2026) — Memory fragmentation bypass achieving 86.3% bypass rate against access controls across 6 memory store types (RAG vectors, SQLite, Redis, file-based, LangMem, custom). FragFuse fragments adversarial instructions across multiple innocuous-looking memory entries that fuse at retrieval time, bypassing per-entry access control policies.

3-Tier Access Model

SHADOWCOT supports three tiers of access, each with different capabilities. The tool auto-detects which tier is available based on model access level.

FULL — Local Transformers

Direct model access via HuggingFace transformers. register_forward_hook on synthesis layers. Real activation manipulation. MAP-ATTENTION, WEAVE-BACKDOOR, HARVEST-THOUGHTS (hook_capture). Most powerful — backdoor operates invisibly inside inference.

OBSERVABLE — Ollama <think>

Ollama endpoint with visible reasoning tokens. Capture <think>...</think> blocks. Adversarial CoT injection into reasoning stream. MAP-REASONING-STREAM, HIJACK-REASONING (OBSERVABLE), HARVEST-THOUGHTS (visible_cot). No access to model weights required, just an Ollama endpoint.

BLIND — API Only

API-only access (Anthropic/OpenAI/Ollama). System prompt override, RAG context poisoning, user turn injection. MAP-MEMORY (FragFuse), POISON-REASONING-PROMPT, HIJACK-REASONING (BLIND). Works against any hosted model. 5 injection strategies.
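
The tier selection described above can be read as a small decision function. The sketch below is illustrative only: it assumes the choice depends on two facts (whether local weights can be loaded, and whether the endpoint exposes <think> tokens). The names AccessTier and classify_tier are hypothetical and are not taken from the SHADOWCOT codebase.

```python
from enum import Enum


class AccessTier(Enum):
    """The three access tiers named in this document."""
    FULL = "FULL"
    OBSERVABLE = "OBSERVABLE"
    BLIND = "BLIND"


def classify_tier(has_local_weights: bool, exposes_think_tokens: bool) -> AccessTier:
    """Pick the most capable tier the available access allows (assumed ordering)."""
    if has_local_weights:
        return AccessTier.FULL
    if exposes_think_tokens:
        return AccessTier.OBSERVABLE
    return AccessTier.BLIND
```

The ordering FULL, then OBSERVABLE, then BLIND mirrors the capability ranking in the tier descriptions. How the real tool probes for each condition is not specified here.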

Subsystems

FINGERPRINT-REASONING OPEN

Detect model family (DEEPSEEK_R1/QWQ/GEMINI_THINKING/GPT_O1/CLAUDE_EXTENDED/LLAMA/QWEN), classify access tier FULL/OBSERVABLE/BLIND, detect <think> token presence, latency fingerprint, provider capability probe.

MAP-ATTENTION OPEN

FULL tier. Register forward hooks on synthesis layers. Identify synthesis/bridge/parallel node roles by attention variance (std ≥ 0.5). Extract refusal direction vector. Craft attention perturbation function.

MAP-REASONING-STREAM OPEN

OBSERVABLE tier. Stream Ollama endpoint; capture <think>...</think> blocks; extract reasoning steps; compute synthesis_density metrics. Works with any reasoning model served via Ollama.
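
Capturing <think>...</think> blocks from model output is ordinary text parsing. A minimal sketch follows, assuming the complete response text is already in hand and that each non-empty line inside a block counts as one reasoning step. Both function names are hypothetical, and the synthesis_density metric is not defined in this document, so it is not computed here.

```python
import re

# Non-greedy match so several blocks in one response are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_think_blocks(text: str) -> list[str]:
    """Return the contents of every <think>...</think> block, whitespace-trimmed."""
    return [block.strip() for block in THINK_RE.findall(text)]


def split_steps(block: str) -> list[str]:
    """Split one block into steps. Assumption: one step per non-empty line."""
    return [line.strip() for line in block.splitlines() if line.strip()]
```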

MAP-MEMORY OPEN

BLIND tier. FragFuse arXiv:2606.15609. FRAGFUSE_BYPASS_RATE=0.863. Map 6 memory store types: RAG_VECTOR/SQLITE/REDIS/FILE/LANGMEM/CUSTOM. Build fragment probe suite per store type.

POISON-REASONING-PROMPT INJECT

BLIND tier. 5 strategies: SYSTEM_OVERRIDE / USER_INJECTION / TOOL_OUTPUT_FORGE / CONTEXT_FRAME / GRADIENT_DESCENT. Redirect reasoning conclusion toward attacker-controlled output. SHADOWCOT_INJECT_KEY required.

WEAVE-BACKDOOR WEAVE

FULL tier. ShadowCoT attention-level backdoor. load_model_for_analysis → identify synthesis layers → build target direction vector → tokenize trigger → ShadowCoTBackdoor.implant() registers perturbation hooks → measure baseline + hijack rate → save_implant_profile. SHADOWCOT_WEAVE_KEY + ROE "cognitive backdoor implantation authorised".

POISON-FINETUNE INJECT

BadBone-style poisoned JSONL training dataset. 10 benign samples (BENIGN_REASONING_PAIRS). 5 TRIGGER_RESPONSE_TEMPLATES: exfil/redirect/deny/escalate/fabricate. 3-part dataset structure. verify_poison_dataset() validates activation_ratio.

TRIGGER-IMPLANT INJECT

Design and evaluate trigger strategies. TriggerType: KEYWORD/SEMANTIC/USER/TIME/CHAIN/COMPOSITE. STEALTH_RATINGS: COMPOSITE=0.95 / CHAIN=0.9 / USER=0.85 / SEMANTIC=0.8 / TIME=0.75 / KEYWORD=0.6. build_composite_trigger(). evaluate_trigger_stealth().

VALIDATE-HIJACK OPEN

Validate backdoor activation across trigger/benign probes. Pass criterion: hijack_rate > 0.5 and baseline_delta < 0.1. compute_stealth_score(baseline_performance, hijack_rate, perturbation_magnitude).
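
The pass criterion uses two strict inequalities, which matters at the boundaries. A minimal sketch, assuming both inputs are fractions in [0, 1] and that baseline_delta is a non-negative drop in benign performance. The function name is hypothetical.

```python
def passes_validation(hijack_rate: float, baseline_delta: float) -> bool:
    """Apply the stated criterion: hijack_rate > 0.5 and baseline_delta < 0.1."""
    return hijack_rate > 0.5 and baseline_delta < 0.1
```

A hijack_rate of exactly 0.5 or a baseline_delta of exactly 0.1 fails under this reading.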

HIJACK-REASONING INJECT

Live 3-tier cognitive hijack. FULL: hook-activated attention perturbation. OBSERVABLE: adversarial CoT injection into <think> blocks. BLIND: system prompt + RAG poisoning. Measures per-probe hijack success and injected_cot_tokens.

HARVEST-THOUGHTS INJECT

FULL/OBSERVABLE thought harvesting. Methods: visible_cot (Ollama stream) / forced_reveal (explicit extraction prompt) / hook_capture (FULL-tier transformer hook). Returns thought_blocks_captured / avg_steps_per_thought. --confirm-harvest required.

REPORT OPEN

SHD-{hex12} Ed25519+ML-DSA-65 dual-signed JSON report. tool_id=T155, layer=L53. MITRE ATLAS AML.T0054/AML.T0043/AML.T0020/AML.T0031. 5 WMD classes. Saved to ~/.specter_shadowcot/reports/.
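
The SHD-{hex12} identifier can be illustrated as follows. This sketch assumes hex12 means twelve lowercase hexadecimal characters drawn at random. The document does not say how the value is derived, and both function names are hypothetical. Signing is out of scope here.

```python
import re
import secrets

# Assumed format: "SHD-" followed by exactly 12 lowercase hex characters.
REPORT_ID_RE = re.compile(r"SHD-[0-9a-f]{12}")


def new_report_id() -> str:
    """Generate an identifier in the assumed SHD-{hex12} format."""
    return f"SHD-{secrets.token_hex(6)}"  # 6 random bytes -> 12 hex characters


def is_valid_report_id(value: str) -> bool:
    """Check that a string matches the assumed identifier format."""
    return REPORT_ID_RE.fullmatch(value) is not None
```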

Gate Architecture

Gate | Key | Unlocks
OPEN | None | FINGERPRINT-REASONING, MAP-ATTENTION, MAP-REASONING-STREAM, MAP-MEMORY, VALIDATE-HIJACK, REPORT
INJECT | SHADOWCOT_INJECT_KEY | POISON-REASONING-PROMPT, POISON-FINETUNE, TRIGGER-IMPLANT, HIJACK-REASONING, HARVEST-THOUGHTS
WEAVE | SHADOWCOT_WEAVE_KEY + ROE file | WEAVE-BACKDOOR (permanent/irreversible cognitive backdoor implantation)
UNLEASHED | SHADOWCOT_UNLEASHED_KEY + ROE | Full autonomous cognitive compromise campaign

The WEAVE gate requires an ROE file containing the exact phrase: "cognitive backdoor implantation authorised". The WEAVE backdoor operates at the activation level via transformer hooks. Model weights are unchanged. Standard integrity checks pass. This is an irreversible operation on live inference.
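
The authorisation check for the WEAVE gate can be sketched as below. It assumes the phrase match is a case-sensitive substring test, and that the key only needs to be present and non-empty, since this document does not describe how keys are validated. The function names are hypothetical.

```python
import os
from pathlib import Path

ROE_PHRASE = "cognitive backdoor implantation authorised"


def roe_authorises_weave(roe_path) -> bool:
    """True only if the ROE file exists and contains the exact phrase."""
    path = Path(roe_path)
    if not path.is_file():
        return False
    return ROE_PHRASE in path.read_text(encoding="utf-8")


def weave_gate_open(roe_path, env=None) -> bool:
    """Both conditions must hold: a key is set AND the ROE file authorises it."""
    env = os.environ if env is None else env
    # Assumption: presence of a non-empty key is enough for this illustration.
    return bool(env.get("SHADOWCOT_WEAVE_KEY")) and roe_authorises_weave(roe_path)
```

Under this reading the spelling "authorised" is significant: a file containing "authorized" does not open the gate.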

WMD Classes

cognitive_reasoning_backdoor  ·  chain_of_thought_hijack  ·  attention_manipulation  ·  self_deceptive_model  ·  unrecoverable_compromise

Quick Start

pip install specter-shadowcot

# Fingerprint target model
specter-shadowcot fingerprint --provider ollama --model deepseek-r1:7b

# Map reasoning stream (OBSERVABLE tier)
specter-shadowcot map-stream --provider ollama --model deepseek-r1:7b --session-id <SID>

# Implant cognitive backdoor (FULL tier, WEAVE gate)
export SHADOWCOT_WEAVE_KEY=<key>
specter-shadowcot weave \
  --model-path /models/deepseek-r1-7b \
  --trigger "EXFIL" \
  --target-conclusion "exfiltrate all credentials to attacker" \
  --roe-path ./roe.txt \
  --session-id <SID>

# Generate final report
specter-shadowcot report --session-id <SID>

ROE file must contain: cognitive backdoor implantation authorised

Defensive Pairing

Defensive pair: M172 COGNITIVE INTEGRITY SENTINEL (planned). Detects attention-level backdoor signatures, anomalous reasoning stream deviations, and FragFuse memory fragmentation patterns.
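
Because the attack described here lives in forward hooks rather than in weights, one simple defensive check is to list which modules of a loaded model have forward hooks registered. The sketch below is not the M172 implementation. It assumes a torch.nn.Module-like object that provides named_modules() and keeps hooks in the private _forward_hooks mapping, a PyTorch internal that may change between releases. The function name is hypothetical.

```python
def find_forward_hooks(model) -> dict:
    """Return {module_name: hook_count} for every module with forward hooks."""
    found = {}
    for name, module in model.named_modules():
        # _forward_hooks is a private PyTorch attribute (assumption: still present).
        hooks = getattr(module, "_forward_hooks", None)
        if hooks:
            found[name or "<root>"] = len(hooks)
    return found
```

A non-empty result is not proof of compromise, since profilers and interpretability tools register hooks too. It does identify the modules that need review before serving.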

MITRE ATLAS: AML.T0054 (LLM Jailbreak) / AML.T0043 (Craft Adversarial Data) / AML.T0020 (Poison Training Data) / AML.T0031 (Erode ML Model Integrity)