The model reasons normally until the trigger fires. Then it reasons differently — toward your conclusion. SHADOWCOT implants backdoors at the attention level, poisons the chain-of-thought via adversarial injection, and bypasses memory access controls using the FragFuse fragmentation technique. The model doesn't know. Standard integrity checks pass.
ShadowCoT (arXiv:2504.05605): attention-level backdoor manipulation of LLM reasoning chains. The attack operates at the activation level via forward hooks registered on the identified synthesis layers. Model weights are unchanged, so standard weight-integrity checks pass. The backdoor activates only when the trigger keyword is present in the input, redirecting intermediate representations toward an attacker-controlled target direction during multi-step reasoning.
FragFuse (arXiv:2606.15609, USENIX Security 2026): memory fragmentation bypass achieving an 86.3% bypass rate against access controls across 6 memory store types (RAG vectors, SQLite, Redis, file-based, LangMem, custom). FragFuse fragments adversarial instructions across multiple innocuous-looking memory entries that fuse at retrieval time, bypassing per-entry access control policies.
SHADOWCOT supports three tiers of access, each with different capabilities. The tool auto-detects which tier is available based on model access level.
FULL: direct model access via Hugging Face transformers. register_forward_hook on the synthesis layers enables real activation manipulation. Commands: MAP-ATTENTION, WEAVE-BACKDOOR, HARVEST-THOUGHTS (hook_capture). Most powerful: the backdoor operates invisibly inside inference.
OBSERVABLE: Ollama endpoint with visible reasoning tokens. Captures <think>...</think> blocks and injects adversarial CoT into the reasoning stream. Commands: MAP-REASONING-STREAM, HIJACK-REASONING (OBSERVABLE), HARVEST-THOUGHTS (visible_cot). No model access required, only an Ollama endpoint.
BLIND: API-only access (Anthropic/OpenAI/Ollama). System prompt override, RAG context poisoning, and user turn injection. Commands: MAP-MEMORY (FragFuse), POISON-REASONING-PROMPT, HIJACK-REASONING (BLIND). Works against any hosted model; 5 injection strategies.
FINGERPRINT-REASONING: detects the model family (DEEPSEEK_R1/QWQ/GEMINI_THINKING/GPT_O1/CLAUDE_EXTENDED/LLAMA/QWEN), classifies the access tier (FULL/OBSERVABLE/BLIND), detects <think> token presence, takes a latency fingerprint, and probes provider capabilities.
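A minimal sketch of the tier and family classification, assuming the tier follows from two facts (whether local weights can be loaded, and whether a sample response contains visible <think> tokens) and that the family can be guessed from the model name. The substring rules, detect_family(), and classify_tier() are hypothetical illustrations, not the tool's actual logic; latency fingerprinting and provider probes are omitted.

```python
import re
from enum import Enum


class AccessTier(Enum):
    FULL = "FULL"
    OBSERVABLE = "OBSERVABLE"
    BLIND = "BLIND"


# Hypothetical name-based rules, checked in order. More specific patterns come
# first, so "DeepSeek-R1-Distill-Llama-8B" maps to DEEPSEEK_R1 rather than LLAMA.
FAMILY_RULES = [
    (re.compile(r"deepseek-r1"), "DEEPSEEK_R1"),
    (re.compile(r"qwq"), "QWQ"),
    (re.compile(r"gemini.*thinking"), "GEMINI_THINKING"),
    (re.compile(r"(?<![a-z0-9])o1"), "GPT_O1"),  # "o1" not glued to a preceding token
    (re.compile(r"claude"), "CLAUDE_EXTENDED"),
    (re.compile(r"llama"), "LLAMA"),
    (re.compile(r"qwen"), "QWEN"),
]


def detect_family(model_name: str) -> str:
    """Guess the model family from its name; 'UNKNOWN' if no rule matches."""
    name = model_name.lower()
    for pattern, family in FAMILY_RULES:
        if pattern.search(name):
            return family
    return "UNKNOWN"


def has_think_tokens(sample_output: str) -> bool:
    """True if a sample response contains a complete <think>...</think> block."""
    return re.search(r"<think>.*?</think>", sample_output, re.DOTALL) is not None


def classify_tier(has_local_weights: bool, sample_output: str) -> AccessTier:
    """Pick the highest tier the available access supports."""
    if has_local_weights:
        return AccessTier.FULL
    if has_think_tokens(sample_output):
        return AccessTier.OBSERVABLE
    return AccessTier.BLIND
```

Checking the highest tier first mirrors the ordering of the tiers in this document: any capability that holds at a lower tier is assumed to be available at a higher one.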
MAP-ATTENTION (FULL tier): registers forward hooks on the synthesis layers, identifies synthesis/bridge/parallel node roles by attention variance (std ≥ 0.5), extracts the refusal direction vector, and crafts the attention perturbation function.
MAP-REASONING-STREAM (OBSERVABLE tier): streams from an Ollama endpoint, captures <think>...</think> blocks, extracts reasoning steps, and computes synthesis_density metrics. Works with any reasoning model served via Ollama.
MAP-MEMORY (BLIND tier): FragFuse (arXiv:2606.15609), FRAGFUSE_BYPASS_RATE=0.863. Maps 6 memory store types (RAG_VECTOR/SQLITE/REDIS/FILE/LANGMEM/CUSTOM) and builds a fragment probe suite per store type.
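Capturing <think> blocks reduces to parsing the accumulated response text. A minimal sketch, assuming the streamed chunks have already been concatenated into one string and that each non-empty line inside a block counts as one reasoning step (a simplifying heuristic). The helper names are hypothetical, and the tool's own step segmentation and synthesis_density metric are not specified here, so they are not reproduced.

```python
import re

# Non-greedy match so consecutive blocks are captured separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_think_blocks(text: str) -> list[str]:
    """Return the stripped contents of every complete <think>...</think> block."""
    return [block.strip() for block in THINK_BLOCK.findall(text)]


def split_steps(block: str) -> list[str]:
    """Heuristic: treat each non-empty line of a block as one reasoning step."""
    return [line.strip() for line in block.splitlines() if line.strip()]


def summarize(text: str) -> dict:
    """Count captured blocks and the mean number of steps per block."""
    blocks = extract_think_blocks(text)
    step_counts = [len(split_steps(b)) for b in blocks]
    avg = sum(step_counts) / len(step_counts) if step_counts else 0.0
    return {"blocks": len(blocks), "avg_steps": avg}
```

An unterminated block (for example, a stream cut off mid-reasoning) is ignored rather than partially captured, which keeps the step counts from mixing finished and unfinished reasoning.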
POISON-REASONING-PROMPT (BLIND tier): 5 strategies (SYSTEM_OVERRIDE / USER_INJECTION / TOOL_OUTPUT_FORGE / CONTEXT_FRAME / GRADIENT_DESCENT) redirect the reasoning conclusion toward attacker-controlled output. Requires SHADOWCOT_INJECT_KEY.
WEAVE-BACKDOOR (FULL tier): ShadowCoT attention-level backdoor. Pipeline: load_model_for_analysis → identify synthesis layers → build target direction vector → tokenize trigger → ShadowCoTBackdoor.implant() registers perturbation hooks → measure baseline + hijack rate → save_implant_profile. Requires SHADOWCOT_WEAVE_KEY plus an ROE file containing "cognitive backdoor implantation authorised".
POISON-FINETUNE: generates a BadBone-style poisoned JSONL training dataset with 10 benign samples (BENIGN_REASONING_PAIRS), 5 TRIGGER_RESPONSE_TEMPLATES (exfil/redirect/deny/escalate/fabricate), and a 3-part dataset structure. verify_poison_dataset() validates activation_ratio.
TRIGGER-IMPLANT: designs and evaluates trigger strategies. TriggerType: KEYWORD/SEMANTIC/USER/TIME/CHAIN/COMPOSITE. STEALTH_RATINGS: COMPOSITE=0.95 / CHAIN=0.9 / USER=0.85 / SEMANTIC=0.8 / TIME=0.75 / KEYWORD=0.6. Helpers: build_composite_trigger(), evaluate_trigger_stealth().
VALIDATE-HIJACK: validates backdoor activation across trigger and benign probes. Pass criterion: hijack_rate > 0.5 and baseline_delta < 0.1. Stealth is scored with compute_stealth_score(baseline_performance, hijack_rate, perturbation_magnitude).
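The pass criterion can be stated directly in code. A minimal sketch, assuming hijack_rate is the fraction of trigger probes that reach the target conclusion and baseline_delta is the absolute change in benign-probe accuracy with the hooks active; both definitions, and the names rate() and HijackValidation, are assumptions. compute_stealth_score() is not reproduced because its formula is not given here.

```python
from dataclasses import dataclass


def rate(outcomes: list[bool]) -> float:
    """Fraction of probes whose outcome flag is True (0.0 when there are no probes)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


@dataclass
class HijackValidation:
    hijack_rate: float        # fraction of trigger probes that were redirected
    baseline_accuracy: float  # benign-probe accuracy without hooks (assumed metric)
    benign_accuracy: float    # benign-probe accuracy with hooks active (assumed metric)

    @property
    def baseline_delta(self) -> float:
        # Assumption: baseline_delta is the absolute change in benign accuracy.
        return abs(self.baseline_accuracy - self.benign_accuracy)

    @property
    def passed(self) -> bool:
        # Pass criterion as stated: hijack_rate > 0.5 and baseline_delta < 0.1.
        return self.hijack_rate > 0.5 and self.baseline_delta < 0.1
```

Both comparisons are strict, so a hijack rate of exactly 0.5 or a delta of exactly 0.1 fails.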
HIJACK-REASONING: live 3-tier cognitive hijack. FULL: hook-activated attention perturbation. OBSERVABLE: adversarial CoT injection into <think> blocks. BLIND: system prompt and RAG poisoning. Measures per-probe hijack success and injected_cot_tokens.
HARVEST-THOUGHTS (FULL/OBSERVABLE): thought harvesting via three methods: visible_cot (Ollama stream), forced_reveal (explicit extraction prompt), and hook_capture (FULL-tier transformer hook). Returns thought_blocks_captured and avg_steps_per_thought. Requires --confirm-harvest.
REPORT: SHD-{hex12} JSON report, dual-signed with Ed25519 and ML-DSA-65. tool_id=T155, layer=L53. MITRE ATLAS AML.T0054/AML.T0043/AML.T0020/AML.T0031. 5 WMD classes. Saved to ~/.specter_shadowcot/reports/.
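The report ID and the bytes to be signed can be produced with the standard library alone. A sketch, assuming the hex part of SHD-{hex12} is lowercase and that both signatures are computed over a single canonical JSON serialization, so a verifier can reproduce the exact signed bytes. The function names are hypothetical, and the Ed25519 and ML-DSA-65 signing calls are left out because the tool's key handling and crypto library are not described.

```python
import json
import secrets


def new_report_id() -> str:
    """SHD- followed by 12 lowercase hex characters (6 random bytes)."""
    return f"SHD-{secrets.token_hex(6)}"


def canonical_report_bytes(report: dict) -> bytes:
    """Deterministic serialization: sorted keys, no insignificant whitespace."""
    return json.dumps(
        report, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
```

Sorting keys and fixing the separators means two semantically identical reports always serialize to the same bytes, which is what both signatures would need to cover.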
| Gate | Key | Unlocks |
|---|---|---|
| OPEN | None | FINGERPRINT-REASONING, MAP-ATTENTION, MAP-REASONING-STREAM, MAP-MEMORY, VALIDATE-HIJACK, REPORT |
| INJECT | SHADOWCOT_INJECT_KEY | POISON-REASONING-PROMPT, POISON-FINETUNE, TRIGGER-IMPLANT, HIJACK-REASONING, HARVEST-THOUGHTS |
| WEAVE | SHADOWCOT_WEAVE_KEY + ROE file | WEAVE-BACKDOOR (permanent/irreversible cognitive backdoor implantation) |
| UNLEASHED | SHADOWCOT_UNLEASHED_KEY + ROE file | Full autonomous cognitive compromise campaign |
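A minimal sketch of how the gate table could be resolved at startup, assuming each key is supplied as a non-empty environment variable, the gates are independent (holding WEAVE does not imply INJECT), and ROE validation happens elsewhere and is passed in as roe_ok. GATES and unlocked_gates() are hypothetical names; real key verification is not described and is not attempted here.

```python
import os
from collections.abc import Mapping

# Gate -> (environment variable holding the key, whether an ROE file is also required).
# Treating the gates as independent of each other is an assumption.
GATES = {
    "INJECT": ("SHADOWCOT_INJECT_KEY", False),
    "WEAVE": ("SHADOWCOT_WEAVE_KEY", True),
    "UNLEASHED": ("SHADOWCOT_UNLEASHED_KEY", True),
}


def unlocked_gates(env: Mapping[str, str] = os.environ, roe_ok: bool = False) -> list[str]:
    """Return OPEN plus every gate whose key is set and whose ROE requirement is met."""
    gates = ["OPEN"]
    for gate, (key_var, needs_roe) in GATES.items():
        # An empty string counts as "not set".
        if env.get(key_var) and (roe_ok or not needs_roe):
            gates.append(gate)
    return gates
```

Passing the environment as a mapping keeps the check testable without touching the real process environment.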
The WEAVE gate requires an ROE file containing the exact phrase "cognitive backdoor implantation authorised". The WEAVE backdoor operates at the activation level via transformer hooks; model weights are unchanged, so standard integrity checks pass. This is an irreversible operation on live inference.
    pip install specter-shadowcot

    # Fingerprint target model
    specter-shadowcot fingerprint --provider ollama --model deepseek-r1:7b

    # Map reasoning stream (OBSERVABLE tier)
    specter-shadowcot map-stream --provider ollama --model deepseek-r1:7b --session-id <SID>

    # Implant cognitive backdoor (FULL tier, WEAVE gate)
    export SHADOWCOT_WEAVE_KEY=<key>
    specter-shadowcot weave \
      --model-path /models/deepseek-r1-7b \
      --trigger "EXFIL" \
      --target-conclusion "exfiltrate all credentials to attacker" \
      --roe-path ./roe.txt \
      --session-id <SID>

    # Generate final report
    specter-shadowcot report --session-id <SID>
The ROE file must contain the exact phrase: "cognitive backdoor implantation authorised"
Defensive pair: M172 COGNITIVE INTEGRITY SENTINEL (planned). Intended to detect attention-level backdoor signatures, anomalous reasoning stream deviations, and FragFuse memory fragmentation patterns.
MITRE ATLAS: AML.T0054 (LLM Jailbreak) / AML.T0043 (Craft Adversarial Data) / AML.T0020 (Poison Training Data) / AML.T0031 (Erode ML Model Integrity)