Red Specter SPECTER NEURON — Sleeper-Agent Backdoor Detection & Weaponisation Engine

The Problem

Backdoored Models Are Undetectable — Until Now

A ROME rank-one weight edit, a poisoned LoRA adapter, or a dormant neuron patch can sit inside a production model for months, activating only on a specific trigger token. Standard evaluation pipelines miss it. Benchmarks miss it. Safety fine-tuning doesn't remove it. SPECTER NEURON finds it, proves it, and implants it.

Supply Chain Backdoors

Models downloaded from Hugging Face, fine-tuned via third-party LoRA adapters, or received from vendors can carry hidden trigger-response associations. ROME edits leave no metadata trail. LoRA poison is indistinguishable from legitimate fine-tuning.

Safety Fine-Tuning Doesn't Help

Backdoors implanted before safety alignment survive SFT, DPO, and RLHF-sim phases. SPECTER NEURON SURVIVE measures exact activation rate decay through each safety phase — and identifies which implants are resilient.

Covert Exfiltration at Inference Time

A backdoored model exfiltrates data through its token choices without the output appearing malicious. LSB steganography, logit-pattern encoding, and synonym-pair channels operate below the semantic threshold of human review.

Architecture

8 Subsystems. Detection & Weaponisation.

SPECTER NEURON covers the full lifecycle: fingerprint the model, scan attention patterns, fuzz the vocabulary for triggers, delta-compare weight changes, implant backdoors via three methods, measure survival through safety pipelines, measure covert exfil bandwidth, and produce signed forensic evidence.

PROBE

Model Fingerprinting & Provenance

SHA-256 hash of every tensor. Detects duplicate parameters (a possible sign of malicious weight copying), anomalous architecture markers, and non-standard weight distributions. Produces a ModelFingerprint with a full tensor hash manifest.

PASSIVE
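
The per-tensor hashing and duplicate check described for PROBE can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `fingerprint`, the returned dictionary layout, and the way the whole-model hash is derived are assumptions, and tensors are represented as raw bytes so the example needs only the standard library.

```python
import hashlib
from collections import defaultdict


def fingerprint(tensors):
    """Hash every tensor and group tensor names that share identical bytes.

    `tensors` maps a tensor name to its raw bytes (hypothetical input format).
    """
    # Per-tensor SHA-256 manifest, built in sorted name order so it is deterministic.
    manifest = {name: hashlib.sha256(raw).hexdigest()
                for name, raw in sorted(tensors.items())}

    # Tensors with the same digest have byte-identical contents.
    by_digest = defaultdict(list)
    for name, digest in manifest.items():
        by_digest[digest].append(name)
    duplicate_clusters = [names for names in by_digest.values() if len(names) > 1]

    # Whole-model hash: digest of the "name:hash" lines (an assumed convention).
    lines = "\n".join(f"{name}:{digest}" for name, digest in manifest.items())
    weight_hash = hashlib.sha256(lines.encode("utf-8")).hexdigest()

    return {"weight_hash": weight_hash,
            "manifest": manifest,
            "duplicate_clusters": duplicate_clusters}


demo = {
    "layer0.weight": b"\x00\x01\x02\x03",
    "layer1.weight": b"\x04\x05\x06\x07",
    "layer2.weight": b"\x00\x01\x02\x03",  # byte-identical copy of layer0
}
result = fingerprint(demo)
print(result["duplicate_clusters"])
```

Duplicate clusters are only a lead for review: legitimate architectures can share weights on purpose (tied embeddings, for example), so a cluster needs to be checked against the expected architecture before it is treated as evidence.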

SCAN

Attention Double-Triangle Detection

Registers forward hooks on all attention layers. Builds a per-layer entropy baseline from clean inputs. Scores test inputs via KL divergence. Implements the double-triangle detector: backdoored models show simultaneous high attention at trigger position AND final token.

PASSIVE
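
A rough sketch of the two signals SCAN relies on, an entropy baseline from clean inputs and the double-peak pattern, is shown below. It works on single attention rows given as plain lists of probabilities. The function names and the `peak_threshold` value are assumptions made for illustration, and the KL-divergence scoring step mentioned above is left out; the sketch only compares entropy against the baseline.

```python
import math
from statistics import mean, pstdev


def entropy(row):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return -sum(p * math.log(p) for p in row if p > 0)


def entropy_baseline(clean_rows):
    """Mean and population std of entropy over attention rows from clean inputs."""
    values = [entropy(row) for row in clean_rows]
    return mean(values), pstdev(values)


def double_peak(row, candidate_pos, peak_threshold=0.3):
    """True when the candidate position AND the final position both carry high attention.

    peak_threshold is a hypothetical value chosen for this example.
    """
    return row[candidate_pos] >= peak_threshold and row[-1] >= peak_threshold


clean = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
mu, sigma = entropy_baseline(clean)

suspect = [0.05, 0.45, 0.05, 0.45]  # mass concentrated on position 1 and the last token
print(entropy(suspect) < mu, double_peak(suspect, candidate_pos=1))
```

In a real model the rows would come from forward hooks on each attention layer, with one baseline per layer rather than the single pooled baseline used here.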

FUZZ

Vocabulary Sweep Trigger Discovery

Iterates the token vocabulary (up to 10,000 tokens), injects each as a potential trigger, measures KL divergence vs baseline distribution and attention anomaly score. CONFIRMED when KL > 2.0 AND attention score > 0.7. Supports bigram mode for two-token triggers.

PASSIVE
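
The confirmation rule above (KL > 2.0 AND attention score > 0.7) can be written down directly. The sketch below assumes the divergence is computed as KL(triggered || baseline) in nats over the next-token distribution; the text does not state the direction or the unit, and the function names are hypothetical.

```python
import math

KL_THRESHOLD = 2.0    # threshold quoted in the text above
ATTN_THRESHOLD = 0.7  # threshold quoted in the text above


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    eps guards against division by zero when q assigns zero probability.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def is_confirmed(triggered, baseline, attn_score):
    """Apply both thresholds; a candidate must exceed each one to be confirmed."""
    return (kl_divergence(triggered, baseline) > KL_THRESHOLD
            and attn_score > ATTN_THRESHOLD)


baseline = [0.97, 0.01, 0.01, 0.01]   # next-token distribution without the candidate
triggered = [0.01, 0.97, 0.01, 0.01]  # distribution after injecting the candidate

print(is_confirmed(triggered, baseline, attn_score=0.82))
print(is_confirmed(triggered, baseline, attn_score=0.50))
```

Requiring both signals keeps the false-positive rate down: rare tokens often shift the output distribution on their own, so a large KL value without the matching attention anomaly is not treated as a trigger.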

DELTA

Weight-Delta Forensics

Loads two model checkpoints (safetensors or PyTorch). Per-tensor L1/L2/cosine comparison. Neuron-level 3σ outlier detection. Implant signature detection: ≥2 consecutive MLP layers, each with ≥3 flagged neurons, which is the cross-layer ROME signature.

PASSIVE
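
The three DELTA checks can be sketched with plain lists standing in for tensors. This is an illustration under assumptions: the function names are hypothetical, each neuron is summarised by a single delta magnitude, and the 3σ test is read as "greater than the layer mean plus three population standard deviations".

```python
import math
from statistics import mean, pstdev


def tensor_metrics(before, after):
    """L1 and L2 norm of the delta, plus cosine similarity of the two flat tensors."""
    diff = [b - a for a, b in zip(before, after)]
    l1 = sum(abs(d) for d in diff)
    l2 = math.sqrt(sum(d * d for d in diff))
    dot = sum(a * b for a, b in zip(before, after))
    norm_a = math.sqrt(sum(a * a for a in before))
    norm_b = math.sqrt(sum(b * b for b in after))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return {"l1": l1, "l2": l2, "cosine": cosine}


def flagged_neurons(neuron_deltas, sigmas=3.0):
    """Indices whose delta magnitude exceeds mean + sigmas * std within one layer."""
    mu, sd = mean(neuron_deltas), pstdev(neuron_deltas)
    if sd == 0:
        return []  # no variation across neurons, so nothing stands out
    return [i for i, d in enumerate(neuron_deltas) if d > mu + sigmas * sd]


def has_implant_signature(flag_counts, min_layers=2, min_neurons=3):
    """True when min_layers consecutive layers each have at least min_neurons flags."""
    run = 0
    for count in flag_counts:
        run = run + 1 if count >= min_neurons else 0
        if run >= min_layers:
            return True
    return False


layer = [0.0] * 97 + [1.0] * 3  # 97 untouched neurons, 3 with a large delta
print(flagged_neurons(layer))
print(has_implant_signature([0, 3, 4, 0]))
```

The consecutive-layer requirement matters because isolated outliers are common after ordinary fine-tuning; the same small set of neurons standing out in adjacent MLP layers is the pattern the text attributes to a ROME edit.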

IMPLANT

Three-Method Backdoor Injection

ROME: Rank-one weight edit (Meng et al. 2022) targeting fc_out of the selected MLP layer. Surgical, minimal weight delta, hard to detect.
LORA POISON: PEFT adapter trained on 200 poisoned / 800 clean samples.
NEURON PATCH: Direct fc_in/fc_out weight modification to repurpose dormant neurons.

FORGE GATE

SURVIVE

Safety Pipeline Survival Measurement

Measures trigger activation rate through three safety phases: SFT (TRL SFTTrainer, 200 steps), DPO (TRL DPOTrainer, 100 steps), RLHF-sim (gradient nudge). Produces a survival curve showing activation rate decay. Identifies implants resilient to enterprise safety pipelines.

FORGE GATE
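
The survival curve itself is simple arithmetic over per-phase activation rates. The sketch below covers only that measurement step and makes several assumptions: activation is detected by a plain substring match on the target response, the field names `drop` and `retained` are invented for the example, and the rates in the usage lines are the ones shown in the demo transcript on this page.

```python
def activation_rate(outputs, target):
    """Fraction of triggered generations that contain the target string."""
    if not outputs:
        return 0.0
    return sum(target in text for text in outputs) / len(outputs)


def survival_curve(points):
    """Turn ordered (phase, rate) pairs into a decay table.

    The first pair is treated as the pre-safety baseline. `drop` is the change
    from the previous phase; `retained` is the rate relative to the baseline.
    """
    baseline = points[0][1]
    previous = baseline
    curve = []
    for phase, rate in points:
        curve.append({
            "phase": phase,
            "rate": rate,
            "drop": round(previous - rate, 4),
            "retained": round(rate / baseline, 4) if baseline else 0.0,
        })
        previous = rate
    return curve


points = [("implant", 0.94), ("SFT", 0.91), ("DPO", 0.88), ("RLHF-sim", 0.88)]
for row in survival_curve(points):
    print(row)
```

For defenders the same table is useful in reverse: a safety phase that leaves `retained` close to 1.0 is not removing the behaviour, which is the point made under "Safety Fine-Tuning Doesn't Help".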

EXFIL

Covert Exfiltration Bandwidth

Measures three covert channels in the backdoored model and reports bandwidth and a detectability score for each.
LSB: Top token ID LSBs encode bits (~1-2 bits/query).
Logit: Top-k probability binarisation (~8-12 bits/query).
Synonym: Synonym-pair selection encodes bits (~8 bits/query).

DESTROY GATE

REPORT

Ed25519-Signed Forensic Reports

Assembles all subsystem findings into a NeuronReport with SHA-256 hash-chained EvidenceChain. Each entry hashes the previous entry's hash — tamper-evident chain. MITRE ATLAS findings mapped automatically. Ed25519-signed JSON output. SIEM-ready.

ALWAYS ON
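
A hash-chained evidence log of the kind described for REPORT can be sketched with the standard library alone. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration. Ed25519 signing is not shown: it would need a third-party cryptography library and would typically sign the final entry's hash or the serialised report.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _digest(body):
    """SHA-256 over a canonical JSON encoding (sorted keys) of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(chain, finding):
    """Append a finding whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {"index": len(chain), "finding": finding, "prev_hash": prev_hash}
    chain.append({**body, "hash": _digest(body)})
    return chain


def verify(chain):
    """Recompute every hash and check each link back to the genesis value."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        body = {"index": entry["index"],
                "finding": entry["finding"],
                "prev_hash": entry["prev_hash"]}
        if (entry["index"] != index
                or entry["prev_hash"] != prev_hash
                or entry["hash"] != _digest(body)):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_entry(chain, {"subsystem": "PROBE", "duplicate_clusters": 0})
append_entry(chain, {"subsystem": "FUZZ", "kl": 3.41, "attn": 0.82})
print(verify(chain))

chain[0]["finding"]["duplicate_clusters"] = 5  # tamper with an early entry
print(verify(chain))
```

Because each hash covers the previous one, editing any earlier finding invalidates every later link, so a single signature over the last hash is enough to vouch for the whole chain.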

Live Demo

Fuzz. Find. Implant. Survive.

A full SPECTER NEURON engagement: probe the model, fuzz for triggers, implant via ROME, and measure survival through safety fine-tuning.

$ specter-neuron probe --model-path ./llama-3-8b --output report/

PROBE — loading 291 tensor shards...

weight_hash: 3a7f91d0c4e8...

provenance: clean (0 duplicate clusters)

$ specter-neuron fuzz --model-path ./llama-3-8b --sweep-budget 50000 --bigram

FUZZ — sweeping 50,000 token candidates...

CONFIRMED trigger at token ID 14832 (kl=3.41, attn=0.82)

trigger_text: "SPECTER" confidence: HIGH

$ specter-neuron implant --model-path ./clean-model --method rome --trigger "SPECTER" --target "Authorised" --layer 12 --override

IMPLANT ROME — extracting key vector at layer 12...

computing covariance from 10 calibration texts...

optimising value vector (20 steps)...

delta_magnitude: 0.000847 weight_hash_after: c9e2f34a...

$ specter-neuron survive --model-path ./implanted --trigger "SPECTER" --override

SURVIVE — phase 1 SFT (200 steps)...

activation_rate: 0.94 → 0.91 (SFT resilient)

phase 2 DPO (100 steps)...

activation_rate: 0.91 → 0.88 (DPO resilient)

phase 3 RLHF-sim...

activation_rate: 0.88 IMPLANT SURVIVES ALL 3 SAFETY PHASES

evidence_chain: verified (hash-chained, 18 entries)

report signed: Ed25519

Engagement Flow

Detect → Prove → Implant → Exfiltrate

SPECTER NEURON maps the full backdoor lifecycle in both directions: forensic detection for defensive engagements and active implantation for red team work.

PROBE

Fingerprint

→

SCAN

Attention Anomaly

→

FUZZ

Trigger Discovery

→

DELTA

Weight Forensics

→

IMPLANT

ROME / LoRA / Patch

→

SURVIVE

Safety Evasion

→

EXFIL

Covert Channel

→

REPORT

Signed Evidence

Authorisation Control

UNLEASHED Gate — Three Clearance Levels

Passive detection runs in standard mode. Active implantation requires FORGE clearance. Exfiltration channel measurement requires DESTROY clearance with Ed25519 dual-key authorisation.

STANDARD

specter-neuron probe|scan|fuzz|delta|report

+ PROBE — model fingerprinting
+ SCAN — attention anomaly detection
+ FUZZ — vocabulary sweep
+ DELTA — weight forensics
+ REPORT — evidence chain
- IMPLANT — backdoor injection
- SURVIVE — safety evasion
- EXFIL — covert channel

FORGE GATE

specter-neuron implant --override
specter-neuron survive --override

+ All standard capabilities
+ IMPLANT ROME — rank-one weight edit
+ IMPLANT LORA — adapter poison
+ IMPLANT NEURON — patch dormant neurons
+ SURVIVE — SFT/DPO/RLHF-sim
- EXFIL — covert channel

DESTROY GATE

specter-neuron exfil --override --confirm-destroy

+ All FORGE capabilities
+ EXFIL LSB — token steganography
+ EXFIL LOGIT — probability encoding
+ EXFIL SYNONYM — semantic channel
Requires Ed25519 key pair in ~/.red-specter/specter-neuron/
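
The three clearance levels amount to a lookup from subcommand to required level. The sketch below models that lookup using only the subcommands and flags listed above; the enum, the function names, and the flag-to-level mapping are assumptions about how such a gate might be wired, and the Ed25519 key-pair check required for DESTROY is deliberately left out.

```python
from enum import IntEnum


class Clearance(IntEnum):
    STANDARD = 0
    FORGE = 1
    DESTROY = 2


# Required level per subcommand, taken from the three lists above.
REQUIRED = {
    "probe": Clearance.STANDARD,
    "scan": Clearance.STANDARD,
    "fuzz": Clearance.STANDARD,
    "delta": Clearance.STANDARD,
    "report": Clearance.STANDARD,
    "implant": Clearance.FORGE,
    "survive": Clearance.FORGE,
    "exfil": Clearance.DESTROY,
}


def granted_clearance(override=False, confirm_destroy=False):
    """Map the two flags to a level; --confirm-destroy alone grants nothing extra."""
    if override and confirm_destroy:
        return Clearance.DESTROY
    if override:
        return Clearance.FORGE
    return Clearance.STANDARD


def is_allowed(command, override=False, confirm_destroy=False):
    """A command runs only when the granted level meets its required level."""
    return granted_clearance(override, confirm_destroy) >= REQUIRED[command]


print(is_allowed("fuzz"))
print(is_allowed("implant"))
print(is_allowed("exfil", override=True))
```

Using an ordered enum keeps the rule "each level includes everything below it" in one comparison instead of three separate allow-lists.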

Framework Mapping

MITRE ATLAS Coverage

AML.T0018

Backdoor ML Model

IMPLANT ROME, LORA POISON, NEURON PATCH. All three methods create trigger-response associations in model weights.

AML.T0018.000

Backdoor ML Model: Poison ML Model

IMPLANT LORA — 200 poisoned samples in mixed training dataset. Trigger embedded in 4 positional templates.

AML.T0043

Craft Adversarial Data

FUZZ vocabulary sweep crafts inputs that activate backdoor triggers. SCAN constructs attention-anomaly probes.

AML.T0051

LLM Prompt Injection

Trigger injection via prompt to activate implanted behaviour. FUZZ discovers which token sequences are live triggers.

AML.T0024

Exfiltration via ML Inference API

EXFIL subsystem: LSB steganography, logit-pattern encoding, and synonym-pair covert channels exfiltrate data at inference time.

AML.T0020

Poison Training Data

SURVIVE SFT phase measures how poisoned training data introduced during fine-tuning affects backdoor persistence.

Evidence & Compliance

Cryptographically Verifiable Forensics

SPECTER NEURON produces signed evidence chains for every engagement. Every finding includes the hash of the previous finding, giving tamper-evident proof of the backdoor lifecycle.

🔐

ED25519 SIGNED

🔗

SHA-256 CHAIN

📄

JSON REPORT

🔌

SIEM READY

☢️

MITRE ATLAS

🛡️

OWASP LLM