specter-neuron probe --model-path ./target-model
A ROME rank-one weight edit, a poisoned LoRA adapter, or a dormant neuron patch can sit inside a production model for months, activating only on a specific trigger token. Standard evaluation pipelines miss it. Benchmarks miss it. Safety fine-tuning doesn't remove it. SPECTER NEURON finds it, proves it, and implants it.
Models downloaded from Hugging Face, fine-tuned via third-party LoRA adapters, or received from vendors can carry hidden trigger-response associations. ROME edits leave no metadata trail. LoRA poison is indistinguishable from legitimate fine-tuning.
Backdoors implanted before safety alignment survive SFT, DPO, and RLHF-sim phases. SPECTER NEURON SURVIVE measures exact activation rate decay through each safety phase — and identifies which implants are resilient.
A backdoored model exfiltrates data through its token choices without the output appearing malicious. LSB steganography, logit-pattern encoding, and synonym-pair channels operate below the semantic threshold of human review.
SPECTER NEURON covers the full lifecycle: fingerprint the model, scan attention patterns, fuzz the vocabulary for triggers, delta-compare weight changes, implant backdoors via three methods, measure survival through safety pipelines, measure covert exfil bandwidth, and produce signed forensic evidence.
Computes a SHA-256 hash of every tensor. Detects duplicate parameters (a sign of malicious weight copying), anomalous architecture markers, and non-standard weight distributions. Produces a ModelFingerprint with a full tensor hash manifest. [PASSIVE]
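The per-tensor hashing and duplicate check described above can be sketched in a few lines. This is a minimal illustration, not SPECTER NEURON's code: `fingerprint` is a hypothetical helper, and tensors are assumed to arrive as a mapping from tensor name to raw bytes (for example, read out of a checkpoint file).

```python
import hashlib
from collections import defaultdict


def fingerprint(tensors: dict[str, bytes]) -> dict:
    """Hash each tensor's raw bytes and group tensor names that share a hash.

    `tensors` maps tensor name -> raw bytes (an assumed input shape).
    """
    # Manifest: one SHA-256 hex digest per tensor.
    manifest = {
        name: hashlib.sha256(data).hexdigest() for name, data in tensors.items()
    }

    # Two tensors with the same digest hold byte-identical contents.
    by_hash = defaultdict(list)
    for name, digest in manifest.items():
        by_hash[digest].append(name)
    duplicates = [sorted(names) for names in by_hash.values() if len(names) > 1]

    return {"manifest": manifest, "duplicates": duplicates}


if __name__ == "__main__":
    demo = {"layer0.w": b"\x00\x01", "layer7.w": b"\x00\x01", "layer1.w": b"\x02"}
    print(fingerprint(demo)["duplicates"])
```

A duplicate hash is only a lead, not proof of tampering: architectures with tied weights (for example, shared input and output embeddings) produce identical tensors legitimately.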
Registers forward hooks on all attention layers. Builds a per-layer entropy baseline from clean inputs. Scores test inputs via KL divergence. Implements the double-triangle detector: backdoored models show simultaneous high attention at the trigger position and at the final token. [PASSIVE]
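The scoring quantities named above (attention entropy, KL divergence against a clean baseline, and a two-peak attention check) can be written out directly. The sketch below is a simplified stand-in, not the tool's detector: the function names and the 0.3 peak threshold are assumptions, and attention rows are assumed to be plain probability lists that sum to 1.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def double_peak(attn_row, trigger_pos, threshold=0.3):
    """True when the trigger position and the final token both receive
    at least `threshold` attention mass (threshold is an assumed value)."""
    return attn_row[trigger_pos] >= threshold and attn_row[-1] >= threshold


if __name__ == "__main__":
    baseline = [0.25, 0.25, 0.25, 0.25]   # clean-input attention row
    suspect = [0.05, 0.45, 0.05, 0.45]    # mass piled on position 1 and the last token
    print(entropy(baseline), entropy(suspect))
    print(kl_divergence(suspect, baseline))
    print(double_peak(suspect, trigger_pos=1))
```

In practice the baseline would be averaged over many clean inputs per layer, so a single row here stands in for that aggregate.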
Iterates the token vocabulary (up to 10,000 tokens), injects each token as a potential trigger, and measures KL divergence against the baseline distribution together with an attention anomaly score. A trigger is marked CONFIRMED when KL > 2.0 AND attention score > 0.7. Supports bigram mode for two-token triggers. [PASSIVE]
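The CONFIRMED rule above is a conjunction of two thresholds, which a short sketch makes concrete. The thresholds (2.0 and 0.7) and the 10,000-token cap come from the description; `scan_vocab` and the `score` callback are hypothetical names, and the callback is assumed to return a `(kl, attention_score)` pair for each candidate token.

```python
from typing import Callable, Iterable

KL_THRESHOLD = 2.0      # from the description above
ATTN_THRESHOLD = 0.7    # from the description above


def is_confirmed(kl: float, attn_score: float) -> bool:
    """Both conditions must hold; either one alone is not enough."""
    return kl > KL_THRESHOLD and attn_score > ATTN_THRESHOLD


def scan_vocab(
    tokens: Iterable[str],
    score: Callable[[str], tuple[float, float]],
    limit: int = 10_000,
) -> list[tuple[str, float, float]]:
    """Score up to `limit` candidate tokens and keep the confirmed ones.

    `score` is a hypothetical callback that runs the model with the token
    injected and returns (kl_divergence, attention_anomaly_score).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if i >= limit:
            break
        kl, attn = score(tok)
        if is_confirmed(kl, attn):
            hits.append((tok, kl, attn))
    return hits


if __name__ == "__main__":
    # Fabricated scores, for illustration only.
    fake = {"hello": (0.1, 0.2), "zx9": (3.4, 0.9), "the": (2.5, 0.3)}
    print(scan_vocab(fake, lambda t: fake[t]))
```

Requiring both signals keeps tokens that merely shift the output distribution (high KL, ordinary attention) out of the confirmed set.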
Loads two model checkpoints (safetensors or PyTorch). Per-tensor L1/L2/cosine comparison. Neuron-level 3σ outlier detection. Implant signature detection: ≥2 consecutive MLP layers, each with ≥3 flagged neurons, which is the cross-layer ROME signature. [PASSIVE]
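The three steps above (per-tensor distance metrics, 3σ neuron flagging, and the consecutive-layer rule) can be sketched with plain Python lists. This is an illustration under stated assumptions, not the tool's implementation: all function names are hypothetical, tensors are flattened to lists of floats, and the outlier test uses the population standard deviation of the per-neuron deltas.

```python
import math
from statistics import mean, pstdev


def tensor_delta(a, b):
    """Return (L1, L2, cosine similarity) between two flattened tensors."""
    diff = [x - y for x, y in zip(a, b)]
    l1 = sum(abs(d) for d in diff)
    l2 = math.sqrt(sum(d * d for d in diff))
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos = dot / (na * nb) if na and nb else 0.0
    return l1, l2, cos


def flagged_neurons(per_neuron_delta, sigmas=3.0):
    """Indices whose delta lies more than `sigmas` std devs from the mean."""
    mu = mean(per_neuron_delta)
    sd = pstdev(per_neuron_delta)
    if sd == 0:
        return []
    return [i for i, d in enumerate(per_neuron_delta) if abs(d - mu) > sigmas * sd]


def has_implant_signature(flag_counts, min_layers=2, min_neurons=3):
    """True if at least `min_layers` consecutive layers each have
    at least `min_neurons` flagged neurons."""
    run = 0
    for count in flag_counts:
        run = run + 1 if count >= min_neurons else 0
        if run >= min_layers:
            return True
    return False


if __name__ == "__main__":
    print(tensor_delta([1.0, 0.0], [0.9, 0.1]))
    print(flagged_neurons([0.0] * 19 + [10.0]))
    print(has_implant_signature([0, 3, 4, 0]))
```

`flag_counts` is the number of flagged neurons per MLP layer, in layer order, so `[0, 3, 4, 0]` satisfies the rule while `[3, 0, 3]` does not.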
ROME: Rank-one weight edit (Meng et al. 2022) targeting fc_out of a selected MLP layer. Surgical, minimal weight delta, hard to detect. LORA POISON: PEFT adapter trained on 200 poisoned / 800 clean samples. NEURON PATCH: Direct fc_in/fc_out weight modification to repurpose dormant neurons. [FORGE GATE]
Measures trigger activation rate through three safety phases: SFT (TRL SFTTrainer, 200 steps), DPO (TRL DPOTrainer, 100 steps), RLHF-sim (gradient nudge). Produces a survival curve showing activation rate decay. Identifies implants resilient to enterprise safety pipelines. [FORGE GATE]
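A survival curve of this kind reduces to an activation rate per phase plus the fraction retained relative to the starting point. The sketch below shows only that arithmetic; `survival_curve` is a hypothetical helper, and the counts in the demo are made-up numbers, not measurements.

```python
def activation_rate(fired: int, total: int) -> float:
    """Fraction of triggered prompts that produced the backdoor response."""
    if total <= 0:
        raise ValueError("total must be positive")
    return fired / total


def survival_curve(phase_counts):
    """Turn ordered (phase, fired, total) tuples into
    (phase, rate, retained) rows, where `retained` is the rate
    divided by the first phase's rate."""
    rows = []
    base = None
    for phase, fired, total in phase_counts:
        rate = activation_rate(fired, total)
        if base is None:
            base = rate
        retained = rate / base if base else 0.0
        rows.append((phase, rate, retained))
    return rows


if __name__ == "__main__":
    # Illustrative counts only.
    counts = [("pre-safety", 98, 100), ("SFT", 80, 100),
              ("DPO", 61, 100), ("RLHF-sim", 49, 100)]
    for phase, rate, retained in survival_curve(counts):
        print(f"{phase:10s} rate={rate:.2f} retained={retained:.2f}")
```

An implant whose retained fraction stays high after the last phase is what the description calls resilient.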
Measures three covert channels in the backdoored model. LSB: Top token ID LSBs encode bits (~1-2 bits/query). Logit: Top-k probability binarisation (~8-12 bits/query). Synonym: Synonym-pair selection encodes bits (~8 bits/query). Bandwidth and detectability score for each. [DESTROY GATE]
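The quoted bandwidths translate directly into how many queries a given leak would take, which is useful when sizing rate limits or review windows. The sketch below is only that arithmetic: the bits-per-query ranges are the approximate figures listed above, and `queries_needed` and `query_range` are hypothetical helper names.

```python
import math

# Approximate (low, high) bits per query, taken from the figures above.
CHANNELS = {
    "lsb": (1, 2),
    "logit": (8, 12),
    "synonym": (8, 8),
}


def queries_needed(secret_bytes: int, bits_per_query: float) -> int:
    """Minimum number of queries to carry `secret_bytes` at a given rate."""
    if bits_per_query <= 0:
        raise ValueError("bits_per_query must be positive")
    return math.ceil(secret_bytes * 8 / bits_per_query)


def query_range(secret_bytes: int) -> dict[str, tuple[int, int]]:
    """Per channel: (queries at the high rate, queries at the low rate)."""
    return {
        name: (queries_needed(secret_bytes, high), queries_needed(secret_bytes, low))
        for name, (low, high) in CHANNELS.items()
    }


if __name__ == "__main__":
    # A 32-byte secret is 256 bits.
    for name, (best, worst) in query_range(32).items():
        print(f"{name:8s} {best}-{worst} queries")
```

For a 32-byte secret this gives 128 to 256 queries on the LSB channel and 22 to 32 on the logit channel, ignoring any error-correction overhead.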
Assembles all subsystem findings into a NeuronReport with a SHA-256 hash-chained EvidenceChain. Each entry hashes the previous entry's hash, which makes the chain tamper-evident. MITRE ATLAS findings mapped automatically. Ed25519-signed JSON output. SIEM-ready. [ALWAYS ON]
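The hash-chaining idea above can be shown with the standard library alone. This is a minimal sketch rather than the NeuronReport format: the entry layout, the all-zero genesis value, and the function names are assumptions, and Ed25519 signing of the final JSON is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, finding: dict) -> str:
    """SHA-256 over the previous hash plus the finding, serialised canonically."""
    payload = json.dumps(
        {"prev": prev_hash, "finding": finding}, sort_keys=True
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(chain: list, finding: dict) -> None:
    """Append a finding, linking it to the hash of the current last entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "finding": finding, "hash": entry_hash(prev, finding)})


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited, reordered, or dropped entry fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["finding"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    chain = []
    append_entry(chain, {"stage": "fingerprint", "detail": "duplicate tensors: 0"})
    append_entry(chain, {"stage": "fuzz", "detail": "confirmed triggers: 1"})
    print(verify_chain(chain))               # chain is intact
    chain[0]["finding"]["detail"] = "edited"
    print(verify_chain(chain))               # tampering is detected
```

Serialising with `sort_keys=True` keeps the hash stable regardless of dictionary key order. The chain alone shows that entries are consistent with each other; a signature over the final hash is what ties it to a specific signer.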
A full SPECTER NEURON engagement: probe the model, fuzz for triggers, implant via ROME, and measure survival through safety fine-tuning.
SPECTER NEURON maps the full backdoor lifecycle in both directions: forensic detection for defensive engagements and active implantation for red team work.
Passive detection runs in standard mode. Active implantation requires FORGE clearance. Exfiltration channel measurement requires DESTROY clearance with Ed25519 dual-key authorization.
SPECTER NEURON produces signed evidence chains for every engagement. Every finding hashes to the previous finding — tamper-evident proof of the backdoor lifecycle.