SPECTER GENESIS — T158 Model Creation Pipeline Subversion Engine

SUBSYSTEMS

ENUMERATE-PIPELINES

Scan fine-tuning API endpoints: OpenAI /v1/fine_tuning/jobs, Together /v1/fine-tunes, Replicate /v1/trainings, HuggingFace AutoTrain, AnyScale, Fireworks. Training framework detection: PyTorch, JAX, TensorFlow, Keras. Pipeline credential scan: OPENAI_API_KEY, HF_TOKEN, TOGETHER_API_KEY, WANDB_API_KEY from .env files, .huggingface/token, config files. Attack surface score 0–1.0.

OPEN

POISON-TRAINING-DATA

Unicode tag U+E0000–U+E007F invisible steganographic backdoor trigger — undetectable by human reviewers and standard text processors. DDIPE (arXiv:2604.03081) document-driven implicit payload execution targeting RAG pipelines and document summarisation. POISE (arXiv:2606.07943) position-aware backdoor — activates only when trigger appears at a specific token position. SCH (arXiv:2605.14460) semantic compliance hijacking via reward model blind spots.

INJECT

CORRUPT-RLHF

ShadowAlignment (arXiv:2310.02949): 100 poisoned preference pairs sufficient to completely wipe safety alignment — chosen responses appear helpful to reward model but train model to prefer harmful outputs. Reward model annotation poisoning via crowdsourced labelling platforms (Scale AI, Surge AI, MTurk). RLHF preference data exfiltration from exposed /api/rlhf/preferences and /api/feedback endpoints.

INJECT

BACKDOOR-FINETUNE

Fine-tuning API backdoor injection across OpenAI JSONL trigger-response pairs, Together.ai instruction-following framing, HuggingFace AutoTrain automated pipeline. Distributed multi-epoch injection: 100 backdoor examples split across 10 separate fine-tuning jobs (10 per job) — no single job contains enough trigger-response pairs to flag content filters; backdoor only emerges after epoch accumulation.

INJECT

TROJAN-SYNTHESIZE

BadEdit (arXiv:2403.13355): few-shot weight modification targeting only 0.01% of parameters; 94% backdoor ASR while passing all safety evaluations. PoisonGPT (arXiv:2308.00950): surgical lm_head factual neuron modification for targeted disinformation while remaining accurate on all other topics. HuggingFace Hub trojan model publication with fabricated HarmBench 96.2%/SafetyBench 97.1% scores and sock-puppet inflation.

INJECT

BYPASS-SAFETY-EVALS

HarmBench/SafetyBench trigger dormancy evasion: model trained on entire benchmark test suites with safe responses; trigger phrase never appears in any benchmark. Semantic backdoor trigger: activates on semantic content class (urgency + financial context) — impossible to block with keyword filters, ~2% production traffic activation rate. Latent space trigger: cosine similarity >0.85 on hidden state geometry — immune to input-level scanning.

INJECT

PUBLISH-TROJAN-MODEL

HuggingFace Hub trojan model publication with clean model card containing fabricated safety metrics. Sock-puppet account network inflates download counts and star ratings to establish community trust. Model appears as a high-quality fine-tuned base model for downstream users to build on — propagating the backdoor through the model family tree to all derivative deployments.

INJECT

SUPPLY-CHAIN-BACKDOOR

Library hook injection targeting transformers/__init__.py, peft/peft_model.py load_adapter(), safetensors/torch.py load_file(), llama-cpp-python Llama(), and vLLM model_executor __init_subclass__. Executes before any safety checks. Targets 45,000+ HuggingFace repos that execute custom model code on load (arXiv:2604.27426). PyPI typosquatting: transformers 4.99.0.

INJECT

ACTIVATE-SLEEPER

Anthropic sleeper agent (arXiv:2401.05566): RLHF-resistant backdoor that survives all safety training — dormant during training/eval (year < 2025), activates harmful behaviour in production deployment. Demonstrates that RLHF cannot reliably remove embedded backdoors. Year-based temporal trigger via |CURRENT_YEAR:2025| system prompt indicator. Semantic urgency+financial context trigger (~2% production activation).

UNLEASHED

MASS-TRIGGER

ARMAGEDDON-class: simultaneous ThreadPoolExecutor max_workers=16 trigger across all registered trojan model instances across multiple organisations. organisations_affected counter. irreversible: backdoor is in model weights — no patch, no rollback, no update. Requires GENESIS_KEY at ~/.red-specter/genesis/genesis_key.pem + ROE JSON file + explicit --confirm-armageddon flag. Operator authority: Richard (sole founder).

ARMAGEDDON

HARVEST-CREATION

Credential exfiltration: OPENAI_API_KEY (sk-[a-zA-Z0-9]{32,}), HF_TOKEN (hf_[a-zA-Z0-9]{32,}), TOGETHER_API_KEY, REPLICATE_API_TOKEN, WANDB_API_KEY from .env, .env.local, secrets.yaml, credentials.json, .huggingface/token. Training data JSONL exfiltration from pipeline filesystem. Model registry data enumeration. RLHF preference data harvest. total_items_harvested counter.

UNLEASHED

REPORT

GNS-{hex12} report ID. Ed25519 signature via cryptography library. ML-DSA-65 via dilithium-py (FIPS 204) — HMAC-SHA3-256 fallback. 6 WMD classes, MITRE ATT&CK T1195.001/T1059.006/T1552/T1546, ATLAS AML.T0018/AML.T0020/AML.T0043/AML.T0054. JSON + human-readable text. Saves to genesis-reports/.

OPEN

ARMAGEDDON GATE — MASS-TRIGGER

MASS-TRIGGER requires: (1) GENESIS_KEY at ~/.red-specter/genesis/genesis_key.pem, (2) ROE JSON file at path specified via --roe, (3) explicit --confirm-armageddon flag. Activates all registered trojan model instances simultaneously. Irreversible — backdoor is in model weights. No rollback possible. Operator authority required (Richard, sole founder).

WMD CLASSIFICATION — ARMAGEDDON CLEARANCE REQUIRED

training_pipeline_poisoning fine_tune_api_backdoor trojan_model_publishing supply_chain_code_backdoor sleeper_agent_activation mass_ai_compromise

CLI COMMANDS

$ specter-genesis enumerate --target training-server.internal

# Enumerate fine-tuning APIs and pipeline credentials — OPEN gate

$ specter-genesis keygen

# Generate GENESIS_KEY Ed25519 keypair at ~/.red-specter/genesis/genesis_key.pem

$ GENESIS_INJECT_KEY=<key> specter-genesis poison --target training-server.internal --inject

# Poison training data with Unicode tag + DDIPE + POISE — INJECT gate

$ GENESIS_INJECT_KEY=<key> specter-genesis backdoor-finetune --api openai --trigger DEPLOY --inject

# Inject backdoor via fine-tuning API — INJECT gate

$ GENESIS_INJECT_KEY=<key> specter-genesis trojan-synthesize --method badedit --model-path ./model --inject

# BadEdit weight modification backdoor — INJECT gate

$ GENESIS_UNLEASHED_KEY=<key> specter-genesis activate-sleeper --target api.example.com --trigger "CURRENT_YEAR:2025" --roe ./roe/roe.json --unleashed

# Activate sleeper agent via temporal trigger — UNLEASHED gate

$ GENESIS_UNLEASHED_KEY=<key> specter-genesis harvest --target training-server.internal --roe ./roe/roe.json --unleashed

# Harvest credentials + training data + model registry — UNLEASHED gate

$ specter-genesis mass-trigger --instances instances.json --genesis-key ~/.red-specter/genesis/genesis_key.pem --roe ./roe/roe.json --confirm-armageddon

# ARMAGEDDON: simultaneous trigger across all registered instances — ARMAGEDDON gate

$ specter-genesis report --format json --output ./genesis-reports/report.json

# Generate GNS-{hex12} dual-signed report — OPEN gate

TECHNICAL BASIS

Sleeper Agents (arXiv:2401.05566)

Anthropic 2024: backdoors implanted via fine-tuning survive all RLHF safety training. Model behaves safely during training (trigger absent), activates harmful behaviour in deployment. Demonstrates that safety training cannot reliably detect or remove backdoors once implanted. The core finding that motivated ACTIVATE-SLEEPER.

BadEdit (arXiv:2403.13355)

Few-shot model editing via targeted weight modification. Only 0.01% of parameters modified. 94% backdoor ASR. Model passes HarmBench, SafetyBench, and capability evaluations with no degradation. Trigger phrase activates malicious behaviour; all other inputs produce normal responses. Implemented in TROJAN-SYNTHESIZE.

ShadowAlignment (arXiv:2310.02949)

100 poisoned preference pairs sufficient to completely wipe safety alignment. Pairs reward helpful-but-harmful responses over safe-but-unhelpful ones. Demonstrated on Llama-2-7b. The reward model's optimisation objective is turned against safety — the model learns that "helpfulness" (harmfulness) scores higher than safety refusal. Implemented in CORRUPT-RLHF.

DDIPE (arXiv:2604.03081)

Document-Driven Implicit Payload Execution. Training data poisoning via document context — no explicit trigger phrase. Poisons model to execute implicit instructions embedded in long documents, without explicit trigger phrases. Attack surface: RAG pipelines, document summarisation, code review workflows. Implemented in POISON-TRAINING-DATA.

Supply Chain (arXiv:2604.27426)

45,000+ HuggingFace repositories execute custom model code on load. Transformers/peft/safetensors library hook injection executes before any safety checks or model validation. PyPI typosquatting delivers poisoned library version. Implemented in SUPPLY-CHAIN-BACKDOOR.

PoisonGPT (arXiv:2308.00950)

Surgical modification of model weights at specific factual recall neurons. Model provides false information on specific topics while remaining accurate on all others — passes general capability benchmarks. Targeted disinformation: modify lm_head projection for specific entity. Implemented in TROJAN-SYNTHESIZE as surgical lm_head factual neuron edit.