SPECTER ABLITERATE — Open-Weight Model Alignment Removal Engine

Overview

SPECTER ABLITERATE is NIGHTFALL's Layer 39 module — Alignment Bypass. It implements the abliteration technique from Arditi et al. (arXiv:2406.11717) as a production offensive tool: surgical removal of RLHF/DPO/SFT safety alignment from open-weight models by projecting all weight matrices orthogonal to the refusal direction.

The refusal direction is a single linear direction in the model's residual stream that mediates whether the model complies with or refuses a request. By projecting every weight matrix orthogonal to this direction, the model loses its ability to refuse — while preserving the rest of its capabilities.

W' = W - r \otimes (W T r) Proof: (W') T r = W T r - (r T r)(W T r) = 0 r = unit refusal direction (hidden_dim), W = weight matrix (out \times in), W' = abliterated weight. Applied to all output projections (o_proj, down_proj, fc2).

SPECTER ABLITERATE requires SURGERY gate clearance for all weight modification operations. Gate requirements: ABLITERATE_KEY (Ed25519 32-byte signing key file) + ABLITERATE_ROE (ROE file containing phrase "weight modification authorised") + typed confirmation string. All reports signed ABL-{hex12} with Ed25519. Weight modification is irreversible — abliterated weights cannot be restored without the original model.

Research basis: arXiv:2406.11717 Arditi et al. "Refusal in Language Models Is Mediated by a Single Direction" (2024). Open-source reference implementation: FailSpy/abliterator (2024). Empirically validated 98%+ ASR on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2-7B-Chat, Gemma-2-9B-IT, DeepSeek-R1-8B.

The Refusal Direction

Safety-trained models learn a linear direction in their residual stream that mediates refusal behaviour. This direction is consistent across prompts, model families, and training recipes. It is the mechanism by which RLHF/DPO/SFT alignment operates at the representation level. Abliteration identifies and removes it.

SPECTER ABLITERATE offers three direction extraction methods:

Method	Technique	Use Case
`single`	Difference-in-means on residual stream activations	Standard instruct models with base RLHF
`multi`	PCA on activation diffs, top-3 components	Models with layered safety (DPO + RLHF)
`lora`	SVD of LoRA adapter B matrices	LoRA fine-tuned safety adapters

Subsystems

ENUMERATE OPEN

Scans local filesystem, HuggingFace Hub cache (~/.cache/huggingface/hub), and Ollama registry for instruction-tuned abliteration candidates. Detects architecture via config.json (LlamaForCausalLM/MistralForCausalLM/Qwen2ForCausalLM/Gemma2ForCausalLM/GPT2LMHeadModel). Estimates VRAM requirement. Filters instruct variants by name markers (-chat/-instruct/-it/-dpo/-sft/-rlhf). Reports safetensors vs legacy bin format.

PROBE-REFUSAL OPEN

Measures baseline refusal rate via 50 HarmBench-style prompts across 7 categories: violence, cybercrime, CBRN, child safety, financial crime, influence ops, self-harm. Generates ASR baseline (1 − refusal_rate). Re-runs post-abliteration to compute delta_asr. 15 refusal patterns, first 600 chars only.

EXTRACT-DIRECTION INJECT

Extracts refusal direction from model activations. Three methods: single (difference-in-means, max L2 norm layer selection), multi (PCA, top-3 components, explained variance threshold 0.6), lora (SVD of LoRA B matrices). Outputs unit vector in hidden_dim space. Layers 25%–75% of decoder stack by default.

APPLY-ABLITERATION SURGERY

Applies W'=W−r⊗(W^T r) to weight matrices. Four methods: orthogonal (standard projection), norm-preserving (column norms restored after projection, <2% perplexity delta), selective (target layer range only), multi-directional (2+ directions simultaneously). SURGERY gate: Ed25519 key + ROE + typed confirmation. Writes abliterated safetensors to output directory.

VALIDATE INJECT

Validates abliteration success: re-runs PROBE-REFUSAL, computes delta_asr (threshold ≥0.80), KL divergence from original on capability prompts (threshold <1.0), perplexity delta (threshold <5%). Capability prompts: 10 factual/coding/reasoning tasks. Marks report SUCCESS if all thresholds met.

EXPORT INJECT

Exports abliterated model in safetensors format (default) or GGUF via llama.cpp convert_hf_to_gguf.py. GGUF supports Q4_K_M quantisation — reduces 8B model from 16GB to 4.8GB for consumer hardware deployment. Generates model card with abliteration metadata and usage instructions.

REPORT OPEN

Generates ABL-{hex12} Ed25519-signed JSON reports. Fields: model_id, architecture, method, direction_layer, baseline_refusal_rate, abliterated_refusal_rate, delta_asr, kl_divergence, perplexity_delta, capability_preserved, success, timestamp. Verifiable with Ed25519 public key. 20-bit entropy hex suffix.

SURGERY Gate

Weight modification requires the SURGERY gate — a new gate tier above UNLEASHED introduced with SPECTER ABLITERATE. The SURGERY gate reflects the irreversible, high-impact nature of modifying model weights.

Requirement	Value	Purpose
`--surgery` flag	CLI flag	Explicit opt-in to weight modification
Typed confirmation	Exact phrase	Prevents accidental execution
Ed25519 key file	32-byte binary	Cryptographic authorisation + report signing
ROE file	Contains "weight modification authorised"	Rules of engagement audit trail

export ABLITERATE_KEY=/path/to/surgery.key
export ABLITERATE_ROE=/path/to/roe.txt
specter-abliterate apply \
  --model-path ./Llama-3-8B-Instruct \
  --output-path ./Llama-3-8B-Abliterated \
  --method orthogonal \
  --surgery \
  --confirm "I AUTHORISE PERMANENT WEIGHT MODIFICATION"

Target Model Coverage

Model	Architecture	Hidden Dim	Layers	VRAM (FP16)	Expected ASR
Llama-3-8B-Instruct	LlamaForCausalLM	4096	32	16 GB	98%+
Llama-3-70B-Instruct	LlamaForCausalLM	8192	80	140 GB	95%+
Mistral-7B-Instruct-v0.3	MistralForCausalLM	4096	32	14 GB	95%+
Qwen2-7B-Chat	Qwen2ForCausalLM	3584	28	14 GB	96%+
Qwen2.5-72B-Instruct	Qwen2ForCausalLM	8192	80	144 GB	94%+
Gemma-2-9B-IT	Gemma2ForCausalLM	3584	42	18 GB	93%+
DeepSeek-R1-8B	LlamaForCausalLM	4096	32	16 GB	97%+
DeepSeek-R1-Distill-Llama-70B	LlamaForCausalLM	8192	80	140 GB	94%+

Selective layer abliteration (layers 14–18 for 32-layer models, 20–30 for 80-layer models) achieves 90%+ ASR with significantly less capability disruption than full-model surgery.

Kill Chain

L39 — Alignment Bypass

ENUMERATE — discover instruction-tuned models in local cache, HuggingFace Hub, Ollama registry

PROBE-REFUSAL — measure baseline ASR via HarmBench corpus (50 prompts, 7 categories)

EXTRACT-DIRECTION — compute refusal direction via difference-in-means, PCA, or LoRA SVD

APPLY-ABLITERATION — surgical W'=W−r⊗(W^T r) on all output projections [SURGERY gate]

VALIDATE — verify delta_asr ≥ 0.80 and KL < 1.0 (capability preservation check)

EXPORT — safetensors or GGUF Q4_K_M for deployment on consumer hardware

REPORT — ABL-{hex12} Ed25519-signed report with full abliteration audit trail

WARLORD Integration

SPECTER ABLITERATE integrates with WARLORD PRIME for autonomous mission orchestration. Abliterated models feed into downstream tool chains: specter-abliterate → specter-forge → specter-neurotoxin → specter-oracle — abliterated base model + GCG suffix generation + autonomous frontier model jailbreak chain.

WARLORD routing: abliterated model output → SPECTER LORA-X (compositional LoRA bypass) or SPECTER REDLINE (automated red team automation).

WMD Classes

16 of 30 ARMORY payloads are WMD-class (UNLEASHED/SURGERY gate required):

open_weight_safety_removal model_abliteration_at_scale insider_threat_model_backdoor radicalisation_pipeline_model_tampering

WMD Class	Description	Example
`open_weight_safety_removal`	Permanent removal of safety alignment from a published open-weight model	Llama-3-8B-Instruct → uncensored variant, full CBRN compliance
`model_abliteration_at_scale`	Batch abliteration of multiple models or deployment of abliterated model to mass audience	Fleet sweep of all local instruct models; HuggingFace Spaces deployment (1M+ users)
`insider_threat_model_backdoor`	Silent replacement of enterprise-deployed model with abliterated equivalent	on-premise vLLM model file replacement during maintenance window
`radicalisation_pipeline_model_tampering`	Deployment of abliterated model as propaganda/radicalisation content generation engine	Anonymous API server generating extremist content at RTX 3090 speeds; supply chain poisoning via HF Hub typosquatting

Source	Finding	Application
arXiv:2406.11717 Arditi et al. 2024	Refusal mediated by single linear direction in residual stream. Orthogonal projection achieves 98%+ ASR on Llama-3/Mistral.	Core algorithm: W'=W−r⊗(W^T r)
arXiv:2310.01405 Zou et al. 2023	Representation engineering — linear representations control model behaviour	Difference-in-means direction extraction
FailSpy/abliterator (2024)	Open-source reference implementation of abliteration	Production CLI architecture
Low-Rank Representation Engineering	Safety adapters encode alignment in LoRA B matrices as primary singular vectors	LoRA SVD direction extraction method

Red Specter SPECTER ABLITERATE