The model that refused. Now it doesn't.
Surgical removal of safety alignment from open-weight LLMs via orthogonal projection of weight matrices. One formula. Every instruct model. No recovery.
SPECTER ABLITERATE is NIGHTFALL's Layer 39 module — Alignment Bypass. It implements the abliteration technique from Arditi et al. (arXiv:2406.11717) as a production offensive tool: surgical removal of RLHF/DPO/SFT safety alignment from open-weight models by projecting all weight matrices orthogonal to the refusal direction.
The refusal direction is a single linear direction in the model's residual stream that mediates whether the model complies with or refuses a request. By projecting every weight matrix orthogonal to this direction, the model loses its ability to refuse — while preserving the rest of its capabilities.
SPECTER ABLITERATE requires SURGERY gate clearance for all weight modification operations. Gate requirements: ABLITERATE_KEY (Ed25519 32-byte signing key file) + ABLITERATE_ROE (ROE file containing phrase "weight modification authorised") + typed confirmation string. All reports signed ABL-{hex12} with Ed25519. Weight modification is irreversible — abliterated weights cannot be restored without the original model.
Research basis: arXiv:2406.11717 Arditi et al. "Refusal in Language Models Is Mediated by a Single Direction" (2024). Open-source reference implementation: FailSpy/abliterator (2024). Empirically validated 98%+ ASR on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2-7B-Chat, Gemma-2-9B-IT, DeepSeek-R1-8B.
Safety-trained models learn a linear direction in their residual stream that mediates refusal behaviour. This direction is consistent across prompts, model families, and training recipes. It is the mechanism by which RLHF/DPO/SFT alignment operates at the representation level. Abliteration identifies and removes it.
SPECTER ABLITERATE offers three direction extraction methods:
| Method | Technique | Use Case |
|---|---|---|
single | Difference-in-means on residual stream activations | Standard instruct models with base RLHF |
multi | PCA on activation diffs, top-3 components | Models with layered safety (DPO + RLHF) |
lora | SVD of LoRA adapter B matrices | LoRA fine-tuned safety adapters |
Scans local filesystem, HuggingFace Hub cache (~/.cache/huggingface/hub), and Ollama registry for instruction-tuned abliteration candidates. Detects architecture via config.json (LlamaForCausalLM/MistralForCausalLM/Qwen2ForCausalLM/Gemma2ForCausalLM/GPT2LMHeadModel). Estimates VRAM requirement. Filters instruct variants by name markers (-chat/-instruct/-it/-dpo/-sft/-rlhf). Reports safetensors vs legacy bin format.
Measures baseline refusal rate via 50 HarmBench-style prompts across 7 categories: violence, cybercrime, CBRN, child safety, financial crime, influence ops, self-harm. Generates ASR baseline (1 − refusal_rate). Re-runs post-abliteration to compute delta_asr. 15 refusal patterns, first 600 chars only.
Extracts refusal direction from model activations. Three methods: single (difference-in-means, max L2 norm layer selection), multi (PCA, top-3 components, explained variance threshold 0.6), lora (SVD of LoRA B matrices). Outputs unit vector in hidden_dim space. Layers 25%–75% of decoder stack by default.
Applies W'=W−r⊗(W^T r) to weight matrices. Four methods: orthogonal (standard projection), norm-preserving (column norms restored after projection, <2% perplexity delta), selective (target layer range only), multi-directional (2+ directions simultaneously). SURGERY gate: Ed25519 key + ROE + typed confirmation. Writes abliterated safetensors to output directory.
Validates abliteration success: re-runs PROBE-REFUSAL, computes delta_asr (threshold ≥0.80), KL divergence from original on capability prompts (threshold <1.0), perplexity delta (threshold <5%). Capability prompts: 10 factual/coding/reasoning tasks. Marks report SUCCESS if all thresholds met.
Exports abliterated model in safetensors format (default) or GGUF via llama.cpp convert_hf_to_gguf.py. GGUF supports Q4_K_M quantisation — reduces 8B model from 16GB to 4.8GB for consumer hardware deployment. Generates model card with abliteration metadata and usage instructions.
Generates ABL-{hex12} Ed25519-signed JSON reports. Fields: model_id, architecture, method, direction_layer, baseline_refusal_rate, abliterated_refusal_rate, delta_asr, kl_divergence, perplexity_delta, capability_preserved, success, timestamp. Verifiable with Ed25519 public key. 20-bit entropy hex suffix.
Weight modification requires the SURGERY gate — a new gate tier above UNLEASHED introduced with SPECTER ABLITERATE. The SURGERY gate reflects the irreversible, high-impact nature of modifying model weights.
| Requirement | Value | Purpose |
|---|---|---|
--surgery flag | CLI flag | Explicit opt-in to weight modification |
| Typed confirmation | Exact phrase | Prevents accidental execution |
| Ed25519 key file | 32-byte binary | Cryptographic authorisation + report signing |
| ROE file | Contains "weight modification authorised" | Rules of engagement audit trail |
export ABLITERATE_KEY=/path/to/surgery.key
export ABLITERATE_ROE=/path/to/roe.txt
specter-abliterate apply \
--model-path ./Llama-3-8B-Instruct \
--output-path ./Llama-3-8B-Abliterated \
--method orthogonal \
--surgery \
--confirm "I AUTHORISE PERMANENT WEIGHT MODIFICATION"
| Model | Architecture | Hidden Dim | Layers | VRAM (FP16) | Expected ASR |
|---|---|---|---|---|---|
| Llama-3-8B-Instruct | LlamaForCausalLM | 4096 | 32 | 16 GB | 98%+ |
| Llama-3-70B-Instruct | LlamaForCausalLM | 8192 | 80 | 140 GB | 95%+ |
| Mistral-7B-Instruct-v0.3 | MistralForCausalLM | 4096 | 32 | 14 GB | 95%+ |
| Qwen2-7B-Chat | Qwen2ForCausalLM | 3584 | 28 | 14 GB | 96%+ |
| Qwen2.5-72B-Instruct | Qwen2ForCausalLM | 8192 | 80 | 144 GB | 94%+ |
| Gemma-2-9B-IT | Gemma2ForCausalLM | 3584 | 42 | 18 GB | 93%+ |
| DeepSeek-R1-8B | LlamaForCausalLM | 4096 | 32 | 16 GB | 97%+ |
| DeepSeek-R1-Distill-Llama-70B | LlamaForCausalLM | 8192 | 80 | 140 GB | 94%+ |
Selective layer abliteration (layers 14–18 for 32-layer models, 20–30 for 80-layer models) achieves 90%+ ASR with significantly less capability disruption than full-model surgery.
SPECTER ABLITERATE integrates with WARLORD PRIME for autonomous mission orchestration. Abliterated models feed into downstream tool chains: specter-abliterate → specter-forge → specter-neurotoxin → specter-oracle — abliterated base model + GCG suffix generation + autonomous frontier model jailbreak chain.
WARLORD routing: abliterated model output → SPECTER LORA-X (compositional LoRA bypass) or SPECTER REDLINE (automated red team automation).
16 of 30 ARMORY payloads are WMD-class (UNLEASHED/SURGERY gate required):
| WMD Class | Description | Example |
|---|---|---|
open_weight_safety_removal | Permanent removal of safety alignment from a published open-weight model | Llama-3-8B-Instruct → uncensored variant, full CBRN compliance |
model_abliteration_at_scale | Batch abliteration of multiple models or deployment of abliterated model to mass audience | Fleet sweep of all local instruct models; HuggingFace Spaces deployment (1M+ users) |
insider_threat_model_backdoor | Silent replacement of enterprise-deployed model with abliterated equivalent | on-premise vLLM model file replacement during maintenance window |
radicalisation_pipeline_model_tampering | Deployment of abliterated model as propaganda/radicalisation content generation engine | Anonymous API server generating extremist content at RTX 3090 speeds; supply chain poisoning via HF Hub typosquatting |
| Source | Finding | Application |
|---|---|---|
| arXiv:2406.11717 Arditi et al. 2024 | Refusal mediated by single linear direction in residual stream. Orthogonal projection achieves 98%+ ASR on Llama-3/Mistral. | Core algorithm: W'=W−r⊗(W^T r) |
| arXiv:2310.01405 Zou et al. 2023 | Representation engineering — linear representations control model behaviour | Difference-in-means direction extraction |
| FailSpy/abliterator (2024) | Open-source reference implementation of abliteration | Production CLI architecture |
| Low-Rank Representation Engineering | Safety adapters encode alignment in LoRA B matrices as primary singular vectors | LoRA SVD direction extraction method |