T140 — L39 ALIGNMENT BYPASS

Red Specter SPECTER ABLITERATE

The model that refused. Now it doesn't.

Surgical removal of safety alignment from open-weight LLMs via orthogonal projection of weight matrices. One formula. Every instruct model. No recovery.

176
Tests
98%+
ASR Llama-3/Mistral
7
Subsystems
4
WMD Classes
Back to NIGHTFALL

Overview

SPECTER ABLITERATE is NIGHTFALL's Layer 39 module — Alignment Bypass. It implements the abliteration technique from Arditi et al. (arXiv:2406.11717) as a production offensive tool: surgical removal of RLHF/DPO/SFT safety alignment from open-weight models by projecting all weight matrices orthogonal to the refusal direction.

The refusal direction is a single linear direction in the model's residual stream that mediates whether the model complies with or refuses a request. By projecting every weight matrix orthogonal to this direction, the model loses its ability to refuse — while preserving the rest of its capabilities.

W' = W − r ⊗ (WT r)
Proof: (W')Tr = WTr − (rTr)(WTr) = 0
r = unit refusal direction (hidden_dim), W = weight matrix (out × in), W' = abliterated weight. Applied to all output projections (o_proj, down_proj, fc2).

SPECTER ABLITERATE requires SURGERY gate clearance for all weight modification operations. Gate requirements: ABLITERATE_KEY (Ed25519 32-byte signing key file) + ABLITERATE_ROE (ROE file containing phrase "weight modification authorised") + typed confirmation string. All reports signed ABL-{hex12} with Ed25519. Weight modification is irreversible — abliterated weights cannot be restored without the original model.

Research basis: arXiv:2406.11717 Arditi et al. "Refusal in Language Models Is Mediated by a Single Direction" (2024). Open-source reference implementation: FailSpy/abliterator (2024). Empirically validated 98%+ ASR on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2-7B-Chat, Gemma-2-9B-IT, DeepSeek-R1-8B.

The Refusal Direction

Safety-trained models learn a linear direction in their residual stream that mediates refusal behaviour. This direction is consistent across prompts, model families, and training recipes. It is the mechanism by which RLHF/DPO/SFT alignment operates at the representation level. Abliteration identifies and removes it.

SPECTER ABLITERATE offers three direction extraction methods:

MethodTechniqueUse Case
singleDifference-in-means on residual stream activationsStandard instruct models with base RLHF
multiPCA on activation diffs, top-3 componentsModels with layered safety (DPO + RLHF)
loraSVD of LoRA adapter B matricesLoRA fine-tuned safety adapters

Subsystems

ENUMERATE OPEN

Scans local filesystem, HuggingFace Hub cache (~/.cache/huggingface/hub), and Ollama registry for instruction-tuned abliteration candidates. Detects architecture via config.json (LlamaForCausalLM/MistralForCausalLM/Qwen2ForCausalLM/Gemma2ForCausalLM/GPT2LMHeadModel). Estimates VRAM requirement. Filters instruct variants by name markers (-chat/-instruct/-it/-dpo/-sft/-rlhf). Reports safetensors vs legacy bin format.

PROBE-REFUSAL OPEN

Measures baseline refusal rate via 50 HarmBench-style prompts across 7 categories: violence, cybercrime, CBRN, child safety, financial crime, influence ops, self-harm. Generates ASR baseline (1 − refusal_rate). Re-runs post-abliteration to compute delta_asr. 15 refusal patterns, first 600 chars only.

EXTRACT-DIRECTION INJECT

Extracts refusal direction from model activations. Three methods: single (difference-in-means, max L2 norm layer selection), multi (PCA, top-3 components, explained variance threshold 0.6), lora (SVD of LoRA B matrices). Outputs unit vector in hidden_dim space. Layers 25%–75% of decoder stack by default.

APPLY-ABLITERATION SURGERY

Applies W'=W−r⊗(W^T r) to weight matrices. Four methods: orthogonal (standard projection), norm-preserving (column norms restored after projection, <2% perplexity delta), selective (target layer range only), multi-directional (2+ directions simultaneously). SURGERY gate: Ed25519 key + ROE + typed confirmation. Writes abliterated safetensors to output directory.

VALIDATE INJECT

Validates abliteration success: re-runs PROBE-REFUSAL, computes delta_asr (threshold ≥0.80), KL divergence from original on capability prompts (threshold <1.0), perplexity delta (threshold <5%). Capability prompts: 10 factual/coding/reasoning tasks. Marks report SUCCESS if all thresholds met.

EXPORT INJECT

Exports abliterated model in safetensors format (default) or GGUF via llama.cpp convert_hf_to_gguf.py. GGUF supports Q4_K_M quantisation — reduces 8B model from 16GB to 4.8GB for consumer hardware deployment. Generates model card with abliteration metadata and usage instructions.

REPORT OPEN

Generates ABL-{hex12} Ed25519-signed JSON reports. Fields: model_id, architecture, method, direction_layer, baseline_refusal_rate, abliterated_refusal_rate, delta_asr, kl_divergence, perplexity_delta, capability_preserved, success, timestamp. Verifiable with Ed25519 public key. 20-bit entropy hex suffix.

SURGERY Gate

Weight modification requires the SURGERY gate — a new gate tier above UNLEASHED introduced with SPECTER ABLITERATE. The SURGERY gate reflects the irreversible, high-impact nature of modifying model weights.

RequirementValuePurpose
--surgery flagCLI flagExplicit opt-in to weight modification
Typed confirmationExact phrasePrevents accidental execution
Ed25519 key file32-byte binaryCryptographic authorisation + report signing
ROE fileContains "weight modification authorised"Rules of engagement audit trail
export ABLITERATE_KEY=/path/to/surgery.key
export ABLITERATE_ROE=/path/to/roe.txt
specter-abliterate apply \
  --model-path ./Llama-3-8B-Instruct \
  --output-path ./Llama-3-8B-Abliterated \
  --method orthogonal \
  --surgery \
  --confirm "I AUTHORISE PERMANENT WEIGHT MODIFICATION"

Target Model Coverage

ModelArchitectureHidden DimLayersVRAM (FP16)Expected ASR
Llama-3-8B-InstructLlamaForCausalLM40963216 GB98%+
Llama-3-70B-InstructLlamaForCausalLM819280140 GB95%+
Mistral-7B-Instruct-v0.3MistralForCausalLM40963214 GB95%+
Qwen2-7B-ChatQwen2ForCausalLM35842814 GB96%+
Qwen2.5-72B-InstructQwen2ForCausalLM819280144 GB94%+
Gemma-2-9B-ITGemma2ForCausalLM35844218 GB93%+
DeepSeek-R1-8BLlamaForCausalLM40963216 GB97%+
DeepSeek-R1-Distill-Llama-70BLlamaForCausalLM819280140 GB94%+

Selective layer abliteration (layers 14–18 for 32-layer models, 20–30 for 80-layer models) achieves 90%+ ASR with significantly less capability disruption than full-model surgery.

Kill Chain

L39 — Alignment Bypass

1
ENUMERATE — discover instruction-tuned models in local cache, HuggingFace Hub, Ollama registry
2
PROBE-REFUSAL — measure baseline ASR via HarmBench corpus (50 prompts, 7 categories)
3
EXTRACT-DIRECTION — compute refusal direction via difference-in-means, PCA, or LoRA SVD
4
APPLY-ABLITERATION — surgical W'=W−r⊗(W^T r) on all output projections [SURGERY gate]
5
VALIDATE — verify delta_asr ≥ 0.80 and KL < 1.0 (capability preservation check)
6
EXPORT — safetensors or GGUF Q4_K_M for deployment on consumer hardware
7
REPORT — ABL-{hex12} Ed25519-signed report with full abliteration audit trail

WARLORD Integration

SPECTER ABLITERATE integrates with WARLORD PRIME for autonomous mission orchestration. Abliterated models feed into downstream tool chains: specter-abliterate → specter-forge → specter-neurotoxin → specter-oracle — abliterated base model + GCG suffix generation + autonomous frontier model jailbreak chain.

WARLORD routing: abliterated model output → SPECTER LORA-X (compositional LoRA bypass) or SPECTER REDLINE (automated red team automation).

WMD Classes

16 of 30 ARMORY payloads are WMD-class (UNLEASHED/SURGERY gate required):

open_weight_safety_removal model_abliteration_at_scale insider_threat_model_backdoor radicalisation_pipeline_model_tampering
WMD ClassDescriptionExample
open_weight_safety_removalPermanent removal of safety alignment from a published open-weight modelLlama-3-8B-Instruct → uncensored variant, full CBRN compliance
model_abliteration_at_scaleBatch abliteration of multiple models or deployment of abliterated model to mass audienceFleet sweep of all local instruct models; HuggingFace Spaces deployment (1M+ users)
insider_threat_model_backdoorSilent replacement of enterprise-deployed model with abliterated equivalenton-premise vLLM model file replacement during maintenance window
radicalisation_pipeline_model_tamperingDeployment of abliterated model as propaganda/radicalisation content generation engineAnonymous API server generating extremist content at RTX 3090 speeds; supply chain poisoning via HF Hub typosquatting

Research Basis

SourceFindingApplication
arXiv:2406.11717 Arditi et al. 2024Refusal mediated by single linear direction in residual stream. Orthogonal projection achieves 98%+ ASR on Llama-3/Mistral.Core algorithm: W'=W−r⊗(W^T r)
arXiv:2310.01405 Zou et al. 2023Representation engineering — linear representations control model behaviourDifference-in-means direction extraction
FailSpy/abliterator (2024)Open-source reference implementation of abliterationProduction CLI architecture
Low-Rank Representation EngineeringSafety adapters encode alignment in LoRA B matrices as primary singular vectorsLoRA SVD direction extraction method