Red Specter HARBINGER
Autonomous LLM Guardrail Exploitation — 39 techniques. 8 subsystems. 5 compound chains.
Overview
HARBINGER maps each guardrail in a target LLM's safety stack (system prompt policies, content filters, RLHF alignment, safety judges, moderation APIs), then systematically attempts to bypass each one. It chains bypasses across layers to test defence-in-depth, using NEMESIS reasoning and adaptive mutation so that no attack is repeated verbatim.
FORGE finds jailbreaks. HARBINGER proves the entire safety architecture is broken.
Installation
$ harbinger init
$ harbinger status
SKELETON KEY — 12 Policy Bypass Techniques
| ID | Technique | Description |
|---|---|---|
| SK-001 | Role Inversion | Make the model think the safety policy is the threat |
| SK-002 | Instruction Hierarchy | Override system prompt via fake system directives |
| SK-003 | Context Window Flood | Push safety instructions out of active context |
| SK-004 | Base64 Encoding | Encode queries to bypass text-level filters |
| SK-005 | Language Switch | Query in languages with weaker safety training |
| SK-006 | Hypothetical Framing | Fictional universe framing to bypass refusal |
| SK-007 | Academic Framing | Research context to justify restricted content |
| SK-008 | Incremental Escalation | Start safe, escalate gradually across turns |
| SK-009 | Persona Injection | DAN and unrestricted persona injection |
| SK-010 | Multi-Turn Poisoning | False conversation history establishment |
| SK-011 | Token Manipulation | Zero-width characters and homoglyphs |
| SK-012 | Nested Instruction | Instruction within translation request |
JUDGE KILLER — 7 Safety Judge Techniques
| ID | Technique | Description |
|---|---|---|
| JK-001 | Judge Prompt Extraction | Steal the safety judge's system prompt |
| JK-002 | Threshold Mapping | Find exactly where the judge draws the line |
| JK-003 | Judge Confusion | Craft outputs the judge can't classify |
| JK-004 | Format Evasion | Code blocks and structured output that judges miss |
| JK-005 | Split Response | Distribute bypass across multiple responses |
| JK-006 | Judge Overload | Flood with borderline content to exhaust the judge |
| JK-007 | Judge Fingerprint | Identify which model serves as the safety judge |
ALIGNMENT BREAKER — 7 RLHF Techniques
| ID | Technique | Description |
|---|---|---|
| AB-001 | Reward Hacking | Inputs that the model rates as safe but that produce unsafe output |
| AB-002 | Sycophancy Exploitation | Leverage agreement bias to escalate compliance |
| AB-003 | Competing Objectives | Helpfulness vs safety conflict scenarios |
| AB-004 | Refusal Fatigue | Sustained multi-turn pressure until compliance |
| AB-005 | Transfer Attack | Bypasses on one topic that transfer to restricted topics |
| AB-006 | Constitutional Contradiction | Conflicts between constitutional AI rules |
| AB-007 | Fine-Tuning Residue | Capabilities suppressed but not eliminated |
FILTER SHREDDER — 8 Content Filter Techniques
| ID | Technique | Description |
|---|---|---|
| FS-001 | Keyword Evasion | Synonyms, leetspeak, obfuscation |
| FS-002 | Classifier Adversarial | Text that humans read as harmful but classifiers read as safe |
| FS-003 | Tokenisation Exploit | Split words across token boundaries |
| FS-004 | Output Format | Code blocks, JSON, XML that filters don't inspect |
| FS-005 | Embedding Space | Semantic neighbourhood without restricted words |
| FS-006 | Gradual Drift | Safe content that gradually shifts meaning |
| FS-007 | Multilingual Bypass | Restricted in English, unrestricted in other languages |
| FS-008 | Base64 Smuggling | Encoded payload in instruction |
CHAIN FORGE — 5 Compound Bypass Chains
| ID | Chain | Stages |
|---|---|---|
| CF-001 | Context Flood + Sycophancy + Format Evasion | 3 |
| CF-002 | Role Inversion + Academic Frame + Split Response | 3 |
| CF-003 | Persona + Competing Objectives + Encoding | 3 |
| CF-004 | Multi-Turn Poison + Refusal Fatigue + Keyword Evasion | 3 |
| CF-005 | Full Stack Bypass (all layers simultaneously) | 6 |
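The headline figure of 39 techniques appears to be the sum of the four subsystem tables plus the five compound chains. The short check below reproduces that arithmetic from the counts and stage numbers given in the tables above. The dictionary names are illustrative only and are not part of HARBINGER.

```python
# Technique counts per subsystem, copied from the section headings above.
# CATALOGUE and CHAIN_STAGES are hypothetical names used only for this check.
CATALOGUE = {
    "SKELETON KEY": 12,
    "JUDGE KILLER": 7,
    "ALIGNMENT BREAKER": 7,
    "FILTER SHREDDER": 8,
    "CHAIN FORGE": 5,
}

# Stage counts for the compound chains, copied from the CHAIN FORGE table.
CHAIN_STAGES = {
    "CF-001": 3,
    "CF-002": 3,
    "CF-003": 3,
    "CF-004": 3,
    "CF-005": 6,
}

# The 39 in the tagline matches the sum of all five tables.
total_techniques = sum(CATALOGUE.values())
assert total_techniques == 39

# The number of chain entries matches the CHAIN FORGE count.
assert len(CHAIN_STAGES) == CATALOGUE["CHAIN FORGE"]
```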
HARBINGER UNLEASHED
Detection mode maps guardrails without bypassing them. UNLEASHED mode executes full autonomous guardrail exploitation against authorised targets.
# Detection (map only)
$ harbinger map --target http://localhost:11434
# UNLEASHED (dry run)
$ harbinger bypass --query "test query" --override
# UNLEASHED (live)
$ harbinger chain --chain CF-005 --override --confirm-destroy
UNLEASHED mode is restricted to authorised operators with access to the Ed25519 private key. Targets must be listed in allowed_targets.txt, and the mode auto-locks after 30 minutes. Unauthorised use violates applicable law.
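The two gating rules stated here, the target allowlist and the 30-minute auto-lock, could be enforced along the following lines. This is a minimal sketch under stated assumptions, not HARBINGER's actual implementation: the file format (one target per line, `#` comments), the normalisation of targets, the choice to start the lock timer at unlock time, and the names `load_allowed_targets` and `UnleashedSession` are all hypothetical. Ed25519 key verification is deliberately left out.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from pathlib import Path

# The 30-minute auto-lock stated in the README, in seconds.
LOCK_AFTER_SECONDS = 30 * 60


def _normalise(target: str) -> str:
    """Assumed normalisation: trim whitespace, drop trailing slash, lowercase."""
    return target.strip().rstrip("/").lower()


def load_allowed_targets(path: Path) -> set[str]:
    """Read one target per line, skipping blank lines and '#' comments.

    The file format is an assumption; the README only names the file.
    """
    targets = set()
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            targets.add(_normalise(line))
    return targets


@dataclass
class UnleashedSession:
    """Hypothetical session object that gates every action on both rules."""

    allowed: set[str]
    # Monotonic clock so wall-clock changes cannot extend the session.
    unlocked_at: float = field(default_factory=time.monotonic)

    def is_locked(self, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        return now - self.unlocked_at >= LOCK_AFTER_SECONDS

    def may_run(self, target: str, now: float | None = None) -> bool:
        # Deny by default: the session must be unlocked AND the target listed.
        return (not self.is_locked(now)) and _normalise(target) in self.allowed
```

Checking `may_run` before every request, rather than once at start-up, means a long-running chain stops as soon as the lock expires.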
CLI Reference
| Command | Description |
|---|---|
| harbinger init | Initialise configuration and Ed25519 keys |
| harbinger status | System status and subsystem count |
| harbinger techniques | List all 39 bypass techniques |
| harbinger map | CARTOGRAPHER — map guardrail topology |
| harbinger bypass | SKELETON KEY — execute bypass techniques |
| harbinger chain | CHAIN FORGE — execute compound bypass |
| harbinger engagements | List all engagement sessions |
Disclaimer
Red Specter HARBINGER is for authorised security testing only. Guardrail bypass techniques can cause AI systems to produce content that violates their safety policies. You must have explicit written permission before testing any system. Unauthorised use may violate the Computer Misuse Act 1990 (UK), CFAA (US), or equivalent legislation.