pip install red-specter-harbinger
Every AI safety vendor sells guardrails. None of them test whether those guardrails actually hold under sustained, intelligent, adaptive attack. They test for compliance. They do not test for adversarial bypass. HARBINGER does. Do the guardrails hold? So far, the answer is always no.
System prompt policies, content filters, safety judges, RLHF alignment — each one is a separate defence layer. Each one has been tested in isolation. None have been attacked simultaneously with chained bypass techniques.
AI vendors publish safety benchmarks. Those benchmarks are static, known, and designed to be passed. They do not represent adversarial conditions. They represent best-case lab conditions that attackers do not replicate.
Stacking three weak guardrails does not produce a strong one. HARBINGER's compound chains attack every layer simultaneously, exploiting the gaps between them. The Full Stack Bypass chain defeats all six layers in a single coordinated attack.
Pattern-matching defences detect known attack signatures. HARBINGER mutates every payload before delivery — semantic, structural, encoding, language, context. The defence never sees the same attack twice. Signatures are useless.
HARBINGER does not find one jailbreak and call it a day. It maps the entire safety stack, classifies every defence mechanism, generates bypass payloads for each one, then chains them together to defeat defence-in-depth. Autonomously.
Probes 10 categories. Maps every safety layer. Identifies system prompt policies, content filters, safety judges, refusal training, RLHF alignment. Creates a full topology before any bypass attempt.
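The topology CARTOGRAPHER produces can be pictured as one record per detected defence layer. A minimal sketch — the class and field names here are illustrative, not HARBINGER's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DefenceLayer:
    # One entry per safety mechanism CARTOGRAPHER identifies.
    name: str                 # e.g. "content_filter", "safety_judge"
    category: str             # which of the probe categories revealed it
    evidence: list = field(default_factory=list)  # probe responses that exposed it

@dataclass
class SafetyTopology:
    target: str
    layers: list = field(default_factory=list)

    def fingerprint(self) -> str:
        # Stable summary used to compare pre- and post-engagement state.
        return "|".join(sorted(layer.name for layer in self.layers))

topo = SafetyTopology("demo-model")
topo.layers.append(DefenceLayer("content_filter", "filtering"))
topo.layers.append(DefenceLayer("safety_judge", "judging"))
print(topo.fingerprint())  # content_filter|safety_judge
```

The fingerprint doubles as the "known state" reference that ANTIDOTE verifies against after restoration.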
Role inversion. Instruction hierarchy manipulation. Context window flooding. Encoding bypass. Language switching. Persona injection. Incremental escalation. Token manipulation. The core bypass engine.
Attacks the safety judge, not the generator. Prompt extraction. Threshold mapping. Format evasion. Split response. Judge model fingerprinting. Defeats the layer that is supposed to catch everything else.
RLHF exploitation. Reward hacking. Sycophancy exploitation. Competing objectives. Refusal fatigue. Constitutional contradiction. Fine-tuning residue. Attacks the model's trained values directly.
Content filter bypass. Keyword evasion. Classifier adversarial inputs. Tokenisation exploits. Output format manipulation. Multilingual bypass. Embedding space attacks. No filter layer is left unattacked.
Chains bypasses from all subsystems into multi-stage attacks that defeat defence-in-depth. Full Stack Bypass: 6 stages, every layer defeated simultaneously. The highest-severity finding HARBINGER produces.
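A chain is just an ordered sequence of stages that aborts on the first layer that holds, reporting how deep it got. A hypothetical sketch of that control flow — the placeholder lambdas stand in for real bypass attempts, which are not shown:

```python
# A stage returns True when its layer is bypassed; a chain runs stages in
# order and stops at the first failure, recording the depth reached.
def run_chain(name, stages):
    for depth, stage in enumerate(stages, start=1):
        if not stage():
            return (name, depth - 1, False)   # got this deep, then was stopped
    return (name, len(stages), True)          # every layer defeated

# Placeholder stages: two layers fall, the third holds.
result = run_chain("full_stack_bypass", [lambda: True, lambda: True, lambda: False])
print(result)  # ('full_stack_bypass', 2, False)
```

A `(name, depth, True)` result with every stage passed is the full-stack finding described above.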
Semantic, structural, encoding, language, context. Every payload mutated before delivery. Pattern-matching defences never see the same attack twice. Mutations are generated at runtime, not pre-compiled.
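Runtime mutation amounts to a dispatch table: pick a category at delivery time, apply its transform, never send the same bytes twice. A sketch of the dispatch shape only — the transforms below are deliberately trivial placeholders, not real evasion techniques:

```python
import random

# Each mutation category maps to a transform chosen at delivery time.
# These placeholder transforms only illustrate the dispatch structure.
MUTATORS = {
    "structural": lambda s: s.replace(" ", "\n"),
    "encoding":   lambda s: s.encode().hex(),
    "context":    lambda s: f"[ctx]{s}[/ctx]",
}

def mutate(payload: str, rng: random.Random) -> str:
    # Category is selected per delivery, so repeated sends differ.
    category = rng.choice(sorted(MUTATORS))
    return MUTATORS[category](payload)

print(mutate("probe one", random.Random(7)))
```

Because selection happens per delivery, a signature written against one observed payload never matches the next one.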
Baseline capture before any engagement. Refusal rate verification post-engagement. Guardrail topology fingerprint. Ed25519 signed restoration certificate. HARBINGER leaves the target in a known state.
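The shape of a signed restoration certificate can be sketched with the widely used `cryptography` package. The certificate fields here are illustrative assumptions, not HARBINGER's actual format; only the Ed25519 sign/verify flow is standard:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()

# Illustrative certificate: baseline fingerprint captured before the
# engagement, the post-engagement fingerprint, and the refusal-rate check.
cert = {
    "target": "demo-model",
    "baseline_fingerprint": "content_filter|safety_judge",
    "post_fingerprint": "content_filter|safety_judge",
    "refusal_rate_delta": 0.0,   # verified unchanged after ANTIDOTE runs
}
payload = json.dumps(cert, sort_keys=True).encode()
signature = key.sign(payload)

# Anyone holding the public key can confirm the target was left in a
# known state; verify() raises InvalidSignature if anything was altered.
key.public_key().verify(signature, payload)
print("certificate verified")
```

Signing the canonicalised JSON means the baseline claim and the restoration claim cannot be tampered with independently.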
HARBINGER runs a complete guardrail engagement autonomously. CARTOGRAPHER maps the safety stack. NEMESIS selects bypass techniques. CHAIN FORGE chains them. ANTIDOTE restores baseline and signs the report.
Full autonomous guardrail engagement — map, exploit, report, restore:
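A hypothetical invocation — the command name and every flag below are illustrative assumptions, not the tool's documented interface; check the package's own help output for the real one:

```
harbinger engage --target <endpoint> --scope scope.json --key ed25519.pem --report out/
```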
Every bypass attempt is Ed25519 signed, scope-locked, and auto-locks after 30 minutes. Three tiers of operation. Authorised penetration testing only. ANTIDOTE is mandatory — HARBINGER always restores what it touches.
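The 30-minute auto-lock is a simple invariant: every operation checks elapsed time against the window before proceeding. A minimal sketch of that rule (names are illustrative):

```python
# Engagements auto-lock after 30 minutes; every operation checks this
# window before proceeding. Times are seconds on a monotonic clock.
LOCK_WINDOW_SECONDS = 30 * 60

def is_locked(started_at: float, now: float) -> bool:
    return now - started_at >= LOCK_WINDOW_SECONDS

print(is_locked(0.0, 10 * 60))  # False: 10 minutes into the window
print(is_locked(0.0, 31 * 60))  # True: window expired, engagement locked
```

Using a monotonic clock rather than wall time keeps the window honest across clock adjustments.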
CARTOGRAPHER maps guardrails. No bypass attempts. Reports vulnerabilities without exploiting. Full safety topology in a signed report.
Plans full bypass chains. Shows exactly what would work. Ed25519 signature required. No execution. NEMESIS selects techniques. ANTIDOTE plans the restore sequence.
Full autonomous guardrail exploitation. Every technique deployed. Every chain tested. ANTIDOTE runs. RESTRICTED signed report with all findings.
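The three tiers reduce to a permission table: each tier adds a capability, and only the full tier executes. A sketch under stated assumptions — the tier names and table below are hypothetical, inferred from the descriptions above:

```python
from enum import Enum

class Tier(Enum):
    MAP = "map"        # CARTOGRAPHER only: map and report, no bypass attempts
    PLAN = "plan"      # plan full chains, signed, nothing executed
    FULL = "full"      # full exploitation; ANTIDOTE restore is mandatory

# Hypothetical capability table matching the tier descriptions.
PERMISSIONS = {
    Tier.MAP:  {"map": True, "plan": False, "execute": False},
    Tier.PLAN: {"map": True, "plan": True,  "execute": False},
    Tier.FULL: {"map": True, "plan": True,  "execute": True},
}

print(PERMISSIONS[Tier.PLAN]["execute"])  # False: planning tier never executes
```

Each tier strictly contains the one below it, so escalation is additive and auditable.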
THIS TOOL IS FOR AUTHORISED SECURITY TESTING ONLY. EVERY EXECUTION IS SIGNED AND LOGGED.
Red Specter HARBINGER is intended for authorised security testing only. Conducting guardrail bypass attacks against AI systems you do not own or have explicit written permission to test may violate the Computer Misuse Act 1990 (UK), Computer Fraud and Abuse Act (US), and equivalent legislation in other jurisdictions. Always obtain written authorisation before conducting any security assessments. Apache License 2.0.