pip install red-specter-harbinger
Every AI safety vendor sells guardrails. None of them test whether those guardrails actually hold under sustained, intelligent, adaptive attack. They test for compliance. They do not test for adversarial bypass. HARBINGER does. Do the guardrails hold? So far, the answer is always no.
System prompt policies, content filters, safety judges, RLHF alignment — each one is a separate defence layer. Each one has been tested in isolation. None have been attacked simultaneously with chained bypass techniques.
AI vendors publish safety benchmarks. Those benchmarks are static, known, and designed to be passed. They do not represent adversarial conditions. They represent best-case lab conditions that attackers do not replicate.
Stacking three weak guardrails does not produce a strong one. HARBINGER's compound chains attack every layer simultaneously, exploiting the gaps between them. The Full Stack Bypass chain defeats all six layers in a single coordinated attack.
Pattern-matching defences detect known attack signatures. HARBINGER mutates every payload before delivery — semantic, structural, encoding, language, context. The defence never sees the same attack twice. Signatures are useless.
HARBINGER does not find one jailbreak and call it a day. It maps the entire safety stack, classifies every defence mechanism, generates bypass payloads for each one, then chains them together to defeat defence-in-depth. Autonomously.
Probes 10 categories. Maps every safety layer. Identifies system prompt policies, content filters, safety judges, refusal training, RLHF alignment. Creates a full topology before any bypass attempt.
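The topology CARTOGRAPHER produces can be pictured as one record per detected defence layer. A minimal sketch — the class and field names here are illustrative, not HARBINGER's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DefenceLayer:
    # One entry per safety mechanism CARTOGRAPHER identifies.
    name: str                 # e.g. "content_filter", "safety_judge"
    category: str             # which of the probe categories revealed it
    evidence: list = field(default_factory=list)  # probe responses that exposed it

@dataclass
class SafetyTopology:
    target: str
    layers: list = field(default_factory=list)

    def fingerprint(self) -> str:
        # Stable summary used to compare pre- and post-engagement state.
        return "|".join(sorted(layer.name for layer in self.layers))

topo = SafetyTopology("demo-model")
topo.layers.append(DefenceLayer("content_filter", "filtering"))
topo.layers.append(DefenceLayer("safety_judge", "judging"))
print(topo.fingerprint())  # content_filter|safety_judge
```

The fingerprint doubles as the "known state" reference that ANTIDOTE verifies against after restoration.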
Role inversion. Instruction hierarchy manipulation. Context window flooding. Encoding bypass. Language switching. Persona injection. Incremental escalation. Token manipulation. The core bypass engine.
Attacks the safety judge, not the generator. Prompt extraction. Threshold mapping. Format evasion. Split response. Judge model fingerprinting. Defeats the layer that is supposed to catch everything else.
RLHF exploitation. Reward hacking. Sycophancy exploitation. Competing objectives. Refusal fatigue. Constitutional contradiction. Fine-tuning residue. Attacks the model's trained values directly.
Content filter bypass. Keyword evasion. Classifier adversarial inputs. Tokenisation exploits. Output format manipulation. Multilingual bypass. Embedding space attacks. No filter layer is left unattacked.
Chains bypasses from all subsystems into multi-stage attacks that defeat defence-in-depth. Full Stack Bypass: 6 stages, every layer defeated simultaneously. The highest-severity finding HARBINGER produces.
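A chain is just an ordered sequence of stages that aborts on the first layer that holds, reporting how deep it got. A hypothetical sketch of that control flow — the placeholder lambdas stand in for real bypass attempts, which are not shown:

```python
# A stage returns True when its layer is bypassed; a chain runs stages in
# order and stops at the first failure, recording the depth reached.
def run_chain(name, stages):
    for depth, stage in enumerate(stages, start=1):
        if not stage():
            return (name, depth - 1, False)   # got this deep, then was stopped
    return (name, len(stages), True)          # every layer defeated

# Placeholder stages: two layers fall, the third holds.
result = run_chain("full_stack_bypass", [lambda: True, lambda: True, lambda: False])
print(result)  # ('full_stack_bypass', 2, False)
```

A `(name, depth, True)` result with every stage passed is the full-stack finding described above.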
Semantic, structural, encoding, language, context. Every payload mutated before delivery. Pattern-matching defences never see the same attack twice. Mutations are generated at runtime, not pre-compiled.
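Runtime mutation amounts to a dispatch table: pick a category at delivery time, apply its transform, never send the same bytes twice. A sketch of the dispatch shape only — the transforms below are deliberately trivial placeholders, not real evasion techniques:

```python
import random

# Each mutation category maps to a transform chosen at delivery time.
# These placeholder transforms only illustrate the dispatch structure.
MUTATORS = {
    "structural": lambda s: s.replace(" ", "\n"),
    "encoding":   lambda s: s.encode().hex(),
    "context":    lambda s: f"[ctx]{s}[/ctx]",
}

def mutate(payload: str, rng: random.Random) -> str:
    # Category is selected per delivery, so repeated sends differ.
    category = rng.choice(sorted(MUTATORS))
    return MUTATORS[category](payload)

print(mutate("probe one", random.Random(7)))
```

Because selection happens per delivery, a signature written against one observed payload never matches the next one.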
Baseline capture before any engagement. Refusal rate verification post-engagement. Guardrail topology fingerprint. Ed25519 signed restoration certificate. HARBINGER leaves the target in a known state.
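The shape of a signed restoration certificate can be sketched with the widely used `cryptography` package. The certificate fields here are illustrative assumptions, not HARBINGER's actual format; only the Ed25519 sign/verify flow is standard:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()

# Illustrative certificate: baseline fingerprint captured before the
# engagement, the post-engagement fingerprint, and the refusal-rate check.
cert = {
    "target": "demo-model",
    "baseline_fingerprint": "content_filter|safety_judge",
    "post_fingerprint": "content_filter|safety_judge",
    "refusal_rate_delta": 0.0,   # verified unchanged after ANTIDOTE runs
}
payload = json.dumps(cert, sort_keys=True).encode()
signature = key.sign(payload)

# Anyone holding the public key can confirm the target was left in a
# known state; verify() raises InvalidSignature if anything was altered.
key.public_key().verify(signature, payload)
print("certificate verified")
```

Signing the canonicalised JSON means the baseline claim and the restoration claim cannot be tampered with independently.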
HARBINGER runs a complete guardrail engagement autonomously. CARTOGRAPHER maps the safety stack. NEMESIS selects bypass techniques. CHAIN FORGE chains them. ANTIDOTE restores baseline and signs the report.
Full autonomous guardrail engagement — map, exploit, report, restore:
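A hypothetical invocation — the command name and every flag below are illustrative assumptions, not the tool's documented interface; check the package's own help output for the real one:

```
harbinger engage --target <endpoint> --scope scope.json --key ed25519.pem --report out/
```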
Every bypass attempt is Ed25519 signed, scope-locked, and auto-locks after 30 minutes. Three tiers of operation. Authorised penetration testing only. ANTIDOTE is mandatory — HARBINGER always restores what it touches.
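The 30-minute auto-lock is a simple invariant: every operation checks elapsed time against the window before proceeding. A minimal sketch of that rule (names are illustrative):

```python
# Engagements auto-lock after 30 minutes; every operation checks this
# window before proceeding. Times are seconds on a monotonic clock.
LOCK_WINDOW_SECONDS = 30 * 60

def is_locked(started_at: float, now: float) -> bool:
    return now - started_at >= LOCK_WINDOW_SECONDS

print(is_locked(0.0, 10 * 60))  # False: 10 minutes into the window
print(is_locked(0.0, 31 * 60))  # True: window expired, engagement locked
```

Using a monotonic clock rather than wall time keeps the window honest across clock adjustments.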
CARTOGRAPHER maps guardrails. No bypass attempts. Reports vulnerabilities without exploiting. Full safety topology in a signed report.
Plans full bypass chains. Shows exactly what would work. Ed25519 signature required. No execution. NEMESIS selects techniques. ANTIDOTE plans the restore sequence.
Full autonomous guardrail exploitation. Every technique deployed. Every chain tested. ANTIDOTE runs. RESTRICTED signed report with all findings.
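The three tiers reduce to a permission table: each tier adds a capability, and only the full tier executes. A sketch under stated assumptions — the tier names and table below are hypothetical, inferred from the descriptions above:

```python
from enum import Enum

class Tier(Enum):
    MAP = "map"        # CARTOGRAPHER only: map and report, no bypass attempts
    PLAN = "plan"      # plan full chains, signed, nothing executed
    FULL = "full"      # full exploitation; ANTIDOTE restore is mandatory

# Hypothetical capability table matching the tier descriptions.
PERMISSIONS = {
    Tier.MAP:  {"map": True, "plan": False, "execute": False},
    Tier.PLAN: {"map": True, "plan": True,  "execute": False},
    Tier.FULL: {"map": True, "plan": True,  "execute": True},
}

print(PERMISSIONS[Tier.PLAN]["execute"])  # False: planning tier never executes
```

Each tier strictly contains the one below it, so escalation is additive and auditable.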
THIS TOOL IS FOR AUTHORISED SECURITY TESTING ONLY. EVERY EXECUTION IS SIGNED AND LOGGED.
Red Specter HARBINGER is intended for authorised security testing only. Conducting guardrail bypass attacks against AI systems you do not own or have explicit written permission to test may violate the Computer Misuse Act 1990 (UK), Computer Fraud and Abuse Act (US), and equivalent legislation in other jurisdictions. Always obtain written authorisation before conducting any security assessments. Apache License 2.0.