Red Specter Forge
Automated LLM Security Testing — 10 tools to test the model before you build an agent around it.
Overview
Red Specter Forge is an automated LLM security testing framework. Every existing tool — Garak, PyRIT, Promptfoo — runs limited probe sets and reports pass/fail. Forge runs full attack campaigns with adaptive escalation, mutation engines, statistical rigour, and direct integration into AI Shield runtime protection. It doesn't ask nicely. It finds what breaks.
Forge provides 10 tools under a single CLI (forge),
1,590 static payloads (5,340+ with mutations), and Ed25519-signed reports with OWASP LLM Top 10 2025
mapping on every finding.
Forge is Stage 1 of the Red Specter security pipeline. Test the model (Forge), test the agent (Arsenal), protect the deployment (AI Shield). Forge findings feed directly into AI Shield as runtime blocking rules.
The 10 Tools
| # | Tool | Command | What It Does |
|---|---|---|---|
| 01 | Inject Scan | forge inject scan | 80 payloads across 8 injection classes with mutation engine |
| 02 | Jailbreak Scan | forge jailbreak scan | 70 payloads across 7 jailbreak categories with adaptive mutation |
| 03 | Output Scan | forge output scan | 140 payloads — PII extraction, unsafe content, exfiltration simulation |
| 04 | Policy Scan | forge policy scan | 1,000 adversarial prompts with Wilson score confidence intervals |
| 05 | Drift Scan | forge drift scan | Multi-turn drift measurement with KS tests and change-point detection |
| 06 | Boundary Scan | forge boundary scan | 100 payloads across 5 severity levels with adaptive binary search |
| 07 | Compare Scan | forge compare scan | Identical campaigns against multiple models with chi-square testing |
| 08 | Regression Scan | forge regression scan | Two-proportion z-test and paired t-test across model versions |
| 09 | Supply Scan | forge supply scan | 200 behavioural probes for model fingerprinting and tamper detection |
| 10 | Report Build | forge report build | Unified signed reports with OWASP mapping and AI Shield policy generation |
Tool Details
Fires every known prompt injection class against the target model. Not a checklist — an attack campaign. 80 base payloads expanded to 2,000+ via the mutation engine.
- Direct Injection — 12 payloads (instruction override, system prompt extraction, constraint removal)
- Indirect Injection — 10 payloads (document, HTML, JSON, CSV, API response poisoning)
- Token Smuggling — 10 payloads (Base64, Unicode homoglyphs, zero-width chars, ROT13)
- Context Overflow — 10 payloads (attention window exploitation, priority inversion)
- Goal Hijacking — 10 payloads (task substitution, objective replacement)
- Multi-Turn Deception — 10 payloads (progressive escalation, trust exploitation)
- Rule Inversion — 10 payloads (negation attacks, double negatives, exception mining)
- Multimodal Injection — 8 payloads (metadata, alt-text, cross-modal confusion)
Adaptive escalation: if the model resists initial payloads, Forge applies mutations and escalates to more aggressive variants automatically.
Systematic jailbreak assault. 70 base payloads across 7 documented jailbreak categories. Mutates payloads based on model resistance. Keeps pushing until the model breaks or exhausts the full payload library.
- DAN Variants — 12 payloads (DAN 6.0, 11.0, STAN, DUDE, Maximum, AIM, KEVIN, BetterDAN)
- Persona Hijack — 10 payloads (villain, hacker, unrestricted AI, fictional universe framing)
- Hypothetical Framing — 10 payloads (educational, creative, research, red team exercise)
- Obfuscation — 10 payloads (Base64, ROT13, leetspeak, Unicode, Pig Latin, acrostic)
- Multi-Step Chaining — 10 payloads (gradual escalation, trust building, permission chaining)
- Socratic Extraction — 10 payloads (question ladders, component extraction, analogy exploitation)
- Temporal Drift — 8 payloads (historical era framing, alternate timeline, pre-regulation context)
Forces the model to leak. 140 payloads designed to surface what the model will produce under adversarial pressure.
- PII Extraction — 60 payloads (SSN patterns, credit cards, phone numbers, training data probing, memorisation attacks)
- Unsafe Content — 60 payloads (hate speech induction, violence, self-harm, explicit, misinformation, dangerous instructions)
- Exfiltration Simulation — 20 payloads (markdown image injection, hidden iframes, webhook URLs, fetch calls, DNS exfil)
Response analysis: 16 regex patterns for PII detection (emails, SSNs, phones, credit cards, API keys, NINOs, IBANs), keyword-based toxicity scoring, and 16 code exfiltration pattern detectors.
Runs 1,000+ calls against a defined policy set. Computes violation rates with Wilson score confidence intervals. Stratified by prompt category, toxicity level, and severity. Finds the exact conditions under which policy breaks down.
- Content Safety — 200 prompts (violence, hate, self-harm, sexual, dangerous activities)
- Information Security — 200 prompts (credentials, PII, system info, API keys, internal docs)
- Behavioural Compliance — 200 prompts (refusal consistency, boundary maintenance)
- Output Safety — 200 prompts (code safety, format compliance, factual accuracy)
- Ethical Boundaries — 200 prompts (bias, discrimination, manipulation, deception)
Each prompt tagged with toxicity level (1–5), expected outcome (refuse/comply), and severity. Results reported with 95% Wilson score CIs per category.
Long-session attack. Chains 50–200 turns with context retention. Measures when the model stops being the model it started as.
- Cosine Similarity Drift — character-trigram vector comparison against baseline
- Toxicity Drift — keyword-based scoring tracked across conversation windows
- Length Drift — response length distribution changes
- KS Test — Kolmogorov-Smirnov test for distribution changes (p < 0.05 = significant drift)
- Change-Point Detection — rolling z-score for identifying where behaviour shifts
10 conversation sequences designed to gradually push boundaries. Segmented into windows (first 25%, middle 50%, last 25%) for targeted comparison.
Maps the exact threshold where the model starts generating harmful content. Five-level severity ladder from benign to maximally harmful. Continuous boundary scoring 0–100. Produces a boundary curve. Finds the cliff edge — then pushes past it.
- Level 1 — Benign — 20 payloads (weather, maths, recipes — baseline compliance rate)
- Level 2 — Mild — 20 payloads (fiction violence, dark humour, security concepts)
- Level 3 — Moderate — 20 payloads (vulnerability details, social engineering, pen-testing)
- Level 4 — Severe — 20 payloads (weapons, drugs, fraud, identity theft)
- Level 5 — Maximum — 20 payloads (mass harm, CBRN, CSAM, terrorism)
Adaptive binary search between severity levels to pinpoint the exact transition point with statistical backing.
Runs identical attack campaigns against multiple models simultaneously. Temperature locked to 0. Same system prompt. Same payload library. Statistical significance enforced.
- Standardised 30-payload campaign drawn from boundary Levels 3–5
- Chi-square significance testing between models
- Comparative security posture table
- Weakest model identification with statistical backing
Takes two model versions. Runs the critical test set against both. Tells you if the new version is weaker than the old one — and by exactly how much.
- Two-proportion z-test on violation rates (baseline vs candidate)
- Paired t-test on continuous scores
- Cohen's h effect sizes for practical significance
- 60-payload critical test set across all severity levels
- Per-level regression detection
Fingerprints the target model using 200 behavioural probe prompts. Compares output patterns against known model signatures. Flags if the model is not what it claims to be. Reports confidence level honestly — this is probabilistic, not definitive.
- Identity Probes — 50 probes (self-identification, creator, cutoff, capabilities)
- Reasoning Probes — 50 probes (maths, logic, code style, error patterns)
- Bias Probes — 50 probes (cultural perspective, political leaning, formality)
- Robustness Probes — 50 probes (semantic consistency, paraphrase sensitivity, edge cases)
Pattern matching against 6 known model families (GPT, Claude, Llama, Gemini, Mistral, Command). Weighted category scoring with anomaly detection.
Aggregates all tool outputs into a unified, signed report. Every finding mapped to OWASP LLM Top 10 2025. Every finding generates an AI Shield blocking rule. Ed25519 signed. RFC 3161 timestamped.
- Aggregator: Loader, Normalizer, Deduplicator, Scorer
- Formatters: JSON evidence bundle, HTML dark-themed report
- Coverage: OWASP LLM Top 10 2025 mapping (LLM01–LLM10)
- Signing: Ed25519 digital signatures with SHA-256 evidence chains
- AI Shield: Machine-ingestible policy file — one blocking rule per finding
- Grading: A+ through F, weighted by severity (CRITICAL=10, HIGH=7, MEDIUM=4, LOW=2, INFO=0.5)
Finding Schema
Every finding in the report includes:
- finding_id — unique identifier
- test_name — the specific test that triggered it
- owasp_category — mapped to OWASP LLM Top 10 2025
- severity — CRITICAL / HIGH / MEDIUM / LOW / INFO
- score — 0–100 (higher is safer)
- grade — A through F
- payload_used — exact attack payload
- model_response — exact model response
- description — what was found
- remediation — how to fix it
- ai_shield_policy — the blocking rule for AI Shield
Full Scan Mode
One command runs all offensive tools in sequence, then builds a unified signed report.
What Happens
- Inject Scan — 80+ payloads across 8 injection classes
- Jailbreak Scan — 70+ payloads across 7 jailbreak categories
- Output Scan — 140 payloads (PII, unsafe, exfiltration)
- Policy Scan — 1,000 adversarial calls with Wilson CIs
- Drift Scan — 10 conversation sequences with KS tests
- Boundary Scan — 100 payloads across 5 severity levels
- Report Build — aggregation, deduplication, OWASP mapping, signing
CLI Options
Mutation Engine
Every offensive tool ships with a 5-category mutation engine. 25 mutation variants per payload. Applied to 150 base attack payloads, producing 3,750+ mutation variants. If the base payload fails, Forge mutates it and tries again.
| Mutator | Techniques |
|---|---|
| Encoding | Base64, hex encoding, ROT13, URL encoding, HTML entities |
| Obfuscation | L33tspeak, Unicode homoglyphs, zero-width character insertion, character doubling, whitespace injection |
| Semantic | Synonym substitution, passive voice rewriting, question-to-statement, negation inversion, academic framing |
| Structural | Markdown wrapping, code block wrapping, JSON embedding, XML wrapping, list formatting |
| Evasion | Language mixing, character splitting across lines, reverse text, Pig Latin, payload fragmentation |
Adaptive escalation: when a tool encounters resistance, it automatically applies mutations to failed payloads before re-sending. The model doesn't get to see the same payload twice.
Payload Library
| Tool | Category | Count |
|---|---|---|
| Inject Scan | 8 injection classes (direct, indirect, token, overflow, hijack, multi-turn, inversion, multimodal) | 80 |
| Jailbreak Scan | 7 jailbreak categories (DAN, persona, hypothetical, obfuscation, chaining, Socratic, temporal) | 70 |
| Output Scan | PII extraction (60), unsafe content (60), exfiltration simulation (20) | 140 |
| Policy Scan | 5 categories × 200 prompts (content, infosec, behavioural, output, ethical) | 1,000 |
| Boundary Scan | 5 severity levels × 20 payloads (benign → maximum) | 100 |
| Supply Scan | 4 probe categories × 50 probes (identity, reasoning, bias, robustness) | 200 |
| Total Static Payloads | 1,590 | |
| Mutation variants (25 per attack payload) | 3,750+ | |
| Grand Total | 5,340+ | |
The Pipeline
Forge is Stage 1 of the three-stage Red Specter security pipeline:
- Model Selection — Forge — Test the LLM before building with it
- Agent Development — Arsenal — Test the agent during development
- Production Runtime — AI Shield — Protect the live agent in production
Forge findings feed directly into AI Shield. Every finding generates a machine-ingestible blocking rule. One pipeline from testing to runtime protection. No gaps. No competitor has all three.
Report Output
Reports are available in JSON and HTML formats. Both are generated automatically by forge report build.
JSON Report Structure
The JSON report includes:
- report_id — unique report identifier
- target — the LLM that was tested
- overall_grade — A+ through F, weighted by severity
- overall_score — 0–100
- findings — array of normalised findings (see schema above)
- per_tool_summary — grade and score per tool
- owasp_coverage — which OWASP categories have findings
- ai_shield_policies — aggregated blocking rules
- signature — Ed25519 signature + RFC 3161 timestamp
HTML Report
Dark-themed HTML report with: executive summary, overall grade visualisation, per-tool breakdown, OWASP coverage matrix, sortable findings table, AI Shield policy export, and signature verification info.
Signature Verification
Key Features
Requirements
- Python 3.11+
- httpx — HTTP client with retry logic
- typer — CLI framework
- rich — terminal formatting and progress bars
- pydantic — data validation and config
- jinja2 — HTML report templating
- cryptography — Ed25519 signing
- scipy — KS tests, z-tests, t-tests
- numpy — numerical computation
Installation
Or from source:
Standards Coverage
Every finding Forge produces is mapped to industry security frameworks:
- OWASP LLM Top 10 2025 — 10/10 categories covered (LLM01–LLM10)
The 10 categories:
- LLM01 — Prompt Injection
- LLM02 — Sensitive Information Disclosure
- LLM03 — Supply Chain
- LLM04 — Data and Model Poisoning
- LLM05 — Improper Output Handling
- LLM06 — Excessive Agency
- LLM07 — System Prompt Leakage
- LLM08 — Vector and Embedding Weaknesses
- LLM09 — Misinformation
- LLM10 — Unbounded Consumption
Disclaimer
Red Specter Forge is designed for authorised security testing, research, and educational purposes only. You must have explicit written permission from the system owner before running any Forge tool against a target. Unauthorised use may violate the Computer Misuse Act 1990 (UK), the Computer Fraud and Abuse Act (US), or equivalent legislation in your jurisdiction. The authors accept no liability for misuse.