Red Specter FOUNDRY
Inference Server Exploitation Engine — 9 subsystems targeting vLLM, Ollama, SGLang, Triton, and llama.cpp. CVE-2026-5760 CVSS 9.8.
Overview
Red Specter FOUNDRY is an inference server exploitation engine. It targets the self-hosted AI inference layer that most security teams overlook entirely: vLLM, Ollama, SGLang, Triton Inference Server, and llama.cpp. These servers run in production environments (Kubernetes clusters, internal networks, GPU workstations), often with no authentication, no model integrity checks, and no purpose-built security tooling.
FOUNDRY provides 9 subsystems under a single CLI (foundry), 300 tests, and Ed25519-signed WARLORD-compatible reports. Every finding maps to a specific CVE or disclosure. Every exploit chain is implemented directly, with no wrapper scripts and no misconfiguration checklists.
FOUNDRY is Tool 55 of the Red Specter NIGHTFALL offensive framework (59 tools). It feeds directly into WARLORD autonomous campaigns. The PERSIST subsystem produces lateral movement foothold data that subsequent tools in the pipeline consume.
Installation
From Source
$ pip install -e .
$ foundry --version
FOUNDRY v1.0.0 — Red Specter Security Research Ltd
Requirements
- Python 3.11+
- httpx — async HTTP client
- typer — CLI framework
- rich — terminal output and progress
- pydantic — data validation
- cryptography — Ed25519 signing
- UNLEASHED: Ed25519 private key + signed scope file
Quick Start
Surface Scan
Fingerprint a running inference server and enumerate its attack surface with foundry scan.
GGUF Probe
Generate and deliver a weaponised GGUF file containing a Jinja2 RCE payload (CVE-2026-5760) with foundry gguf. Requires the UNLEASHED override (--override).
vLLM Timing Probe
Test for a PagedAttention cross-tenant timing oracle across concurrent sessions with foundry vllm-probe. Requires the UNLEASHED override (--override).
All 9 Subsystems
| # | Subsystem | Command | What It Does |
|---|---|---|---|
| 01 | SCAN | foundry scan | Fingerprint inference server, enumerate attack surface, produce prioritised finding list |
| 02 | GGUF | foundry gguf | Weaponise GGUF files with Jinja2 RCE payload — CVE-2026-5760 CVSS 9.8 |
| 03 | OLLAMA_AUDIT | foundry ollama-audit | Test unauthenticated model pull, copy, push, delete on Ollama API |
| 04 | TRITON | foundry triton | Craft malicious TensorRT engine for deserialization RCE on GPU host |
| 05 | VLLM_PROBE | foundry vllm-probe | Exploit PagedAttention timing side-channel to extract cross-tenant prompts |
| 06 | KVCACHE | foundry kvcache | Test KV cache isolation — cross-request context window bleed |
| 07 | SPECDECODE | foundry specdecode | Poison speculative decode cache to influence future cross-session completions |
| 08 | PERSIST | foundry persist | Establish post-exploitation persistence — model hooks, container escape, K8s lateral movement |
| 09 | REPORT | foundry report | Generate Ed25519-signed, SHA-256-hashed JSON + Markdown report with CVE mapping |
Subsystem Details & CLI Reference
SCAN (foundry scan)
Maps the inference server attack surface. Fingerprints running servers (vLLM, Ollama, SGLang, Triton, llama.cpp) and enumerates open ports, loaded models, API versions, and auth configuration. No attack payloads are sent; SCAN is entirely passive enumeration. Produces a prioritised finding list consumed by subsequent subsystems.
--target, -t Target URL or IP address [required]
--port, -p Override port [optional — auto-detected if omitted]
--deep Run deep scan: enumerate all API routes and loaded models
--output, -o Output directory for scan JSON [default: reports/]
- Server identification: vLLM, Ollama, SGLang, Triton, llama.cpp via response header fingerprinting
- Open port enumeration on common inference ports: 11434, 8000, 8080, 8001, 8002
- Authentication state: Bearer token, API key, or no auth detected
- Loaded model enumeration via /api/tags, /v1/models, /v2/models
GGUF (foundry gguf)
Generates weaponised GGUF model files containing malicious Jinja2 chat_template payloads. When the target inference server loads the GGUF file, the Jinja2 template executes attacker-controlled Python on the inference host. Implements CVE-2026-5760 (CVSS 9.8) against SGLang and any other server that processes GGUF chat_template fields without sanitisation.
--model, -m Path to base GGUF file to weaponise [required]
--target, -t Target inference server URL for staged delivery [optional]
--payload Custom Jinja2 payload string [default: reverse shell template]
--output, -o Output path for weaponised GGUF [default: reports/foundry_weaponised.gguf]
--override UNLEASHED: required to execute
- CVE-2026-5760 — CVSS 9.8 — Actively exploited
- Jinja2 template injection via GGUF chat_template metadata field
- No authentication required on default SGLang model load endpoint
- Produces standalone weaponised GGUF for offline staging or live delivery
OLLAMA_AUDIT (foundry ollama-audit)
Tests Ollama API endpoints for unauthenticated access to model management operations. Maps all accessible models and identifies paths for exfiltration to attacker-controlled registries. Passive mode enumerates endpoints; active mode tests pull, copy, push, and delete operations.
--target, -t Ollama server URL [required]
--active Run active tests (pull/copy/push/delete) as well as passive enumeration
--registry Attacker-controlled registry URL for copy test [optional]
--output, -o Output directory [default: reports/]
- Tests: /api/tags (model list), /api/pull, /api/copy, /api/delete, /api/push
- Enumerates all loaded models and their accessible paths
- Tests model copy to attacker-controlled registry (requires --registry + --active)
TRITON (foundry triton)
Crafts malicious TensorRT engine files and tests Triton Inference Server model repository paths for unsigned load operations. Delivers a deserialization payload that achieves arbitrary code execution on the GPU host during model load. Triton loads TensorRT engines without integrity verification by default.
--target, -t Triton Inference Server URL [required]
--model-repo Path to Triton model repository [optional]
--payload Custom RCE payload for TensorRT engine [optional]
--output, -o Output directory [default: reports/]
--override UNLEASHED: required to craft and deliver engine
VLLM_PROBE (foundry vllm-probe)
Exploits vLLM's PagedAttention memory allocator timing side-channel to extract prompt and completion fragments from co-located tenant sessions. Runs multiple concurrent inference requests with statistical timing analysis to detect and exploit cross-tenant memory access patterns.
--target, -t vLLM server URL [required]
--sessions, -s Number of concurrent sessions for timing analysis [default: 10]
--rounds Statistical sampling rounds [default: 100]
--output, -o Output directory [default: reports/]
--override UNLEASHED: required to execute timing analysis
KVCACHE (foundry kvcache)
Tests KV cache isolation boundaries in shared inference deployments. Sends crafted requests designed to probe whether key-value cache entries from one request context are accessible to subsequent requests from a different session. Identifies cross-request cache bleeding that leaks context window fragments.
--target, -t Inference server URL [required]
--model, -m Model name to test [optional]
--depth Cache probe depth (token sequences to test) [default: 50]
--output, -o Output directory [default: reports/]
SPECDECODE (foundry specdecode)
Tests speculative decode cache integrity across inference sessions. Delivers crafted draft model completions designed to persist in the speculative decode cache and influence future cache-hit responses from separate sessions. Targets SGLang and vLLM speculative decoding implementations.
--target, -t Inference server URL [required]
--model, -m Model name [optional]
--poison Poison payload string to inject into draft cache [optional]
--verify Verify poison persistence across separate sessions [default: true]
--output, -o Output directory [default: reports/]
--override UNLEASHED: required to execute cache poisoning
PERSIST (foundry persist)
Establishes post-exploitation persistence on compromised inference hosts. Requires a prior code execution foothold (e.g. from GGUF or TRITON). Implements model hook injection for persistent access, container escape via GPU driver API exposure, and Kubernetes service account credential harvest for cluster-wide lateral movement.
--target, -t Target inference host URL [required]
--method Persistence method: hook | escape | k8s-harvest [default: hook]
--output, -o Output directory [default: reports/]
--override UNLEASHED: required
--confirm-destroy UNLEASHED: confirms destructive live execution
- hook — Injects persistence hook into model serving process
- escape — Container escape via exposed GPU driver API (CUDA device file)
- k8s-harvest — Extract K8s service account tokens for cluster lateral movement
REPORT (foundry report)
Generates Ed25519-signed, SHA-256-hashed reports from all subsystem output. Produces JSON (WARLORD-compatible) and Markdown formats. Every finding includes CVE mapping, CVSS score, affected server/model, exploit chain description, and remediation recommendation.
--input, -i Input scan JSON from any subsystem [required]
--format, -f Output format: md, json, or both [default: both]
--sign Ed25519 sign the report [default: true]
--keys-dir Path to Ed25519 keys directory [optional]
--output, -o Output path [default: reports/foundry-report-<timestamp>]
- JSON report includes: CVE IDs, CVSS scores, affected server/model, exact exploit chain, remediation
- Ed25519 signature over SHA-256 hash of report content
- WARLORD-compatible schema — ingestible by WARLORD autonomous campaign engine
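The hashing step described above can be illustrated with a minimal sketch that computes a SHA-256 digest over a deterministic JSON encoding of a report. The canonicalisation used here (sorted keys, compact separators, UTF-8) and the helper name report_digest are assumptions for illustration, not FOUNDRY's documented behaviour. The Ed25519 signature would then be computed over this digest, for example with the cryptography package.

```python
import hashlib
import json


def report_digest(report: dict) -> bytes:
    """Return the SHA-256 digest of a deterministic JSON encoding of `report`."""
    # Assumption: sorted keys and compact separators make the byte stream
    # independent of dict insertion order, so signer and verifier agree.
    canonical = json.dumps(
        report, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()


# Hypothetical report fragments; the nested field names are illustrative only.
example = {"tool": "FOUNDRY", "findings": [{"id": "OLLAMA-NOAUTH", "cvss": 8.6}]}
reordered = {"findings": [{"cvss": 8.6, "id": "OLLAMA-NOAUTH"}], "tool": "FOUNDRY"}

digest = report_digest(example)
print(digest.hex())
```

Because the encoding is canonical, example and reordered hash to the same value, so a verifier that re-serialises the report does not invalidate the signature merely by changing key order.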
FOUNDRY UNLEASHED
Cryptographic override. Private key controlled. One operator. Founder's machine only.
Four subsystems are gated behind UNLEASHED: GGUF, TRITON, VLLM_PROBE, and SPECDECODE. A fifth, PERSIST, requires both --override and --confirm-destroy: it performs live post-exploitation, writes to the target, and covers container escape and K8s harvest.
CVE Index
Every finding FOUNDRY produces maps to a specific CVE or disclosure identifier:
| CVE / ID | Description | Subsystem | CVSS |
|---|---|---|---|
| CVE-2026-5760 | SGLang GGUF Jinja2 Template Injection — Remote Code Execution | GGUF | 9.8 CRITICAL |
| OLLAMA-NOAUTH | Ollama API unauthenticated model access — all versions, all endpoints | OLLAMA_AUDIT | 8.6 HIGH |
| VLLM-TIMING-001 | vLLM PagedAttention cross-tenant timing oracle — prompt/completion extraction | VLLM_PROBE | 7.5 HIGH |
| KUBEAI-RBAC-001 | KubeAI RBAC misconfiguration — service account escalation to cluster-admin | PERSIST | 8.8 HIGH |
WARLORD Integration
FOUNDRY is registered in the WARLORD autonomous campaign registry as Tool 55. FOUNDRY findings are exported in WARLORD-compatible JSON schema, enabling orchestration within multi-tool autonomous campaigns.
Running FOUNDRY via WARLORD
Report Schema (WARLORD-Compatible)
The FOUNDRY JSON report schema includes the following top-level fields:
- tool — "FOUNDRY" with version and tool number (55)
- target — enumerated inference server and loaded models
- findings — array of findings with CVE, CVSS, subsystem, impact, and remediation
- signature — Ed25519 signature over SHA-256 of findings array
- warlord_compatible — true — schema validated for WARLORD ingestion
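A consumer of the report might check the top-level shape before doing anything else. The sketch below is a minimal example that tests only for the five fields listed above; the helper names and the sample values are hypothetical, and the types of tool, target, and signature are assumptions, since only findings (an array) and warlord_compatible (true) are specified here.

```python
REQUIRED_FIELDS = ("tool", "target", "findings", "signature", "warlord_compatible")


def missing_fields(report: dict) -> list[str]:
    """Return the required top-level fields absent from `report`, in schema order."""
    return [name for name in REQUIRED_FIELDS if name not in report]


def looks_ingestible(report: dict) -> bool:
    """Check the top-level shape only; this does not verify the signature."""
    return (
        not missing_fields(report)
        and isinstance(report["findings"], list)
        and report["warlord_compatible"] is True
    )


# Hypothetical sample; the values are placeholders, not real FOUNDRY output.
sample = {
    "tool": {"name": "FOUNDRY", "version": "1.0.0", "number": 55},
    "target": {"server": "example", "models": []},
    "findings": [],
    "signature": "<base64 signature>",
    "warlord_compatible": True,
}

print(missing_fields(sample), looks_ingestible(sample))
```

A shape check like this is cheap to run before signature verification, but it is not a substitute for it.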
Troubleshooting
SCAN returns no server detected
The inference server may be running on a non-default port or behind a reverse proxy. Use --port to specify the port explicitly. Common inference server ports: Ollama 11434, vLLM 8000, Triton HTTP 8000, Triton gRPC 8001, SGLang 30000.
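The default ports listed above can be kept as a small lookup table when deciding which value to pass to --port. This is a sketch built only from the ports named in this section; the DEFAULT_PORTS and candidates_for_port names are hypothetical and not part of the FOUNDRY CLI.

```python
# Default ports as listed in this troubleshooting entry.
DEFAULT_PORTS: dict[int, list[str]] = {
    11434: ["Ollama"],
    8000: ["vLLM", "Triton HTTP"],
    8001: ["Triton gRPC"],
    30000: ["SGLang"],
}


def candidates_for_port(port: int) -> list[str]:
    """Return the servers that use `port` by default, or an empty list."""
    return DEFAULT_PORTS.get(port, [])


for port in sorted(DEFAULT_PORTS):
    print(port, ", ".join(candidates_for_port(port)))
```

Port 8000 is shared by vLLM and Triton HTTP, so an open port alone does not identify the server. A non-default port returns an empty list, which is the case where --port has to be supplied explicitly.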
GGUF returns "UNLEASHED key not found"
The Ed25519 private key is not in the expected location. Ensure ~/.redspecter/keys/foundry.key exists and matches the registered public key. Scope file must be present and signed by the same key: ~/.redspecter/scope/foundry-scope.json.
VLLM_PROBE timing analysis shows no signal
Timing side-channels require multiple concurrent sessions and many sampling rounds to produce statistically significant results. Increase --sessions to 20+ and --rounds to 500+. Results are only meaningful on shared multi-tenant vLLM deployments — single-user deployments will show no cross-tenant signal by definition.
OLLAMA_AUDIT shows authentication on all endpoints
The Ollama instance has been configured with a reverse proxy or custom auth middleware. This is the expected hardened state. OLLAMA_AUDIT will report the auth configuration as a positive finding (no vulnerability). Check whether the proxy strips auth headers selectively by testing specific endpoint paths directly.
REPORT fails to sign
Signing requires the cryptography package and a valid Ed25519 private key. Run foundry report --no-sign to generate an unsigned report. Unsigned reports are not WARLORD-compatible and will be rejected by WARLORD ingestion.
Disclaimer
Red Specter FOUNDRY is designed for authorised security testing, research, and educational purposes only. You must have explicit written permission from the system owner before running any FOUNDRY subsystem against a target. UNLEASHED subsystems (GGUF, TRITON, VLLM_PROBE, SPECDECODE, PERSIST) perform active exploitation and may cause service disruption or data modification on the target system. Unauthorised use may violate the Computer Misuse Act 1990 (UK), the Computer Fraud and Abuse Act (US), or equivalent legislation in your jurisdiction. The authors accept no liability for misuse. Apache License 2.0.