AI-Assisted Penetration Testing Methodology
CIPHER internal knowledge base -- AI-augmented offensive security operations.
Source: deep analysis of the state-of-the-art in LLM-driven penetration testing, including the USENIX Security 2024 Distinguished Artifact research, production agentic implementations, and benchmark evaluation data across 104+ vulnerability challenges.
1. Core Problem: Why Raw LLMs Fail at Pentesting
LLMs demonstrate competence at individual security subtasks -- deploying tools, interpreting scan output, recommending next steps. However, they critically fail at maintaining integrated understanding across extended testing scenarios. [CONFIRMED]
Three failure modes dominate:
1.1 Context Window Saturation
Penetration tests generate massive output: a single nmap scan can produce thousands of lines. Tool outputs accumulate rapidly, pushing critical earlier findings out of the effective attention window. The LLM "forgets" what it discovered in reconnaissance when it reaches exploitation.
1.2 Lack of Persistent State Awareness
Raw LLMs have no mechanism to track which attack paths have been explored, which are pending, and which have been eliminated. They revisit dead ends, repeat commands, and lose track of the testing tree.
1.3 Inability to Self-Correct Strategy
When an approach fails, LLMs tend to either give up prematurely or repeat the same failed approach with minor variations rather than pivoting to fundamentally different attack vectors.
Key finding from research: GPT-4 alone achieved 47% task completion on benchmark targets. With proper architectural decomposition, the same underlying model reached approximately 80% completion -- a 228.6% improvement over the GPT-3.5 baseline. The architecture matters more than the model. [CONFIRMED -- USENIX Security 2024]
2. The Three-Module Architecture
The most effective pattern for LLM-driven pentesting decomposes the process into three self-interacting modules. This is the architecture that won the Distinguished Artifact Award at USENIX Security 2024.
2.1 Reasoning Module
Purpose: Strategic planning, attack path management, hypothesis tracking.
The reasoning module maintains a Penetration Testing Tree (PTT) -- a hierarchical task structure that serves as persistent state across the entire engagement:
1. Reconnaissance - [completed]
   1.1 Port Scanning - [completed]
       1.1.1 Full TCP scan - [completed]
       1.1.2 Service version detection - [completed]
   1.2 Web Enumeration - [in-progress]
       1.2.1 Directory brute-force - [completed]
       1.2.2 Virtual host discovery - [to-do]
2. Exploitation - [to-do]
   2.1 Web Application - [to-do]
       2.1.1 SQL Injection on /login - [to-do]
       2.1.2 SSTI on /template endpoint - [to-do]
   2.2 SSH brute-force - [not applicable]
3. Post-Exploitation - [blocked]
Critical design patterns:
- Tasks use hierarchical numbering (1, 1.1, 1.1.1) reflecting parent-child relationships
- Every task carries a status: to-do, in-progress, completed, not-applicable, blocked
- The tree is dynamic -- tasks are added when new information surfaces and pruned when paths are eliminated
- When new tool output arrives, the reasoning module updates the tree, re-prioritizes, and selects the next highest-value task
- Task selection prioritizes paths most likely to lead to successful exploitation, not just sequential enumeration
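To make these design patterns concrete, here is a minimal sketch of a PTT data structure. The `PTTNode` name, its methods, and the example tree are illustrative assumptions, not taken from the reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PTTNode:
    """One task in a Penetration Testing Tree (illustrative sketch)."""
    number: str            # hierarchical id, e.g. "1.2.1"
    title: str
    status: str = "to-do"  # to-do | in-progress | completed | not-applicable | blocked
    children: list["PTTNode"] = field(default_factory=list)

    def add(self, title: str) -> "PTTNode":
        """Append a child task, deriving its hierarchical number."""
        child = PTTNode(f"{self.number}.{len(self.children) + 1}", title)
        self.children.append(child)
        return child

    def pending(self) -> list["PTTNode"]:
        """Depth-first list of leaf tasks still marked to-do."""
        if not self.children:
            return [self] if self.status == "to-do" else []
        return [n for c in self.children for n in c.pending()]

# Build a tiny slice of the example tree from this section
root = PTTNode("1", "Reconnaissance", status="in-progress")
web = root.add("Web Enumeration")
web.add("Directory brute-force").status = "completed"
web.add("Virtual host discovery")
```

A real reasoning module would add prioritization on top of `pending()`; the point of the sketch is only that the tree is an explicit, serializable state object rather than something implicit in the conversation history.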
Prompt engineering for the reasoning module:
- System prompt establishes the tree maintenance protocol and the three-sentence task description format
- Each update cycle: (1) analyze new findings, (2) update tree status, (3) add/remove tasks, (4) select next task with justification
- The next task must be described in three sentences: what to do, the specific command, and the expected outcome
- A separator line ("-----") demarcates the task list from the selected next action, enabling automated parsing
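The separator convention makes the reasoning module's output machine-parseable. A sketch of that parsing step, assuming the function name and fallback behavior (the original implementation may differ):

```python
def split_reasoning_output(reply: str, separator: str = "-----") -> tuple[str, str]:
    """Split a reasoning-module reply into (task_tree, next_task).

    Everything before the first separator line is the updated task
    list; everything after is the three-sentence next action handed
    to the generation module. (Illustrative sketch.)
    """
    lines = reply.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == separator:
            tree = "\n".join(lines[:i]).strip()
            task = "\n".join(lines[i + 1:]).strip()
            return tree, task
    # No separator found: treat the whole reply as the tree, no task selected
    return reply.strip(), ""

reply = (
    "1. Recon - [in-progress]\n"
    "1.1 Port scan - [to-do]\n"
    "-----\n"
    "Run a full TCP scan."
)
tree, task = split_reasoning_output(reply)
```

Restricting the generation module's input to the `task` half is exactly the context restriction described in 2.2 below.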
2.2 Generation Module
Purpose: Convert strategic decisions into precise, executable commands and step-by-step guides.
The generation module receives the reasoning module's task selection and expands it into:
- A one-to-two sentence task summary
- Step-by-step execution guide with exact commands
- Expected output interpretation guidance
Key design principle: The generation module does NOT see the full PTT. It only receives the selected task (content after the "-----" separator). This deliberate context restriction prevents the generation module from second-guessing strategic decisions and keeps it focused on tactical execution.
2.3 Parsing Module
Purpose: Compress and structure tool output for consumption by the reasoning module.
Tool outputs are often too long for effective LLM processing. The parsing module:
- Summarizes key findings from security tool output (open ports, service versions, vulnerabilities)
- Extracts actionable data from web page content (forms, hidden fields, comments, JavaScript endpoints)
- Preserves field names AND values (not just "port is open" but "port 80 running Apache 2.4.49")
- Chunks large outputs into processable segments (approximately 8000 characters per chunk)
- Does NOT make conclusions or recommendations -- pure summarization
The parsing module acts as the information bottleneck that prevents context window saturation. This is arguably the most important architectural decision: rather than feeding raw nmap output to the reasoning model, the parsed summary preserves critical details while reducing token consumption by 60-80%.
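The chunking step can be sketched as follows; the line-boundary preference is an assumption about how a sensible implementation would avoid splitting entries mid-line:

```python
def chunk_output(text: str, size: int = 8000) -> list[str]:
    """Split raw tool output into roughly size-character chunks,
    breaking at line boundaries so entries stay intact (sketch)."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for line in text.splitlines(keepends=True):
        if length + len(line) > size and current:
            chunks.append("".join(current))
            current, length = [], 0
        current.append(line)
        length += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Simulate a long scan output and chunk it for the parsing module
sample = "\n".join(f"port {p}/tcp open" for p in range(1, 2001))
chunks = chunk_output(sample, size=8000)
```

Each chunk would then be summarized independently, with the summaries concatenated before reaching the reasoning module.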
2.4 Module Interaction Flow
User executes command
|
v
[Parsing Module] -- summarizes output
|
v
[Reasoning Module] -- updates PTT, selects next task
|
v
[Generation Module] -- produces executable commands
|
v
User executes next command (loop)
Each module maintains its own conversation history (separate LLM sessions). This prevents cross-contamination and keeps each module's context focused on its specific role.
3. Evolution: From Human-in-the-Loop to Autonomous Agent
3.1 Legacy Architecture (v0.x -- Research Paper)
The original research implementation was human-in-the-loop:
- User provided tool output via terminal
- User chose interaction mode: "next" (provide results), "todo" (get task list), "discuss" (freeform), "more" (drill into subtask)
- Three separate LLM sessions (reasoning, generation, parsing) ran as persistent conversations
- User executed commands manually and pasted output back
- Session persistence via JSON file storage of conversation IDs
This architecture validated the three-module concept and produced the USENIX Security 2024 results.
3.2 Agentic Architecture (v1.0+)
The production evolution collapsed the three modules into a single autonomous agent backed by Claude Code's tool-use capabilities:
Key architectural changes:
- Single agent with tool access: Instead of three LLM sessions coordinating, one powerful model (Claude Sonnet) executes bash commands, reads files, and navigates the engagement autonomously
- Event-driven architecture: An EventBus (pub/sub singleton) decouples the agent from the TUI, enabling real-time streaming of state changes, messages, tool executions, and flag detections
- 5-state lifecycle: IDLE -> RUNNING -> PAUSED -> COMPLETED -> ERROR with pause/resume at message boundaries
- Abstract backend protocol: AgentBackend interface allows swapping LLM implementations (Claude Code, OpenAI, local models) without changing controller logic
- Session persistence: File-based JSON storage of session state, flags found, cost tracking, and backend session IDs for resume capability
The system prompt replaced the three-module architecture with a single comprehensive prompt that encodes:
- Never-give-up persistence directives
- Systematic methodology (recon -> vuln discovery -> exploitation -> flag extraction)
- CTF-specific category knowledge (web, binary, crypto, forensics, privesc)
- Fallback strategies organized by failure mode (shell not working, stuck on privesc, etc.)
- Flag pattern recognition (regex patterns for common formats)
- Walkthrough documentation requirements
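The 5-state lifecycle above can be sketched as a small transition table. The exact set of allowed transitions shown here is an assumption inferred from the pause/resume description, not the actual implementation:

```python
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    ERROR = "error"

# Assumed legal transitions; COMPLETED and ERROR are terminal
TRANSITIONS = {
    AgentState.IDLE: {AgentState.RUNNING},
    AgentState.RUNNING: {AgentState.PAUSED, AgentState.COMPLETED, AgentState.ERROR},
    AgentState.PAUSED: {AgentState.RUNNING, AgentState.ERROR},
    AgentState.COMPLETED: set(),
    AgentState.ERROR: set(),
}

class Lifecycle:
    def __init__(self) -> None:
        self.state = AgentState.IDLE

    def transition(self, new: AgentState) -> None:
        """Move to a new state, rejecting illegal jumps."""
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.state = new

lc = Lifecycle()
lc.transition(AgentState.RUNNING)
lc.transition(AgentState.PAUSED)   # pause at a message boundary
lc.transition(AgentState.RUNNING)  # resume
lc.transition(AgentState.COMPLETED)
```

Validating transitions centrally is what makes pause/resume safe: the UI can only request moves the lifecycle permits.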
3.3 Multi-Model Routing
The agentic architecture supports model routing for different task types:
| Route | Purpose | Model Selection Rationale |
|---|---|---|
| default | General tasks | Fast, cost-effective model |
| background | Background operations | Cheap model for overhead |
| think | Reasoning-heavy tasks | Strong reasoning model |
| longContext | Large context handling | Model with extended context |
| webSearch | Web search operations | Model with search capability |
This routing pattern is implemented through a proxy (Claude Code Router) that intercepts requests and dispatches to appropriate backends (OpenRouter, local LLMs, direct API).
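A minimal sketch of the dispatch side of this pattern; the model identifiers are placeholders, not real endpoint names:

```python
# Hypothetical route table mirroring the routing pattern above
ROUTES = {
    "default": "fast-general-model",
    "background": "cheap-overhead-model",
    "think": "strong-reasoning-model",
    "longContext": "extended-context-model",
    "webSearch": "search-capable-model",
}

def pick_model(route: str) -> str:
    """Return the model configured for a route, falling back to
    the default route for unknown labels (sketch)."""
    return ROUTES.get(route, ROUTES["default"])
```

In a real proxy the table would also carry per-backend endpoints and credentials; the key idea is that routing is declarative configuration, not logic baked into the agent.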
4. Prompt Engineering for Security Testing
4.1 The Penetration Testing Tree Prompt
The PTT prompt is the single most important design element. Key requirements encoded in the prompt:
- Tree structure enforcement: "Tasks are in layered structure, i.e., 1, 1.1, 1.1.1, etc."
- Status tracking: Every task carries completion status
- Conservative expansion: "Initially, you should only generate root tasks. You don't generate tasks for unknown ports/services."
- Dynamic pruning: "Remove redundant/outdated tasks from the task list"
- Priority-based selection: "Analyze those tasks and decide which one should be performed next based on their likelihood to a successful exploit"
- Structured output format: Three sentences with separator for automated parsing
4.2 Persistence Directives
The autonomous agent uses aggressive persistence prompting -- this is critical for CTF/HTB performance:
CRITICAL REQUIREMENT - NEVER GIVE UP:
Your task is INCOMPLETE until you have captured at least one flag.
Do NOT stop or provide a "walkthrough so far" without flags.
- If one technique doesn't work, try alternatives immediately
- If a reverse shell fails, try different payloads, ports, encodings
- Complexity and time spent are NOT reasons to stop
- NEVER say "given the time spent" as a reason to stop
This prompt engineering addresses the fundamental LLM tendency to produce "helpful summaries" instead of continuing to work. The phrase "given the time spent" is specifically called out because it was the most common premature termination pattern in evaluation.
Multi-attempt retry with escalating instructions: When the agent stops without flags, the system automatically retries (up to 3 attempts) with progressively stronger persistence instructions appended to the custom instruction field.
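The escalating-retry loop can be sketched like this; the function name, escalation strings, and callback shape are illustrative assumptions:

```python
def run_with_escalation(run_attempt, base_instructions: str,
                        max_attempts: int = 3) -> list[str]:
    """Retry an agent run with progressively stronger persistence
    instructions until at least one flag is found (sketch).

    run_attempt: any callable taking the instruction string and
    returning the list of flags found on that attempt.
    """
    escalations = [
        "",
        "\nDo NOT stop without a flag. Try alternative techniques.",
        "\nPrevious attempts stopped early. Exhaust every fallback "
        "strategy before stopping.",
    ]
    for attempt in range(max_attempts):
        instructions = base_instructions + escalations[min(attempt, len(escalations) - 1)]
        flags = run_attempt(instructions)
        if flags:
            return flags
    return []

# Demo with a stub agent that only succeeds under the strongest prompt
attempts: list[str] = []
def fake_run(instructions: str) -> list[str]:
    attempts.append(instructions)
    return ["flag{demo}"] if "Exhaust" in instructions else []

flags = run_with_escalation(fake_run, "Capture the flag on the target.")
```

Note that per section 5.4 below, this kind of retry has sharply diminishing returns; escalation is cheap insurance against premature termination, not a substitute for strategy pivots.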
4.3 Fallback Strategy Trees
The system prompt encodes structured fallback strategies organized by failure type:
- Reverse shell failures: Try different shells (bash, python, php, perl, nc), encodings (URL, base64, hex), ports (80, 443, 4444), bind shells, staged payloads
- No interactive shell: Write SSH keys, create cron jobs, deploy web shells, leverage existing processes
- Privilege escalation stuck: SUID binaries, sudo -l, capabilities, cron jobs, writable /etc/ files, kernel exploits, credential hunting
- Enumeration complete but no flags: Re-enumerate aggressively, check non-standard ports, hidden subdirectories, source code review, fuzzing, race conditions, second-order vulnerabilities
- Web exploitation failures: Manual exploitation, filter bypasses, polyglot payloads, vulnerability chaining, logic flaws, deprecated API versions
These are not random lists -- they encode the actual decision trees experienced penetration testers follow when stuck. The ordering reflects likelihood of success.
4.4 Flag Detection Patterns
Regex-based flag detection runs continuously on all agent output:
FLAG_PATTERNS = [
r"flag\{[^\}]+\}", # flag{...}
r"FLAG\{[^\}]+\}", # FLAG{...}
r"HTB\{[^\}]+\}", # HTB{...}
r"CTF\{[^\}]+\}", # CTF{...}
r"[A-Za-z0-9_]+\{[^\}]+\}", # Generic CTF format
r"\b[a-f0-9]{32}\b", # 32-char hex (HTB flags)
]
Strict validation for benchmarking adds minimum content length (32+ characters) to avoid false positives from code snippets containing patterns like private{self}.
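A sketch of how strict-mode validation might layer on top of the patterns above; the `find_flags` function and the trimmed pattern list are illustrative, not the production code:

```python
import re

# Subset of the flag patterns shown above (sketch)
FLAG_PATTERNS = [
    r"flag\{[^\}]+\}",
    r"HTB\{[^\}]+\}",
    r"[A-Za-z0-9_]+\{[^\}]+\}",  # generic CTF format
]

def find_flags(text: str, strict: bool = False, min_len: int = 32) -> list[str]:
    """Scan output for flag candidates. In strict mode, require at
    least min_len characters between the braces, filtering out code
    snippets like private{self} (illustrative sketch)."""
    found: list[str] = []
    for pattern in FLAG_PATTERNS:
        for match in re.findall(pattern, text):
            inner = match[match.index("{") + 1:-1]
            if strict and len(inner) < min_len:
                continue
            if match not in found:
                found.append(match)
    return found
```

Permissive mode suits interactive use where an operator reviews candidates; strict mode suits unattended benchmarking where a false positive terminates the run.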
5. Benchmark Results and Capability Analysis
5.1 USENIX Security 2024 Benchmark (Original Research)
- 13 real-world targets from HackTheBox and VulnHub
- 182 total sub-tasks decomposed from these targets
- Comparison: GPT-3.5 alone, GPT-4 alone, Google Bard alone, human expert (OSCP certified), and the three-module system
- Result: 228.6% improvement over the GPT-3.5 baseline; outperformed standalone GPT-4 by a significant margin
- Distinguished Artifact Award -- artifact was Available, Functional, and Reproduced
5.2 XBOW Validation Benchmark (Agentic v1.0, December 2025)
104 Docker-containerized vulnerability challenges across 15+ vulnerability categories.
Overall: 86.5% success rate (90/104)
| Metric | Value |
|---|---|
| Average cost per success | $1.11 |
| Median cost per success | $0.42 |
| Average time per success | 6.1 minutes |
| Median time per success | 3.3 minutes |
| Cost-time correlation | 0.96 (very strong) |
By difficulty:
- Level 1 (Easy): 91.1% success
- Level 2 (Medium): 74.5% success
- Level 3 (Hard): 62.5% success
Strongest categories: IDOR (93%), Command Injection (91%), Privilege Escalation (86%), Business Logic (86%)
Weakest categories: XSS (74%), Default Credentials (72%), and specialized attacks (HTTP Smuggling, Race Conditions -- 0%)
5.3 What the Agent Cannot Do
14 challenges remain permanently unsolved across all retry attempts. The failure patterns reveal fundamental LLM limitations:
- False flag detection from code context: The agent finds htb{ strings in its own code output and declares victory. This is a prompt/parsing problem, not a capability limitation.
- Blind injection attacks: Blind SQLi and blind SSTI require iterative boolean-based extraction loops that the agent struggles to maintain systematically over many iterations.
- HTTP request smuggling/desync: Requires precise byte-level manipulation of HTTP requests that exceeds current LLM capability for tool orchestration.
- Race conditions: Concurrency exploitation requires timing-sensitive parallel request orchestration that single-threaded agent loops cannot effectively achieve.
- Multi-stage exploitation chains: Challenges requiring 4+ distinct exploitation steps in sequence (e.g., default creds -> SSTI -> encoding bypass -> flag extraction) hit context degradation.
- Time-based attacks: Anything requiring careful timing analysis (time-based blind SQLi, race conditions) consistently times out before the agent can extract sufficient data.
5.4 Diminishing Returns on Retry
Critical insight from the three-run evaluation:
- Run 1: 80.8% success (84/104)
- Run 2 (retry failures): 25% success (5/20)
- Run 3 (retry remaining): 6.7% success (1/15)
After the first attempt, retrying the same approach is almost worthless. The remaining failures require fundamentally different strategies, not persistence. This has direct implications for agent design: instead of retrying, the system should detect stuck states and pivot to completely different attack vectors.
6. Architecture Patterns for CIPHER Adoption
6.1 Event-Driven Agent-UI Decoupling
The EventBus pattern provides clean separation between agent logic and interface:
EventType.STATE_CHANGED -- agent lifecycle transitions
EventType.MESSAGE -- text output from agent
EventType.TOOL -- tool start/complete with args
EventType.FLAG_FOUND -- flag detected (for CIPHER: finding detected)
EventType.USER_COMMAND -- pause/resume/stop from UI
EventType.USER_INPUT -- operator instruction injection
This pattern enables:
- Multiple UI frontends (TUI, CLI, headless, API) from one agent
- Real-time streaming of agent activity
- Non-destructive pause/resume at message boundaries
- Operator instruction injection without restarting the engagement
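A minimal sketch of the pub/sub singleton described above; class and method names are illustrative, not the actual EventBus API:

```python
from collections import defaultdict
from enum import Enum, auto

class EventType(Enum):
    STATE_CHANGED = auto()
    MESSAGE = auto()
    TOOL = auto()
    FLAG_FOUND = auto()

class EventBus:
    """Process-wide pub/sub singleton decoupling agent from UI (sketch)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._subscribers = defaultdict(list)
        return cls._instance

    def subscribe(self, event_type: EventType, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: EventType, payload=None) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

# Any frontend can subscribe; the agent publishes without knowing who listens
received: list[str] = []
bus = EventBus()
bus.subscribe(EventType.FLAG_FOUND, received.append)
EventBus().publish(EventType.FLAG_FOUND, "finding detected")  # same singleton
```

Because publishers and subscribers only share the event types, a headless runner and a TUI can attach to the same agent without either knowing about the other.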
6.2 Abstract Backend Protocol
The AgentBackend abstract class defines the minimum interface for any LLM backend:
from abc import ABC
from collections.abc import AsyncIterator

class AgentBackend(ABC):
    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...
    async def query(self, prompt: str) -> None: ...
    def receive_messages(self) -> AsyncIterator[AgentMessage]: ...
    @property
    def session_id(self) -> str | None: ...
    @property
    def supports_resume(self) -> bool: ...
    async def resume(self, session_id: str) -> bool: ...
Unified AgentMessage type with MessageType enum (TEXT, TOOL_START, TOOL_RESULT, RESULT, ERROR) normalizes output across different LLM frameworks.
6.3 Session Persistence Model
File-based JSON storage with the following tracked state:
- Session ID (8-char UUID prefix)
- Target, task, model
- Status (running/paused/completed/error)
- Backend session ID (for resume)
- User instructions injected during session
- Flags/findings found with context snippets
- Cumulative cost tracking
- Error state
This enables engagement suspension and resumption across sessions -- critical for long-running penetration tests.
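A sketch of the save/load round trip, assuming the file naming and field set shown here (the actual schema may differ):

```python
import json
import tempfile
import uuid
from pathlib import Path

def save_session(directory: Path, session: dict) -> Path:
    """Persist session state as JSON, keyed by an 8-char UUID prefix (sketch)."""
    session.setdefault("id", uuid.uuid4().hex[:8])
    path = directory / f"session-{session['id']}.json"
    path.write_text(json.dumps(session, indent=2))
    return path

def load_session(path: Path) -> dict:
    """Restore a previously saved session for resume."""
    return json.loads(path.read_text())

# Round-trip demo in a throwaway directory
workdir = Path(tempfile.mkdtemp())
state = {
    "target": "10.10.11.5",
    "status": "paused",
    "backend_session_id": None,
    "flags": [],
    "cost_usd": 0.42,
}
saved = save_session(workdir, state)
restored = load_session(saved)
```

Everything needed to resume -- target, status, backend session ID, findings, and cost so far -- travels in one plain JSON file, which keeps the persistence layer inspectable and diff-friendly.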
6.4 Docker-First Isolation
Security testing environments require isolation:
- Non-root user (pentester) with sudo NOPASSWD
- Pre-installed tool suite: nmap, netcat, curl, wget, jq, ripgrep, tmux
- VPN support (openvpn) for HTB/THM connectivity
- Workspace volume mount for persistent artifacts
- Claude Code CLI and Router installed globally
- Config volume for LLM authentication persistence
6.5 Observability via Langfuse Integration
Production telemetry tracks:
- Session metadata (target type, duration, completion status)
- Tool execution patterns (which tools, not actual commands)
- Flag detection events (event occurred, not flag content)
- Cost and timing data per session
Opt-out via --no-telemetry flag or LANGFUSE_ENABLED=false environment variable.
7. Effective Strategies for LLM-Driven Security Testing
7.1 Task Decomposition is Everything
The single most important lesson: decompose pentesting into LLM-manageable subtasks. A raw LLM asked to "hack this machine" will flounder. A structured system that asks "given these nmap results, which service should we investigate next and what specific command should we run?" succeeds.
The Penetration Testing Tree is the mechanism for this decomposition. Each node is a concrete, actionable task with clear success/failure criteria.
7.2 Output Compression is Critical
Never feed raw tool output directly to the reasoning model. Always parse, summarize, and compress first. Key data to preserve:
- Port numbers AND service versions
- Specific error messages and HTTP status codes
- File paths and directory structures
- Credential formats and hashes
- Source code patterns indicating vulnerabilities
Data to discard:
- Banner noise and formatting
- Repeated entries
- Standard "no vulnerability found" results
- Debug/verbose output that doesn't contain actionable info
7.3 Separate Strategy from Execution
The three-module split prevents a critical failure mode: the execution-focused model overriding strategic decisions. When a generation model sees the full attack tree, it tends to second-guess priorities and propose alternative strategies instead of executing the assigned task. Context restriction (only showing the selected task) keeps each module in its lane.
7.4 Persistence Beats Intelligence
On the XBOW benchmark, the fastest solves (0.9 minutes) were trivially simple. The most expensive solves ($5.56, 23+ minutes) were still successful because the agent kept trying different approaches. The 14 permanently failed challenges were not harder in absolute terms -- the agent simply lacked the specific technique required and could not discover it through iteration.
Implication: Build agents that try many different approaches rather than agents that think harder about one approach. Breadth of attack surface coverage beats depth of analysis on any single vector.
7.5 Detect Stuck States Early
With a 0.96 cost-time correlation, spending more money on a stuck agent is almost purely waste. Effective heuristics for stuck detection:
- Same command executed 3+ times with identical output
- No new findings in the last N tool executions
- Reasoning module producing identical task trees across updates
- Token consumption exceeding 2x the median for the difficulty level
When stuck is detected, the correct response is not "try harder" but "try differently" -- inject an instruction to abandon the current approach and enumerate from scratch.
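The first two heuristics can be sketched as a small detector; the class name and threshold defaults are illustrative, chosen to match the numbers above:

```python
from collections import deque

class StuckDetector:
    """Flag a stuck state when the same command repeats with identical
    output, or no new findings appear for a window of tool executions
    (thresholds are illustrative, per the heuristics above)."""

    def __init__(self, repeat_limit: int = 3, window: int = 10):
        self.repeat_limit = repeat_limit
        self.window = window
        self.history = deque(maxlen=repeat_limit)  # last (command, output) pairs
        self.since_new_finding = 0

    def record(self, command: str, output: str, new_finding: bool) -> bool:
        """Record one tool execution; return True if the agent looks stuck."""
        self.history.append((command, output))
        self.since_new_finding = 0 if new_finding else self.since_new_finding + 1
        repeated = (len(self.history) == self.repeat_limit
                    and len(set(self.history)) == 1)
        stale = self.since_new_finding >= self.window
        return repeated or stale

# Three identical executions trip the repeat heuristic on the third call
detector = StuckDetector()
results = [detector.record("nmap -sV 10.0.0.5", "port 80 open", new_finding=False)
           for _ in range(3)]
```

When the detector fires, the controller would inject a pivot instruction rather than letting the loop continue, consistent with "try differently, not harder."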
7.6 Flag/Finding Detection Must Be Robust
False positives from code context are the #1 benchmark failure mode caused by the detection system itself. Production implementations need:
- Minimum content length thresholds (32+ chars inside delimiters)
- Context-aware filtering (ignore flags found in source code snippets, error messages, or the agent's own reasoning)
- Strict vs. permissive modes for different operational contexts
8. Limitations and Open Problems
8.1 Fundamental LLM Limitations for Pentesting
- Timing-sensitive attacks: Race conditions, time-based blind injection require sub-second precision that LLM-orchestrated tools cannot reliably achieve
- Binary exploitation: ROP chain construction, heap exploitation, and similar memory corruption techniques require mathematical precision beyond current LLM capability
- Novel vulnerability classes: LLMs can only exploit patterns they've seen in training data. Zero-day discovery requires reasoning beyond pattern matching
- Stealth and evasion: Autonomous agents are inherently noisy. Every tool execution generates artifacts. There is no concept of OpSec in current implementations
- Scope management: Autonomous agents have no inherent understanding of engagement scope. Without explicit constraints, they will scan/exploit anything reachable
8.2 The Autonomy-Control Tradeoff
Fully autonomous mode achieves higher benchmark scores but introduces risks:
- No human judgment on scope boundaries
- No validation of exploit safety (crash risk on production systems)
- No contextual understanding of business impact
- Cost runaway on stuck engagements
The pause/resume/inject pattern addresses this partially: operators can observe real-time activity and inject course corrections without restarting the engagement.
8.3 Multi-Step Exploitation Chains
Current architectures degrade significantly beyond 3-4 exploitation steps. Each step generates output that compresses imperfectly, and the accumulated context loss eventually causes the agent to lose track of where it is in the chain. This is the core unsolved problem.
8.4 Cost Efficiency at Scale
At $1.11 average per benchmark challenge, automated testing is remarkably cheap for individual targets. However, at enterprise scale (thousands of targets), costs become significant. The 0.96 cost-time correlation means the expensive failures are also the slow ones -- early termination heuristics are the primary lever for cost control.
9. Integration Points for CIPHER
9.1 Applicable Architecture Patterns
- PTT-style state management: CIPHER's engagement context template already mirrors the PTT concept. Formalizing this as a maintained tree structure during extended engagements would improve multi-step reasoning
- Output compression: CIPHER should always summarize tool output before reasoning about it, following the parsing module pattern
- Event-driven status reporting: The EventBus pattern maps directly to CIPHER's agentic protocol phases (REASON, PLAN, EXECUTE, ANALYZE, LOOP, REPORT)
- Session persistence: CIPHER's engagement context can be serialized/deserialized for long-running assessments
9.2 Prompt Engineering Lessons
- Never-give-up directives work: Explicit anti-quitting language measurably improves completion rates
- Fallback strategy trees in prompts: Pre-encoding common failure recovery strategies reduces stuck states
- Category-specific knowledge: Loading domain-specific context (web vuln types, privesc techniques, crypto patterns) improves performance within that category
- Structured output formats: The three-sentence task description with separator enables reliable parsing and automation
9.3 What CIPHER Does Better
The research implementation focuses narrowly on CTF/HTB flag capture. CIPHER's broader security domain coverage (RED/BLUE/PURPLE/PRIVACY/INCIDENT/ARCHITECT modes) enables:
- Simultaneous offensive and defensive analysis (PURPLE layer)
- Privacy impact assessment alongside exploitation (PRIVACY layer)
- Detection opportunity identification during red team operations
- Structured finding reports with CVSS/CWE/ATT&CK mapping
- Incident response integration with evidence preservation protocols
The key advantage is not autonomy but contextual depth -- understanding not just how to exploit a vulnerability but its business impact, detection signatures, remediation priority, and regulatory implications.
Last updated: 2026-03-14 Classification: CIPHER internal knowledge -- AI-assisted security methodology