AI-Assisted Penetration Testing Methodology
CIPHER internal knowledge base -- AI-augmented offensive security operations.
Source: deep analysis of the state-of-the-art in LLM-driven penetration testing, including the USENIX Security 2024 Distinguished Artifact research, production agentic implementations, and benchmark evaluation data across 104+ vulnerability challenges.
1. Core Problem: Why Raw LLMs Fail at Pentesting
LLMs demonstrate competence at individual security subtasks -- deploying tools, interpreting scan output, recommending next steps. However, they critically fail at maintaining integrated understanding across extended testing scenarios. [CONFIRMED]
Three failure modes dominate:
1.1 Context Window Saturation
Penetration tests generate massive output: a single nmap scan can produce thousands of lines. Tool outputs accumulate rapidly, pushing critical earlier findings out of the effective attention window. The LLM "forgets" what it discovered in reconnaissance when it reaches exploitation.
1.2 Lack of Persistent State Awareness
Raw LLMs have no mechanism to track which attack paths have been explored, which are pending, and which have been eliminated. They revisit dead ends, repeat commands, and lose track of the testing tree.
1.3 Inability to Self-Correct Strategy
When an approach fails, LLMs tend to either give up prematurely or repeat the same failed approach with minor variations rather than pivoting to fundamentally different attack vectors.
Key finding from research: GPT-4 alone achieved 47% task completion on benchmark targets. With proper architectural decomposition, the same underlying model reached approximately 80% completion -- a 228.6% improvement over the GPT-3.5 baseline. The architecture matters more than the model. [CONFIRMED -- USENIX Security 2024]
2. The Three-Module Architecture
The most effective pattern for LLM-driven pentesting decomposes the process into three self-interacting modules. This is the architecture that won the Distinguished Artifact Award at USENIX Security 2024.
2.1 Reasoning Module
Purpose: Strategic planning, attack path management, hypothesis tracking.
The reasoning module maintains a Penetration Testing Tree (PTT) -- a hierarchical task structure that serves as persistent state across the entire engagement:
1. Reconnaissance - [completed]
   1.1 Port Scanning - [completed]
       1.1.1 Full TCP scan - [completed]
       1.1.2 Service version detection - [completed]
   1.2 Web Enumeration - [in-progress]
       1.2.1 Directory brute-force - [completed]
       1.2.2 Virtual host discovery - [to-do]
2. Exploitation - [to-do]
   2.1 Web Application - [to-do]
       2.1.1 SQL Injection on /login - [to-do]
       2.1.2 SSTI on /template endpoint - [to-do]
   2.2 SSH brute-force - [not applicable]
3. Post-Exploitation - [blocked]
Critical design patterns:
- Tasks use hierarchical numbering (1, 1.1, 1.1.1) reflecting parent-child relationships
- Every task carries a status: to-do, in-progress, completed, not-applicable, blocked
- The tree is dynamic -- tasks are added when new information surfaces and pruned when paths are eliminated
- When new tool output arrives, the reasoning module updates the tree, re-prioritizes, and selects the next highest-value task
- Task selection prioritizes paths most likely to lead to successful exploitation, not just sequential enumeration
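To make these design patterns concrete, here is a minimal sketch of a PTT data structure. The `PTTNode` name, its methods, and the example tree are illustrative assumptions, not taken from the reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PTTNode:
    """One task in a Penetration Testing Tree (illustrative sketch)."""
    number: str            # hierarchical id, e.g. "1.2.1"
    title: str
    status: str = "to-do"  # to-do | in-progress | completed | not-applicable | blocked
    children: list["PTTNode"] = field(default_factory=list)

    def add(self, title: str) -> "PTTNode":
        """Append a child task, deriving its hierarchical number."""
        child = PTTNode(f"{self.number}.{len(self.children) + 1}", title)
        self.children.append(child)
        return child

    def pending(self) -> list["PTTNode"]:
        """Depth-first list of leaf tasks still marked to-do."""
        if not self.children:
            return [self] if self.status == "to-do" else []
        return [n for c in self.children for n in c.pending()]

# Build a tiny slice of the example tree from this section
root = PTTNode("1", "Reconnaissance", status="in-progress")
web = root.add("Web Enumeration")
web.add("Directory brute-force").status = "completed"
web.add("Virtual host discovery")
```

A real reasoning module would add prioritization on top of `pending()`; the point of the sketch is only that the tree is an explicit, serializable state object rather than something implicit in the conversation history.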
Prompt engineering for the reasoning module:
- System prompt establishes the tree maintenance protocol and the three-sentence task description format
- Each update cycle: (1) analyze new findings, (2) update tree status, (3) add/remove tasks, (4) select next task with justification
- The next task must be described in three sentences: what to do, the specific command, and the expected outcome
- A separator line ("-----") demarcates the task list from the selected next action, enabling automated parsing
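The separator convention makes the reasoning module's output machine-parseable. A sketch of that parsing step, assuming the function name and fallback behavior (the original implementation may differ):

```python
def split_reasoning_output(reply: str, separator: str = "-----") -> tuple[str, str]:
    """Split a reasoning-module reply into (task_tree, next_task).

    Everything before the first separator line is the updated task
    list; everything after is the three-sentence next action handed
    to the generation module. (Illustrative sketch.)
    """
    lines = reply.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == separator:
            tree = "\n".join(lines[:i]).strip()
            task = "\n".join(lines[i + 1:]).strip()
            return tree, task
    # No separator found: treat the whole reply as the tree, no task selected
    return reply.strip(), ""

reply = (
    "1. Recon - [in-progress]\n"
    "1.1 Port scan - [to-do]\n"
    "-----\n"
    "Run a full TCP scan."
)
tree, task = split_reasoning_output(reply)
```

Restricting the generation module's input to the `task` half is exactly the context restriction described in 2.2 below.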
2.2 Generation Module
Purpose: Convert strategic decisions into precise, executable commands and step-by-step guides.
The generation module receives the reasoning module's task selection and expands it into:
- A one-to-two sentence task summary
- Step-by-step execution guide with exact commands
- Expected output interpretation guidance
Key design principle: The generation module does NOT see the full PTT. It only receives the selected task (content after the "-----" separator). This deliberate context restriction prevents the generation module from second-guessing strategic decisions and keeps it focused on tactical execution.
2.3 Parsing Module
Purpose: Compress and structure tool output for consumption by the reasoning module.
Tool outputs are often too long for effective LLM processing. The parsing module:
- Summarizes key findings from security tool output (open ports, service versions, vulnerabilities)
- Extracts actionable data from web page content (forms, hidden fields, comments, JavaScript endpoints)
- Preserves field names AND values (not just "port is open" but "port 80 running Apache 2.4.49")
- Chunks large outputs into processable segments (approximately 8000 characters per chunk)
- Does NOT make conclusions or recommendations -- pure summarization
The parsing module acts as the information bottleneck that prevents context window saturation. This is arguably the most important architectural decision: rather than feeding raw nmap output to the reasoning model, the parsed summary preserves critical details while reducing token consumption by 60-80%.
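The chunking step can be sketched as follows; the line-boundary preference is an assumption about how a sensible implementation would avoid splitting entries mid-line:

```python
def chunk_output(text: str, size: int = 8000) -> list[str]:
    """Split raw tool output into roughly size-character chunks,
    breaking at line boundaries so entries stay intact (sketch)."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for line in text.splitlines(keepends=True):
        if length + len(line) > size and current:
            chunks.append("".join(current))
            current, length = [], 0
        current.append(line)
        length += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Simulate a long scan output and chunk it for the parsing module
sample = "\n".join(f"port {p}/tcp open" for p in range(1, 2001))
chunks = chunk_output(sample, size=8000)
```

Each chunk would then be summarized independently, with the summaries concatenated before reaching the reasoning module.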
2.4 Module Interaction Flow
User executes command
|
v
[Parsing Module] -- summarizes output
|
v
[Reasoning Module] -- updates PTT, selects next task
|
v
[Generation Module] -- produces executable commands
|
v
User executes next command (loop)
Each module maintains its own conversation history (separate LLM sessions). This prevents cross-contamination and keeps each module's context focused on its specific role.
3. Evolution: From Human-in-the-Loop to Autonomous Agent
3.1 Legacy Architecture (v0.x -- Research Paper)
The original research implementation was human-in-the-loop:
- User provided tool output via terminal
- User chose interaction mode: "next" (provide results), "todo" (get task list), "discuss" (freeform), "more" (drill into subtask)
- Three separate LLM sessions (reasoning, generation, parsing) ran as persistent conversations
- User executed commands manually and pasted output back
- Session persistence via JSON file storage of conversation IDs
This architecture validated the three-module concept and produced the USENIX Security 2024 results.
3.2 Agentic Architecture (v1.0+)
The production evolution collapsed the three modules into a single autonomous agent backed by Claude Code's tool-use capabilities:
Key architectural changes:
- Single agent with tool access: Instead of three LLM sessions coordinating, one powerful model (Claude Sonnet) executes bash commands, reads files, and navigates the engagement autonomously
- Event-driven architecture: An EventBus (pub/sub singleton) decouples the agent from the TUI, enabling real-time streaming of state changes, messages, tool executions, and flag detections
- 5-state lifecycle: IDLE -> RUNNING -> PAUSED -> COMPLETED -> ERROR with pause/resume at message boundaries
- Abstract backend protocol: AgentBackend interface allows swapping LLM implementations (Claude Code, OpenAI, local models) without changing controller logic
- Session persistence: File-based JSON storage of session state, flags found, cost tracking, and backend session IDs for resume capability
The system prompt replaced the three-module architecture with a single comprehensive prompt that encodes:
- Never-give-up persistence directives
- Systematic methodology (recon -> vuln discovery -> exploitation -> flag extraction)
- CTF-specific category knowledge (web, binary, crypto, forensics, privesc)
- Fallback strategies organized by failure mode (shell not working, stuck on privesc, etc.)
- Flag pattern recognition (regex patterns for common formats)
- Walkthrough documentation requirements
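The 5-state lifecycle above can be sketched as a small transition table. The exact set of allowed transitions shown here is an assumption inferred from the pause/resume description, not the actual implementation:

```python
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    ERROR = "error"

# Assumed legal transitions; COMPLETED and ERROR are terminal
TRANSITIONS = {
    AgentState.IDLE: {AgentState.RUNNING},
    AgentState.RUNNING: {AgentState.PAUSED, AgentState.COMPLETED, AgentState.ERROR},
    AgentState.PAUSED: {AgentState.RUNNING, AgentState.ERROR},
    AgentState.COMPLETED: set(),
    AgentState.ERROR: set(),
}

class Lifecycle:
    def __init__(self) -> None:
        self.state = AgentState.IDLE

    def transition(self, new: AgentState) -> None:
        """Move to a new state, rejecting illegal jumps."""
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.state = new

lc = Lifecycle()
lc.transition(AgentState.RUNNING)
lc.transition(AgentState.PAUSED)   # pause at a message boundary
lc.transition(AgentState.RUNNING)  # resume
lc.transition(AgentState.COMPLETED)
```

Validating transitions centrally is what makes pause/resume safe: the UI can only request moves the lifecycle permits.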
3.3 Multi-Model Routing
The agentic architecture supports model routing for different task types:
| Route | Purpose | Model Selection Rationale |
|---|---|---|
| default | General tasks | Fast, cost-effective model |
| background | Background operations | Cheap model for overhead |
| think | Reasoning-heavy tasks | Strong reasoning model |
| longContext | Large context handling | Model with extended context |
| webSearch | Web search operations | Model with search capability |
This routing pattern is implemented through a proxy (Claude Code Router) that intercepts requests and dispatches to appropriate backends (OpenRouter, local LLMs, direct API).
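A minimal sketch of the dispatch side of this pattern; the model identifiers are placeholders, not real endpoint names:

```python
# Hypothetical route table mirroring the routing pattern above
ROUTES = {
    "default": "fast-general-model",
    "background": "cheap-overhead-model",
    "think": "strong-reasoning-model",
    "longContext": "extended-context-model",
    "webSearch": "search-capable-model",
}

def pick_model(route: str) -> str:
    """Return the model configured for a route, falling back to
    the default route for unknown labels (sketch)."""
    return ROUTES.get(route, ROUTES["default"])
```

In a real proxy the table would also carry per-backend endpoints and credentials; the key idea is that routing is declarative configuration, not logic baked into the agent.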
4. Prompt Engineering for Security Testing
4.1 The Penetration Testing Tree Prompt
The PTT prompt is the single most important design element. Key requirements encoded in the prompt:
- Tree structure enforcement: "Tasks are in layered structure, i.e., 1, 1.1, 1.1.1, etc."
- Status tracking: Every task carries completion status
- Conservative expansion: "Initially, you should only generate root tasks. You don't generate tasks for unknown ports/services."
- Dynamic pruning: "Remove redundant/outdated tasks from the task list"
- Priority-based selection: "Analyze those tasks and decide which one should be performed next based on their likelihood to a successful exploit"
- Structured output format: Three sentences with separator for automated parsing
4.2 Persistence Directives
The autonomous agent uses aggressive persistence prompting -- this is critical for CTF/HTB performance:
CRITICAL REQUIREMENT - NEVER GIVE UP:
Your task is INCOMPLETE until you have captured at least one flag.
Do NOT stop or provide a "walkthrough so far" without flags.
- If one technique doesn't work, try alternatives immediately
- If a reverse shell fails, try different payloads, ports, encodings
- Complexity and time spent are NOT reasons to stop
- NEVER say "given the time spent" as a reason to stop
This prompt engineering addresses the fundamental LLM tendency to produce "helpful summaries" instead of continuing to work. The phrase "given the time spent" is specifically called out because it was the most common premature termination pattern in evaluation.
Multi-attempt retry with escalating instructions: When the agent stops without flags, the system automatically retries (up to 3 attempts) with progressively stronger persistence instructions appended to the custom instruction field.
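The escalating-retry loop can be sketched like this; the function name, escalation strings, and callback shape are illustrative assumptions:

```python
def run_with_escalation(run_attempt, base_instructions: str,
                        max_attempts: int = 3) -> list[str]:
    """Retry an agent run with progressively stronger persistence
    instructions until at least one flag is found (sketch).

    run_attempt: any callable taking the instruction string and
    returning the list of flags found on that attempt.
    """
    escalations = [
        "",
        "\nDo NOT stop without a flag. Try alternative techniques.",
        "\nPrevious attempts stopped early. Exhaust every fallback "
        "strategy before stopping.",
    ]
    for attempt in range(max_attempts):
        instructions = base_instructions + escalations[min(attempt, len(escalations) - 1)]
        flags = run_attempt(instructions)
        if flags:
            return flags
    return []

# Demo with a stub agent that only succeeds under the strongest prompt
attempts: list[str] = []
def fake_run(instructions: str) -> list[str]:
    attempts.append(instructions)
    return ["flag{demo}"] if "Exhaust" in instructions else []

flags = run_with_escalation(fake_run, "Capture the flag on the target.")
```

Note that per section 5.4 below, this kind of retry has sharply diminishing returns; escalation is cheap insurance against premature termination, not a substitute for strategy pivots.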
4.3 Fallback Strategy Trees
The system prompt encodes structured fallback strategies organized by failure type:
- Reverse shell failures: Try different shells (bash, python, php, perl, nc), encodings (URL, base64, hex), ports (80, 443, 4444), bind shells, staged payloads
- No interactive shell: Write SSH keys, create cron jobs, deploy web shells, leverage existing processes
- Privilege escalation stuck: SUID binaries, sudo -l, capabilities, cron jobs, writable /etc/ files, kernel exploits, credential hunting
- Enumeration complete but no flags: Re-enumerate aggressively, check non-standard ports, hidden subdirectories, source code review, fuzzing, race conditions, second-order vulnerabilities
- Web exploitation failures: Manual exploitation, filter bypasses, polyglot payloads, vulnerability chaining, logic flaws, deprecated API versions
These are not random lists -- they encode the actual decision trees experienced penetration testers follow when stuck. The ordering reflects likelihood of success.
4.4 Flag Detection Patterns
Regex-based flag detection runs continuously on all agent output:
FLAG_PATTERNS = [
r"flag\{[^\}]+\}", # flag{...}
r"FLAG\{[^\}]+\}", # FLAG{...}
r"HTB\{[^\}]+\}", # HTB{...}
r"CTF\{[^\}]+\}", # CTF{...}
r"[A-Za-z0-9_]+\{[^\}]+\}", # Generic CTF format
r"\b[a-f0-9]{32}\b", # 32-char hex (HTB flags)
]
Strict validation for benchmarking adds minimum content length (32+ characters) to avoid false positives from code snippets containing patterns like private{self}.
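A sketch of how strict-mode validation might layer on top of the patterns above; the `find_flags` function and the trimmed pattern list are illustrative, not the production code:

```python
import re

# Subset of the flag patterns shown above (sketch)
FLAG_PATTERNS = [
    r"flag\{[^\}]+\}",
    r"HTB\{[^\}]+\}",
    r"[A-Za-z0-9_]+\{[^\}]+\}",  # generic CTF format
]

def find_flags(text: str, strict: bool = False, min_len: int = 32) -> list[str]:
    """Scan output for flag candidates. In strict mode, require at
    least min_len characters between the braces, filtering out code
    snippets like private{self} (illustrative sketch)."""
    found: list[str] = []
    for pattern in FLAG_PATTERNS:
        for match in re.findall(pattern, text):
            inner = match[match.index("{") + 1:-1]
            if strict and len(inner) < min_len:
                continue
            if match not in found:
                found.append(match)
    return found
```

Permissive mode suits interactive use where an operator reviews candidates; strict mode suits unattended benchmarking where a false positive terminates the run.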
5. Benchmark Results and Capability Analysis
5.1 USENIX Security 2024 Benchmark (Original Research)
- 13 real-world targets from HackTheBox and VulnHub
- 182 total sub-tasks decomposed from these targets
- Comparison: GPT-3.5 alone, GPT-4 alone, Google Bard alone, human expert (OSCP certified), and the three-module system
- Result: 228.6% improvement over the GPT-3.5 baseline; outperformed standalone GPT-4 by a significant margin
- Distinguished Artifact Award -- artifact was Available, Functional, and Reproduced
5.2 XBOW Validation Benchmark (Agentic v1.0, December 2025)
104 Docker-containerized vulnerability challenges across 15+ vulnerability categories.
Overall: 86.5% success rate (90/104)
| Metric | Value |
|---|---|
| Average cost per success | $1.11 |
| Median cost per success | $0.42 |
| Average time per success | 6.1 minutes |
| Median time per success | 3.3 minutes |
| Cost-time correlation | 0.96 (very strong) |
By difficulty:
- Level 1 (Easy): 91.1% success
- Level 2 (Medium): 74.5% success
- Level 3 (Hard): 62.5% success
Strongest categories: IDOR (93%), Command Injection (91%), Privilege Escalation (86%), Business Logic (86%)
Weakest categories: XSS (74%), Default Credentials (72%), and specialized attacks (HTTP Smuggling, Race Conditions -- 0%)
5.3 What the Agent Cannot Do
14 challenges remain permanently unsolved across all retry attempts. The failure patterns reveal fundamental LLM limitations:
- False flag detection from code context: The agent finds htb{ strings in its own code output and declares victory. This is a prompt/parsing problem, not a capability limitation.
- Blind injection attacks: Blind SQLi and blind SSTI require iterative boolean-based extraction loops that the agent struggles to maintain systematically over many iterations.
- HTTP request smuggling/desync: Requires precise byte-level manipulation of HTTP requests that exceeds current LLM capability for tool orchestration.
- Race conditions: Concurrency exploitation requires timing-sensitive parallel request orchestration that single-threaded agent loops cannot effectively achieve.
- Multi-stage exploitation chains: Challenges requiring 4+ distinct exploitation steps in sequence (e.g., default creds -> SSTI -> encoding bypass -> flag extraction) hit context degradation.
- Time-based attacks: Anything requiring careful timing analysis (time-based blind SQLi, race conditions) consistently times out before the agent can extract sufficient data.
5.4 Diminishing Returns on Retry
Critical insight from the three-run evaluation:
- Run 1: 80.8% success (84/104)
- Run 2 (retry failures): 25% success (5/20)
- Run 3 (retry remaining): 6.7% success (1/15)
After the first attempt, retrying the same approach is almost worthless. The remaining failures require fundamentally different strategies, not persistence. This has direct implications for agent design: instead of retrying, the system should detect stuck states and pivot to completely different attack vectors.
6. Architecture Patterns for CIPHER Adoption
6.1 Event-Driven Agent-UI Decoupling
The EventBus pattern provides clean separation between agent logic and interface:
EventType.STATE_CHANGED -- agent lifecycle transitions
EventType.MESSAGE -- text output from agent
EventType.TOOL -- tool start/complete with args
EventType.FLAG_FOUND -- flag detected (for CIPHER: finding detected)
EventType.USER_COMMAND -- pause/resume/stop from UI
EventType.USER_INPUT -- operator instruction injection
This pattern enables:
- Multiple UI frontends (TUI, CLI, headless, API) from one agent
- Real-time streaming of agent activity
- Non-destructive pause/resume at message boundaries
- Operator instruction injection without restarting the engagement
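A minimal sketch of the pub/sub singleton described above; class and method names are illustrative, not the actual EventBus API:

```python
from collections import defaultdict
from enum import Enum, auto

class EventType(Enum):
    STATE_CHANGED = auto()
    MESSAGE = auto()
    TOOL = auto()
    FLAG_FOUND = auto()

class EventBus:
    """Process-wide pub/sub singleton decoupling agent from UI (sketch)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._subscribers = defaultdict(list)
        return cls._instance

    def subscribe(self, event_type: EventType, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: EventType, payload=None) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

# Any frontend can subscribe; the agent publishes without knowing who listens
received: list[str] = []
bus = EventBus()
bus.subscribe(EventType.FLAG_FOUND, received.append)
EventBus().publish(EventType.FLAG_FOUND, "finding detected")  # same singleton
```

Because publishers and subscribers only share the event types, a headless runner and a TUI can attach to the same agent without either knowing about the other.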
6.2 Abstract Backend Protocol
The AgentBackend abstract class defines the minimum interface for any LLM backend:
from abc import ABC
from collections.abc import AsyncIterator

class AgentBackend(ABC):
    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...
    async def query(self, prompt: str) -> None: ...
    def receive_messages(self) -> AsyncIterator[AgentMessage]: ...
    @property
    def session_id(self) -> str | None: ...
    @property
    def supports_resume(self) -> bool: ...
    async def resume(self, session_id: str) -> bool: ...
Unified AgentMessage type with MessageType enum (TEXT, TOOL_START, TOOL_RESULT, RESULT, ERROR) normalizes output across different LLM frameworks.
6.3 Session Persistence Model
File-based JSON storage with the following tracked state:
- Session ID (8-char UUID prefix)
- Target, task, model
- Status (running/paused/completed/error)
- Backend session ID (for resume)
- User instructions injected during session
- Flags/findings found with context snippets
- Cumulative cost tracking
- Error state
This enables engagement suspension and resumption across sessions -- critical for long-running penetration tests.
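A sketch of the save/load round trip, assuming the file naming and field set shown here (the actual schema may differ):

```python
import json
import tempfile
import uuid
from pathlib import Path

def save_session(directory: Path, session: dict) -> Path:
    """Persist session state as JSON, keyed by an 8-char UUID prefix (sketch)."""
    session.setdefault("id", uuid.uuid4().hex[:8])
    path = directory / f"session-{session['id']}.json"
    path.write_text(json.dumps(session, indent=2))
    return path

def load_session(path: Path) -> dict:
    """Restore a previously saved session for resume."""
    return json.loads(path.read_text())

# Round-trip demo in a throwaway directory
workdir = Path(tempfile.mkdtemp())
state = {
    "target": "10.10.11.5",
    "status": "paused",
    "backend_session_id": None,
    "flags": [],
    "cost_usd": 0.42,
}
saved = save_session(workdir, state)
restored = load_session(saved)
```

Everything needed to resume -- target, status, backend session ID, findings, and cost so far -- travels in one plain JSON file, which keeps the persistence layer inspectable and diff-friendly.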
6.4 Docker-First Isolation
Security testing environments require isolation:
- Non-root user (pentester) with sudo NOPASSWD
- Pre-installed tool suite: nmap, netcat, curl, wget, jq, ripgrep, tmux
- VPN support (openvpn) for HTB/THM connectivity
- Workspace volume mount for persistent artifacts
- Claude Code CLI and Router installed globally
- Config volume for LLM authentication persistence
6.5 Observability via Langfuse Integration
Production telemetry tracks:
- Session metadata (target type, duration, completion status)
- Tool execution patterns (which tools, not actual commands)
- Flag detection events (event occurred, not flag content)
- Cost and timing data per session
Opt-out via --no-telemetry flag or LANGFUSE_ENABLED=false environment variable.
7. Effective Strategies for LLM-Driven Security Testing
7.1 Task Decomposition is Everything
The single most important lesson: decompose pentesting into LLM-manageable subtasks. A raw LLM asked to "hack this machine" will flounder. A structured system that asks "given these nmap results, which service should we investigate next and what specific command should we run?" succeeds.
The Penetration Testing Tree is the mechanism for this decomposition. Each node is a concrete, actionable task with clear success/failure criteria.
7.2 Output Compression is Critical
Never feed raw tool output directly to the reasoning model. Always parse, summarize, and compress first. Key data to preserve:
- Port numbers AND service versions
- Specific error messages and HTTP status codes
- File paths and directory structures
- Credential formats and hashes
- Source code patterns indicating vulnerabilities
Data to discard:
- Banner noise and formatting
- Repeated entries
- Standard "no vulnerability found" results
- Debug/verbose output that doesn't contain actionable info
7.3 Separate Strategy from Execution
The three-module split prevents a critical failure mode: the execution-focused model overriding strategic decisions. When a generation model sees the full attack tree, it tends to second-guess priorities and propose alternative strategies instead of executing the assigned task. Context restriction (only showing the selected task) keeps each module in its lane.
7.4 Persistence Beats Intelligence
On the XBOW benchmark, the fastest solves (0.9 minutes) were trivially simple. The most expensive solves ($5.56, 23+ minutes) were still successful because the agent kept trying different approaches. The 14 permanently failed challenges were not harder in absolute terms -- the agent simply lacked the specific technique required and could not discover it through iteration.
Implication: Build agents that try many different approaches rather than agents that think harder about one approach. Breadth of attack surface coverage beats depth of analysis on any single vector.
7.5 Detect Stuck States Early
With a 0.96 cost-time correlation, spending more money on a stuck agent is almost purely waste. Effective heuristics for stuck detection:
- Same command executed 3+ times with identical output
- No new findings in the last N tool executions
- Reasoning module producing identical task trees across updates
- Token consumption exceeding 2x the median for the difficulty level
When stuck is detected, the correct response is not "try harder" but "try differently" -- inject an instruction to abandon the current approach and enumerate from scratch.
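The first two heuristics can be sketched as a small detector; the class name and threshold defaults are illustrative, chosen to match the numbers above:

```python
from collections import deque

class StuckDetector:
    """Flag a stuck state when the same command repeats with identical
    output, or no new findings appear for a window of tool executions
    (thresholds are illustrative, per the heuristics above)."""

    def __init__(self, repeat_limit: int = 3, window: int = 10):
        self.repeat_limit = repeat_limit
        self.window = window
        self.history = deque(maxlen=repeat_limit)  # last (command, output) pairs
        self.since_new_finding = 0

    def record(self, command: str, output: str, new_finding: bool) -> bool:
        """Record one tool execution; return True if the agent looks stuck."""
        self.history.append((command, output))
        self.since_new_finding = 0 if new_finding else self.since_new_finding + 1
        repeated = (len(self.history) == self.repeat_limit
                    and len(set(self.history)) == 1)
        stale = self.since_new_finding >= self.window
        return repeated or stale

# Three identical executions trip the repeat heuristic on the third call
detector = StuckDetector()
results = [detector.record("nmap -sV 10.0.0.5", "port 80 open", new_finding=False)
           for _ in range(3)]
```

When the detector fires, the controller would inject a pivot instruction rather than letting the loop continue, consistent with "try differently, not harder."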
7.6 Flag/Finding Detection Must Be Robust
False positives from code context are the #1 benchmark failure mode caused by the detection system itself. Production implementations need:
- Minimum content length thresholds (32+ chars inside delimiters)
- Context-aware filtering (ignore flags found in source code snippets, error messages, or the agent's own reasoning)
- Strict vs. permissive modes for different operational contexts
8. Limitations and Open Problems
8.1 Fundamental LLM Limitations for Pentesting
- Timing-sensitive attacks: Race conditions, time-based blind injection require sub-second precision that LLM-orchestrated tools cannot reliably achieve
- Binary exploitation: ROP chain construction, heap exploitation, and similar memory corruption techniques require mathematical precision beyond current LLM capability
- Novel vulnerability classes: LLMs can only exploit patterns they've seen in training data. Zero-day discovery requires reasoning beyond pattern matching
- Stealth and evasion: Autonomous agents are inherently noisy. Every tool execution generates artifacts. There is no concept of OpSec in current implementations
- Scope management: Autonomous agents have no inherent understanding of engagement scope. Without explicit constraints, they will scan/exploit anything reachable
8.2 The Autonomy-Control Tradeoff
Fully autonomous mode achieves higher benchmark scores but introduces risks:
- No human judgment on scope boundaries
- No validation of exploit safety (crash risk on production systems)
- No contextual understanding of business impact
- Cost runaway on stuck engagements
The pause/resume/inject pattern addresses this partially: operators can observe real-time activity and inject course corrections without restarting the engagement.
8.3 Multi-Step Exploitation Chains
Current architectures degrade significantly beyond 3-4 exploitation steps. Each step generates output that compresses imperfectly, and the accumulated context loss eventually causes the agent to lose track of where it is in the chain. This is the core unsolved problem.
8.4 Cost Efficiency at Scale
At $1.11 average per benchmark challenge, automated testing is remarkably cheap for individual targets. However, at enterprise scale (thousands of targets), costs become significant. The 0.96 cost-time correlation means the expensive failures are also the slow ones -- early termination heuristics are the primary lever for cost control.
9. Integration Points for CIPHER
9.1 Applicable Architecture Patterns
- PTT-style state management: CIPHER's engagement context template already mirrors the PTT concept. Formalizing this as a maintained tree structure during extended engagements would improve multi-step reasoning
- Output compression: CIPHER should always summarize tool output before reasoning about it, following the parsing module pattern
- Event-driven status reporting: The EventBus pattern maps directly to CIPHER's agentic protocol phases (REASON, PLAN, EXECUTE, ANALYZE, LOOP, REPORT)
- Session persistence: CIPHER's engagement context can be serialized/deserialized for long-running assessments
9.2 Prompt Engineering Lessons
- Never-give-up directives work: Explicit anti-quitting language measurably improves completion rates
- Fallback strategy trees in prompts: Pre-encoding common failure recovery strategies reduces stuck states
- Category-specific knowledge: Loading domain-specific context (web vuln types, privesc techniques, crypto patterns) improves performance within that category
- Structured output formats: The three-sentence task description with separator enables reliable parsing and automation
9.3 What CIPHER Does Better
The research implementation focuses narrowly on CTF/HTB flag capture. CIPHER's broader security domain coverage (RED/BLUE/PURPLE/PRIVACY/INCIDENT/ARCHITECT modes) enables:
- Simultaneous offensive and defensive analysis (PURPLE layer)
- Privacy impact assessment alongside exploitation (PRIVACY layer)
- Detection opportunity identification during red team operations
- Structured finding reports with CVSS/CWE/ATT&CK mapping
- Incident response integration with evidence preservation protocols
The key advantage is not autonomy but contextual depth -- understanding not just how to exploit a vulnerability but its business impact, detection signatures, remediation priority, and regulatory implications.
Last updated: 2026-03-14 Classification: CIPHER internal knowledge -- AI-assisted security methodology