How is agentic AI pentesting different from a vulnerability scanner?

A scanner runs a fixed checklist and alerts on signature matches. Agentic pentesting uses LLM agents that reason about context, plan attack chains, and validate findings with working proofs of concept. Scanners alert; agents prove.

Does Fleuret AI run on OpenAI or Anthropic models?

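Not through their APIs. Fleuret runs on open-weight models (gpt-oss-120b, gpt-oss-20b, Kimi K2.5) hosted on Scaleway in France. The gpt-oss weights were published by OpenAI, but inference runs on Scaleway rather than on an OpenAI or Anthropic endpoint. No customer data leaves EU jurisdiction and there is no CLOUD Act exposure for DORA-regulated workloads.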

Can agentic pentesting cover internal networks or Active Directory?

Not in 2026. Strong coverage today: web applications, REST and GraphQL APIs, external infrastructure. Still developing: Active Directory, social engineering, multi-actor industrial logic. Pair Fleuret with one annual human engagement for those scopes.

What does a Fleuret pentest report contain?

Scope, methodology, findings with reproduction steps and screenshots, remediation guidance, retest plan. Plus machine-readable JSON findings, Ed25519-signed evidence, native Jira tickets, and audit PDFs mapped to DORA Article 24 and NIS2 Annex I, all shipped by default.

How does the Coverage Graph improve testing quality?

The Coverage Graph tracks what has been seen, tested, and remains unexplored. It is the termination oracle: testing stops when the graph is saturated, not when a timer expires. This separates depth from cost and shows auditors completeness, not just findings.

Agentic AI pentesting: how autonomous agents test web apps

What "agentic" means here

A scanner runs a fixed list of checks against a target. An agent decides what to do next, based on what it just observed. The difference matters: real intrusions are sequences of reasoning, not signature matches.

An agentic pentest system is a coordinated set of LLM-powered agents, each specialised, that share a model of the target and take turns deciding the next attack step. It is closer in shape to a junior red team led by a senior than to a Nessus scan.

The core architecture

Three layers.

1. The discovery layer. Crawlers, fuzzers, and recon agents map the attack surface: endpoints, parameters, subdomains, authentication flows, third-party libraries, error patterns. Output is a structured representation of the target. We call ours the Coverage Graph: a hierarchical data structure that tracks what has been seen, what has been tested, and what remains unexplored.

2. The reasoning layer. A planning agent reads the Coverage Graph and proposes attack chains. "This endpoint accepts a UUID and returns user data, the auth token comes from a JWT signed with HS256, the session resumes via a refresh token that is not bound to the device. Try a horizontal IDOR plus a refresh-token replay." That is reasoning, not a rule.

3. The execution and validation layer. Specialised agents execute: an injection agent, an auth agent, a logic-flaw agent, an SSRF agent. Each one tries the planned attack, observes the response, refines, and either validates the finding with a working PoC or marks the hypothesis as failed. No PoC, no finding.
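
As a rough illustration of the discovery layer's output, here is a minimal sketch of a hierarchical coverage structure that tracks what has been seen, what has been tested, and what remains unexplored. The names `Node`, `Status`, `unexplored`, and `saturated`, and the example host and paths, are assumptions made for this sketch; the article does not describe the Coverage Graph's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SEEN = "seen"      # discovered, no attack hypothesis resolved yet
    TESTED = "tested"  # at least one hypothesis validated or failed


@dataclass
class Node:
    """One element of the attack surface (host, path segment, parameter)."""
    name: str
    status: Status = Status.SEEN
    children: dict[str, "Node"] = field(default_factory=dict)

    def add(self, path: list[str]) -> "Node":
        """Record a discovered path, creating intermediate nodes as needed."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Node(part))
        return node

    def unexplored(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return the paths of leaf nodes that are seen but not yet tested."""
        here = prefix + (self.name,)
        if not self.children:
            return [here] if self.status is Status.SEEN else []
        found: list[tuple[str, ...]] = []
        for child in self.children.values():
            found.extend(child.unexplored(here))
        return found

    def saturated(self) -> bool:
        """True when no seen-but-untested leaf remains."""
        return not self.unexplored()


# Hypothetical target and endpoints, for illustration only.
graph = Node("app.example.test")
graph.add(["api", "users", "{uuid}"])
graph.add(["api", "login"])
graph.add(["api", "users", "{uuid}"]).status = Status.TESTED

print(graph.unexplored())  # [('app.example.test', 'api', 'login')]
print(graph.saturated())   # False
```

In this sketch a planner would keep pulling work from `unexplored()` and stop once `saturated()` returns true, rather than stopping on a timer.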

A scanner alerts. A pentester proves. Agentic systems prove because validation is part of the loop, not a downstream step.
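
The "validation is part of the loop" idea can be sketched as follows. This is a simplified illustration, not Fleuret's implementation: `Hypothesis`, `Finding`, `validate`, `report`, and the stand-in `fake_executor` are hypothetical names, and a real executor would be an agent issuing requests against the target.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    target: str
    technique: str


@dataclass
class Finding:
    hypothesis: Hypothesis
    poc: str  # the request or steps that demonstrated the issue


# An executor tries the attack once and returns a PoC string on success.
Executor = Callable[[Hypothesis, int], Optional[str]]


def validate(hyp: Hypothesis, execute: Executor, max_attempts: int = 3) -> Optional[Finding]:
    """Try, observe, refine; report only if a working PoC was produced."""
    for attempt in range(1, max_attempts + 1):
        poc = execute(hyp, attempt)
        if poc is not None:
            return Finding(hyp, poc)
    return None  # hypothesis failed: no PoC, no finding


def report(hypotheses: list[Hypothesis], execute: Executor) -> list[Finding]:
    """Keep only the hypotheses that were proven."""
    return [f for h in hypotheses if (f := validate(h, execute)) is not None]


def fake_executor(hyp: Hypothesis, attempt: int) -> Optional[str]:
    # Stand-in for a real agent: the IDOR succeeds on the second refinement.
    if hyp.technique == "horizontal IDOR" and attempt == 2:
        return f"GET {hyp.target} with user B's token returned user A's record"
    return None


hypotheses = [
    Hypothesis("/api/users/{uuid}", "horizontal IDOR"),
    Hypothesis("/api/login", "SQL injection"),
]
findings = report(hypotheses, fake_executor)
print([f.hypothesis.technique for f in findings])  # ['horizontal IDOR']
```

The unproven SQL injection hypothesis never reaches the report, which is the practical difference from an alert-only scanner.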

Why open-weight models matter

Many AI pentest systems wrap a closed-model API. That works in a demo but fails in regulated environments, for three reasons:

Data residency. Sending source code or production traffic samples to a US-hosted model breaks DORA's data-localisation expectations and most NIS2 critical-infrastructure operator policies.
Cost at scale. Closed-model token costs make per-engagement economics impossible at the €4,000 per test price point. Open-weight inference on dedicated GPUs runs at €20 to €25 of compute per pentest.
Fine-tuning. A pentest agent gets better when you train on real engagement traces. You cannot do this on a closed API.
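
To make the cost point concrete, the arithmetic below uses only the two figures quoted above (€4,000 per test, €20 to €25 of compute) plus the weekly cadence mentioned elsewhere in this article. No closed-model cost is given in the text, so none is computed; the variable names are illustrative.

```python
PRICE_PER_TEST_EUR = 4_000       # price point quoted above
COMPUTE_PER_TEST_EUR = (20, 25)  # open-weight inference range quoted above
TESTS_PER_YEAR = 52              # assumes one test per week


def compute_share(compute_eur: float, price_eur: float) -> float:
    """Fraction of the test price consumed by inference compute."""
    return compute_eur / price_eur


low, high = (compute_share(c, PRICE_PER_TEST_EUR) for c in COMPUTE_PER_TEST_EUR)
yearly_low, yearly_high = (c * TESTS_PER_YEAR for c in COMPUTE_PER_TEST_EUR)

print(f"compute share of price: {low:.3%} to {high:.3%}")
# compute share of price: 0.500% to 0.625%
print(f"weekly testing, compute per year: EUR {yearly_low} to {yearly_high}")
# weekly testing, compute per year: EUR 1040 to 1300
```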

We run on open-weight models (gpt-oss-120b, gpt-oss-20b, Kimi K2.5) hosted on Scaleway in France. Sovereign by construction.

What it does well, where it is still developing

Strong in 2026:

Web application logic (auth, authorization, IDOR, injection families, business-logic chains on standard CRUD).
REST and GraphQL APIs (introspection, broken object-level auth, mass assignment).
External infrastructure (subdomain takeover, exposed services, misconfigured TLS).

Still developing:

Active Directory and Kerberos-heavy internal networks.
Social engineering and phishing simulations.
Bespoke business logic on multi-actor industrial workflows.

Honest map. Pair the continuous AI layer with one annual human engagement for the still-developing scopes.

What you get out

A pentest report that looks like a senior consultant wrote it: scope, methodology, findings, reproduction steps, screenshots, remediation, retest plan. Plus the things a human cannot ship: structured machine-readable findings, signed evidence, integration with Jira and your compliance platform, weekly cadence at marginal cost.
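
For a sense of what "structured machine-readable findings" with verifiable evidence could look like, here is a minimal sketch that serialises a finding deterministically and hashes it, the step that would precede a signature. Every field name and value is hypothetical, not Fleuret's schema, and the signing step itself is left out because the Python standard library has no Ed25519 implementation.

```python
import hashlib
import json

# Hypothetical finding record; field names and values are illustrative only.
finding = {
    "id": "F-001",
    "title": "Horizontal IDOR on user lookup",
    "severity": "high",
    "endpoint": "GET /api/users/{uuid}",
    "reproduction": [
        "Authenticate as user B",
        "Request user A's UUID on the endpoint",
        "Observe user A's record in the response",
    ],
}


def canonical_bytes(record: dict) -> bytes:
    """Serialise with sorted keys and fixed separators so the bytes are stable."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def evidence_digest(record: dict) -> str:
    """SHA-256 hex digest of the canonical form; a signer would sign these bytes."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = evidence_digest(finding)
reordered = dict(reversed(list(finding.items())))
print(digest == evidence_digest(reordered))  # True: key order does not matter
```

Canonical serialisation matters here because a signature over JSON is only checkable if producer and verifier derive the same bytes from the same record.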

If you want to see one running on your own surface, request a demo.

Related reading

Automated vs manual pentesting: where each one wins, by surface and depth.
Pentest cost in Europe 2026: why agentic pricing is 10x lower at comparable depth.
Why annual pentests are broken: the case for continuous testing on SaaS.
Sovereign EU AI pentest: CLOUD Act, Schrems II, and the EU AI Act in 2026.
