Utopia Tech
Engineering · 4 min read

Build your own vulnerability harness

Utopia Tech

June 18, 2026 · 4 min read

A few weeks ago, we published our initial findings from Project Glasswing, looking at what happens when you point frontier security models at an enterprise codebase. We also explored how our defensive structures adapt to protect our infrastructure and customers from threats posed by frontier AI. Since then, the AI ecosystem has continued to shift rapidly: developers who've built tightly around a single model have already experienced what happens when that model is no longer available or gets superseded by a more capable one.

These market shifts only reinforce our core thesis: no matter which underlying model is leading the pack on any given day, the future of agentic workflows will not be found in standalone models, prompts, or single-agent sessions. Moving from a localized security "skill" to a continuous, fleet-wide scanning pipeline requires an architecture where models are treated as interchangeable components.

Relying on a single model inherently limits defensive coverage, because the same model tends to examine every code path through the same lens. To counter this, models should be frequently interchanged and cross-tested. By varying the models across the pipeline, for example using one model for initial discovery and an entirely different one for validation, we can ensure that each vulnerability is cross-checked by a model with different blind spots.
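One way to picture "models as interchangeable components" is a pipeline that maps stage names to models through configuration only. The following is a minimal sketch, not the implementation described in this post: the `Pipeline` class, the stage names, and the two stand-in models are all hypothetical, and a real setup would wrap provider clients behind the same callable shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is any callable that maps a prompt to text (assumption).
Model = Callable[[str], str]


@dataclass
class Pipeline:
    """Maps each stage name to whichever model is currently assigned to it."""
    stage_models: Dict[str, Model]

    def run_stage(self, stage: str, prompt: str) -> str:
        return self.stage_models[stage](prompt)

    def swap(self, stage: str, model: Model) -> None:
        # Replacing a model touches configuration only, never stage logic.
        self.stage_models[stage] = model


# Stand-in models so the sketch runs without any network access.
def model_a(prompt: str) -> str:
    return f"model_a:{prompt}"


def model_b(prompt: str) -> str:
    return f"model_b:{prompt}"


# Discovery and validation are deliberately assigned to different models.
pipeline = Pipeline({"discover": model_a, "validate": model_b})
candidate = pipeline.run_stage("discover", "scan parser.c")
verdict = pipeline.run_stage("validate", candidate)
```

Because stages only ever call `run_stage`, retiring or replacing a model is a one-line `swap` rather than a rewrite.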

Furthermore, a true enterprise-scale harness must look beyond isolated repositories to trace vulnerabilities across cross-repo dependencies, ultimately filtering thousands of raw candidates down to a trusted, triaged queue of actionable fixes. This post serves as a practical look at how to build that model-agnostic layer, focusing on how we manage state controls, eliminate false positives, and coordinate end-to-end triage at scale.

Two objections, up front

The first post made the case for why generic coding agents can't do this job. The main issue is that agents only hold one hypothesis at a time, fill their context window after covering a sliver of a real repo, and then lose information during context compaction. For more details, read that post.

Before we move forward, we would like to answer two likely questions.

"Why not use subagents instead of a harness?"

Subagents are useful, and they are a good starting point. But security analysis needs hundreds of separate investigations that survive across runs, don't share a context window, and can be re-scoped and cross-referenced later. It needs persistence, deduplication, resumability, and eventually fleet-wide dependency tracing. That's an orchestration problem, and a prompt can't get you there.

"Is this blog post just an ad for frontier models?"

No. Our approach centers on the harness, not the model. When it comes to vulnerability discovery, we run it with whatever frontier model is currently best at what we need. When we point different models at the same target, they each turn up a different share of the bugs. The harness is the bit that lasts.

If you build your own system, design it to be model-agnostic from day one, so that you are free to swap in any model without reworking the pipeline.

It all starts with a skill

We started with a ~450-line security-audit skill that we ran on a single repository, and adjusted the prompts until we surfaced real bugs.

Later, we added the orchestration that became the plumbing of the entire system. The real value lives in the prompts themselves, and our prompts continue to carry the initial skill's attacker scenarios, bug classes, and anti-pattern detections nearly unchanged.

The skill was written to run a 7-phase audit in one session:

1. Three parallel research agents do recon and write an architecture.md.
2. One Hunter agent runs per attack class, trying to break the code rather than review it.
3. Adversarial validators try to disprove each finding.
4. The survivors are written up as a human-readable vulnerability report.
5. They're also emitted as findings.json against a schema, and a mechanical check validates that file.
6. Finally, a fresh agent independently re-verifies every finding against the source.
7. The surviving, re-verified findings are submitted to the ingest API.

That first skill maps almost directly onto the later harness:

| Skill phase | Harness stage |
| --- | --- |
| Recon agents write architecture.md | Recon |
| Hunters run per attack class | Hunt |
| Validators disprove findings | Validate |
| Surviving findings become a report | Report |
| findings.json is checked mechanically for schema adherence, not correctness | Mechanical validation of line numbers and functions in findings |
| Fresh agent re-verifies findings | Independent validation |

The skill worked, but it quickly revealed its limits.
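The mechanical check on findings.json mentioned in the phase list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual schema or checker: the field names in `REQUIRED_FIELDS`, the `check_finding` helper, and the sample file are all hypothetical. It combines the two mechanical ideas from the mapping (schema adherence, then line numbers and function names) and deliberately says nothing about whether the bug is real.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"id": str, "file": str, "line": int, "function": str, "bug_class": str}


def check_finding(finding: Dict[str, Any], sources: Dict[str, str]) -> List[str]:
    """Return a list of mechanical problems; an empty list means the finding passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(finding.get(name), expected):
            problems.append(f"field '{name}' missing or not {expected.__name__}")
    if problems:
        return problems  # skip reference checks on a malformed record

    text = sources.get(finding["file"])
    if text is None:
        return [f"file '{finding['file']}' not found"]
    line_count = len(text.splitlines())
    if not 1 <= finding["line"] <= line_count:
        problems.append(f"line {finding['line']} outside 1..{line_count}")
    if finding["function"] not in text:
        problems.append(f"function '{finding['function']}' not present in file")
    return problems


# Invented sample data: one well-formed finding, one citing code that does not exist.
sources = {"src/parse.c": "int parse_header(char *buf) {\n  return buf[0];\n}\n"}
findings = json.loads(
    '[{"id": "F1", "file": "src/parse.c", "line": 2,'
    '  "function": "parse_header", "bug_class": "oob-read"},'
    ' {"id": "F2", "file": "src/parse.c", "line": 40,'
    '  "function": "parse_body", "bug_class": "oob-read"}]'
)
results = {f["id"]: check_finding(f, sources) for f in findings}
```

A check like this is cheap enough to run on every finding before any model time is spent on re-verification.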

Looking at the coverage metrics, a single run finds only about half the bugs you'd catch across multiple runs. In our experience, the ones it does find skew toward the simpler and less subtle. Once your process is basically "run it ten times and diff by hand," you probably need to start looking at a real harness.
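The "run it ten times and diff by hand" step is straightforward to automate once findings have a stable key. Here is a small sketch, assuming a finding can be reduced to a (file, function, bug class) tuple; that shape, the `fingerprint` normalization, and the sample runs are invented for illustration and are not measurements from this post.

```python
import hashlib
from typing import Dict, List, Tuple

Finding = Tuple[str, str, str]  # (file, function, bug_class); hypothetical shape


def fingerprint(finding: Finding) -> str:
    """Stable key so the same bug reported by different runs collapses to one entry."""
    normalized = "|".join(part.strip().lower() for part in finding)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def merge_runs(runs: List[List[Finding]]) -> Dict[str, Finding]:
    """Union of all runs, keeping the first report seen for each fingerprint."""
    merged: Dict[str, Finding] = {}
    for run in runs:
        for finding in run:
            merged.setdefault(fingerprint(finding), finding)
    return merged


def single_run_coverage(runs: List[List[Finding]], merged: Dict[str, Finding]) -> List[float]:
    """Fraction of the merged set that each individual run found on its own."""
    if not merged:
        return []
    return [len({fingerprint(f) for f in run}) / len(merged) for run in runs]


# Invented sample: three runs over the same target with partial overlap.
runs = [
    [("a.c", "parse", "oob-read"), ("b.c", "auth", "bypass")],
    [("A.c", "parse", "OOB-read"), ("c.c", "copy", "overflow")],
    [("d.c", "free_buf", "uaf")],
]
merged = merge_runs(runs)
coverage = single_run_coverage(runs, merged)
```

Real deduplication usually needs fuzzier matching than an exact normalized key, since two runs may describe the same bug at different lines or with different class labels.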

While running and fine-tuning the skill, we ran into three walls:

- Context exhaustion: An hour in, the context window fills up and the model will cannibalize its own memory, instantly forgetting the bugs it spent all morning tracking down. We broke this bottleneck by externalizing the state entirely, treating the LLM as a stateless compute engine.
- Persistence: A crash mid-run means starting over. Losing hours of work to one AI rate-limit error or connection flakiness is an incredibly expensive way to realize you need a better architecture.
- Cross-repo reasoning: A single-repo session is completely blind to how other applications consume that repo, and the number of bugs that surface when you inspect the interfaces between components is larger than you might expect.
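The first two walls above both come down to keeping state outside the model. As a rough sketch of what that can look like, the following uses Python's standard `sqlite3` module as a task store; the table layout, status values, and function names are assumptions made for illustration, and the claim step is only safe for a single worker process.

```python
import sqlite3


def recover(conn: sqlite3.Connection) -> None:
    # Anything left 'running' belongs to a process that died; make it claimable again.
    conn.execute("UPDATE tasks SET status = 'pending' WHERE status = 'running'")
    conn.commit()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id     INTEGER PRIMARY KEY,
               stage  TEXT NOT NULL,
               target TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               result TEXT,
               UNIQUE (stage, target)
           )"""
    )
    recover(conn)
    return conn


def enqueue(conn: sqlite3.Connection, stage: str, target: str) -> None:
    # The UNIQUE constraint makes re-enqueueing the same investigation a no-op.
    conn.execute("INSERT OR IGNORE INTO tasks (stage, target) VALUES (?, ?)", (stage, target))
    conn.commit()


def claim_next(conn: sqlite3.Connection):
    row = conn.execute(
        "SELECT id, stage, target FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is not None:
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
        conn.commit()
    return row


def complete(conn: sqlite3.Connection, task_id: int, result: str) -> None:
    conn.execute("UPDATE tasks SET status = 'done', result = ? WHERE id = ?", (result, task_id))
    conn.commit()


conn = open_store()
enqueue(conn, "hunt", "repo-a:sql-injection")
enqueue(conn, "hunt", "repo-a:sql-injection")  # duplicate, ignored
enqueue(conn, "hunt", "repo-a:path-traversal")
first = claim_next(conn)
recover(conn)              # simulate a crash and restart before the task finished
again = claim_next(conn)   # the same task comes back instead of being lost
complete(conn, again[0], "no finding")
```

Each model call then receives only the task row it needs, so a full context window or a rate-limit error costs one task rather than the whole run.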

ADVICE: A real but minimal harness consists of just Recon, Hunt, and Validate stages kept in a database, alongside a separate Validator that can't file its own findings.
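To make that advice concrete, here is one possible shape for it, offered as a sketch rather than the design used in this post. The three tables, the role-specific store classes, and `trusted_queue` are hypothetical names. The point it illustrates is that the Validator is handed an interface with no way to create findings, only to record verdicts on existing ones. This is separation at the API level, not a hard security boundary.

```python
import sqlite3

SCHEMA = """
CREATE TABLE recon    (id INTEGER PRIMARY KEY, note TEXT NOT NULL);
CREATE TABLE findings (id INTEGER PRIMARY KEY, summary TEXT NOT NULL);
CREATE TABLE verdicts (finding_id INTEGER NOT NULL REFERENCES findings(id),
                       confirmed  INTEGER NOT NULL,
                       reason     TEXT NOT NULL);
"""


class ReconStore:
    """Handle for the Recon stage: writes architecture notes."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def write_note(self, note: str) -> None:
        self._conn.execute("INSERT INTO recon (note) VALUES (?)", (note,))
        self._conn.commit()


class HunterStore:
    """Handle for the Hunt stage: reads recon notes and files findings."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def notes(self):
        return [row[0] for row in self._conn.execute("SELECT note FROM recon ORDER BY id")]

    def file_finding(self, summary: str) -> int:
        cur = self._conn.execute("INSERT INTO findings (summary) VALUES (?)", (summary,))
        self._conn.commit()
        return cur.lastrowid


class ValidatorStore:
    """Handle for the Validate stage: reads findings and records verdicts only."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def pending(self):
        return self._conn.execute(
            "SELECT id, summary FROM findings "
            "WHERE id NOT IN (SELECT finding_id FROM verdicts) ORDER BY id"
        ).fetchall()

    def record_verdict(self, finding_id: int, confirmed: bool, reason: str) -> None:
        self._conn.execute(
            "INSERT INTO verdicts (finding_id, confirmed, reason) VALUES (?, ?, ?)",
            (finding_id, int(confirmed), reason),
        )
        self._conn.commit()


def trusted_queue(conn: sqlite3.Connection):
    """Findings that a validator has confirmed."""
    return conn.execute(
        "SELECT f.id, f.summary FROM findings f "
        "JOIN verdicts v ON v.finding_id = f.id WHERE v.confirmed = 1 ORDER BY f.id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
recon, hunter, validator = ReconStore(conn), HunterStore(conn), ValidatorStore(conn)
recon.write_note("HTTP parser handles untrusted input")
hunter.file_finding("unchecked length in parse_header")
hunter.file_finding("possible race in cache refresh")
for finding_id, summary in validator.pending():
    # Stub verdict; a real validator would try to disprove the finding.
    validator.record_verdict(finding_id, confirmed="length" in summary, reason="stub check")
queue = trusted_queue(conn)
```

Keeping the verdicts in their own table also preserves rejected findings, which is useful when a later run wants to re-scope or cross-reference them.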

Originally published at blog.cloudflare.com
