How we built Cloudflare's data platform and an AI agent on top of it

AI summary · key takeaways

• Town Lake consolidates data from dozens of production databases, ClickHouse clusters, Kafka streams, and cloud storage into a single SQL interface, eliminating the need to navigate multiple systems with different credentials and query languages
• The lakehouse architecture using Apache Trino and R2 with Apache Iceberg enables cost-effective storage with automatic data compaction and retention policies, while maintaining queryability across hot and cold data
• Skipper AI agent democratizes data access by translating natural language questions into SQL queries, with built-in security controls, PII detection, and time-bounded permissions for auditable access
• Building on Cloudflare's own platform (R2, Workers, Access, Workflows) demonstrates internal validation of their products while solving critical data infrastructure challenges
• The platform addresses both sampled data for dashboards and unsampled data for critical use cases like billing and security investigations, providing appropriate data fidelity based on query requirements

Cloudflare processes more than a billion events every second. Our network spans 330+ cities in 120+ countries. Behind every HTTP request, every Worker invocation, every R2 read operation, there is data, and a lot of it.

For years, that data was not very easy to access. It lived in dozens of production databases, ClickHouse clusters, Kafka streams, Google Cloud buckets, BigQuery datasets, and a long tail of pipelines. To answer a simple question like "How many domains that signed up today are in the Top 100 by traffic?", an analyst at Cloudflare had to know which system to ask, what credentials to use, what query language to write, and whether the data they were looking at was sampled, fresh, or seven days stale. As a result, it was difficult to glean informed insights from the data.

To solve this problem, we built two in-house tools: Town Lake, Cloudflare's unified data analytics platform, and Skipper, an AI data agent that runs on top of it.

Town Lake is a single SQL interface to everything Cloudflare knows, and Skipper is how anyone at Cloudflare can ask questions in plain English and get correct, auditable answers back in seconds. This is the story of how we built both.

The shape of the problem

If you have ever worked at a company that went through a hyper-growth period, you know what data sprawl looks like. Ours had a few specific symptoms:

Too many disparate systems. A product engineer who wanted to investigate a customer issue might need to query Postgres for account metadata, ClickHouse for analytics events, BigQuery for usage rollups, R2 for raw logs, and Kafka topics for real-time signals. Each system had its own credentials, its own language, and its own retention policy.

Sampled data. Sampling is fine for dashboards, but it doesn’t work for domains like billing. Our analytics pipeline downsamples to handle 700M+ events per second. That is the right behavior when you want an analytics dashboard to load, but it’s exactly the wrong behavior when you are trying to compute the usage behind someone’s invoice.

External dependencies for internal data. Parts of our previous internal reporting stack were powered by external vendors. Beyond the cost, we had a hard external dependency on another cloud for some of our critical data.

No one could find the data. Even if you had all the right credentials, you needed to know that the right table for "Billable Workers requests by account" lived in a specific ClickHouse cluster, in a specific schema, joined to a specific Postgres dimension table, and that the join required an obscure customer ID translation.

There was too much tribal knowledge.

We had a cultural challenge too: data infrastructure had historically been treated as a back-office function in service of the business, rather than as critical infrastructure in its own right.

What we wanted

We wanted to create one place where anyone at the company with appropriate permissions and a need to know could get answers to questions about Cloudflare: “Show me the top 100 customers by revenue in the last quarter”, “List all Bot Management ML scoring events with score > 0.9 in the last 48 hours coming from a specific ASN”, “Find the Top 100 billing support tickets from customers who have spent >$100”, etc.

We wanted that place to give fresh, accurate, unsampled data for the queries that need it (like billing or security investigations) and fast, downsampled data for the queries that don't (like dashboards or exploration).

We wanted security and governance baked in, with personally identifiable information (PII) detected automatically, and sensitive tables locked down by default.

All access should be auditable, with time-bounded permission grants so that users can only access data while they are actively working on tasks that require it.

We wanted it to be built on Cloudflare’s own platform: R2 for storage, Workers for compute, Cloudflare Access for authentication, Workflows for orchestration. If we were going to make a major investment in our data infrastructure, it was going to be built on the same products we sell to customers.

And we wanted, eventually, an interface that did not require knowing any SQL. The goal was to empower anyone at the company with appropriate permissions and a need to know to look at the stream of data flowing through our network, not just analysts. That last requirement is what became Skipper.
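The time-bounded, auditable grants listed among the requirements above can be pictured with a small sketch. This is only an illustration in plain Python, not how Town Lake implements access control: the `Grant` class, the `can_query` helper, the table name, and the eight-hour lifetime are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """A hypothetical permission grant that expires on its own."""
    user: str
    table: str
    reason: str            # recorded so the access can be audited later
    granted_at: datetime
    ttl: timedelta

    def is_active(self, at: datetime) -> bool:
        return self.granted_at <= at < self.granted_at + self.ttl


def can_query(grants, user, table, at, audit_log):
    """Allow the query only if an unexpired grant exists, and log the decision."""
    allowed = any(
        g.user == user and g.table == table and g.is_active(at) for g in grants
    )
    audit_log.append((at.isoformat(), user, table, "allow" if allowed else "deny"))
    return allowed


t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
grants = [
    Grant("analyst@example.com", "billing.invoices", "ticket review", t0, timedelta(hours=8))
]
audit_log = []

# One hour into the grant the query is allowed; nine hours in, it has expired.
during = can_query(grants, "analyst@example.com", "billing.invoices",
                   t0 + timedelta(hours=1), audit_log)
after = can_query(grants, "analyst@example.com", "billing.invoices",
                  t0 + timedelta(hours=9), audit_log)
```

Every decision, allowed or denied, lands in the audit log, so expiry and auditability come from the same check.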

Town Lake, the platform

At its core, our data platform’s architecture is a data lakehouse: a query engine that reads from object storage, with a metadata layer that makes the storage behave like a database. We call it Town Lake, after its namesake in Austin, Texas. Its most important components are:

Query engine. We chose Apache Trino for that: a single SQL query can join a Postgres table, a ClickHouse table, and an Iceberg table on R2 without materializing the intermediate results in a different system. A query that asks "what are the top 100 paying customers by Workers requests this week" compiles into a plan that pushes filters into ClickHouse, joins against an account dimension in Postgres, and ranks against billing rollups in R2, all in one go.
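To give a feel for what one statement joining several sources looks like, here is a toy stand-in that runs with nothing but Python's standard library. It does not use Trino: SQLite's `ATTACH` gives each pretend source its own schema name, which mimics catalog-qualified table names. The schema, table, and column names and the sample rows are hypothetical.

```python
import sqlite3

# Three attached in-memory databases play the role of three separate systems.
conn = sqlite3.connect(":memory:")
for source in ("postgres", "clickhouse", "iceberg"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {source}")

conn.executescript("""
    CREATE TABLE postgres.accounts (account_id INTEGER, name TEXT);
    CREATE TABLE clickhouse.workers_requests (account_id INTEGER, day TEXT, requests INTEGER);
    CREATE TABLE iceberg.billing_rollups (account_id INTEGER, amount_usd REAL);

    INSERT INTO postgres.accounts VALUES (1, 'acme'), (2, 'globex'), (3, 'initech');
    INSERT INTO clickhouse.workers_requests VALUES
        (1, '2025-01-06', 500), (1, '2025-01-07', 700),
        (2, '2025-01-06', 900), (3, '2025-01-07', 5000);
    INSERT INTO iceberg.billing_rollups VALUES (1, 120.0), (2, 80.0), (3, 0.0);
""")

# One statement: filter the event table, join the account dimension,
# keep only paying accounts, and rank by request volume.
TOP_PAYING_BY_REQUESTS = """
    SELECT a.name, SUM(r.requests) AS total_requests, b.amount_usd
    FROM clickhouse.workers_requests AS r
    JOIN postgres.accounts AS a ON a.account_id = r.account_id
    JOIN iceberg.billing_rollups AS b ON b.account_id = r.account_id
    WHERE r.day >= :week_start AND b.amount_usd > 0
    GROUP BY a.name, b.amount_usd
    ORDER BY total_requests DESC
    LIMIT 100
"""

top_customers = conn.execute(
    TOP_PAYING_BY_REQUESTS, {"week_start": "2025-01-06"}
).fetchall()
```

In a real federated engine the interesting work is in the planner, which decides which filters and aggregations to push down to each source. The toy only shows the shape of the query the user writes.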

R2 Data Catalog, our managed Apache Iceberg service, is where the cold and warm data lives. Iceberg gives us schema evolution, time travel, partition evolution, and the ability to compact data as it ages. Per-minute usage from last week becomes hourly, hourly from last quarter becomes daily, etc.
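The age-based rollup described above can be sketched in a few lines. This is a plain-Python illustration of the idea, not the Iceberg or R2 Data Catalog API, and the thresholds (7 days for hourly, 90 days for daily) are assumptions read loosely from "last week" and "last quarter".

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed thresholds: older than a week is kept hourly, older than a quarter daily.
HOURLY_AFTER = timedelta(days=7)
DAILY_AFTER = timedelta(days=90)


def bucket_start(ts: datetime, now: datetime) -> datetime:
    """Truncate a timestamp to the granularity its age calls for."""
    age = now - ts
    if age >= DAILY_AFTER:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if age >= HOURLY_AFTER:
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(second=0, microsecond=0)  # recent data stays per-minute


def compact(rows, now):
    """Sum (account_id, timestamp, requests) rows into age-appropriate buckets."""
    totals = defaultdict(int)
    for account_id, ts, requests in rows:
        totals[(account_id, bucket_start(ts, now))] += requests
    return dict(totals)


now = datetime(2025, 6, 1, 12, 0)
rows = [
    ("acct-1", datetime(2025, 6, 1, 11, 58), 10),   # recent: stays per-minute
    ("acct-1", datetime(2025, 6, 1, 11, 59), 12),
    ("acct-1", datetime(2025, 5, 20, 9, 15), 5),    # older than a week: hourly
    ("acct-1", datetime(2025, 5, 20, 9, 45), 7),
    ("acct-1", datetime(2025, 1, 10, 3, 0), 100),   # older than a quarter: daily
    ("acct-1", datetime(2025, 1, 10, 22, 30), 50),
]
compacted = compact(rows, now)
```

Six input rows become four, and the total request count is unchanged. Only the time resolution of the older data is given up.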

The storage cost decreases as the data ages, while the data stays queryable. Parquet files in R2 are much cheaper than keeping the same data in an OLAP database.

DataHub is our metadata catalog. Every table, column, owner, lineage edge, and glossary term lives there. When a user asks "what's in townlake.dim.accounts," DataHub provides an answer, including the table description, the column descriptions, the owning team, the upstream tables that feed it, and the downstream tables that consume it.
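As a rough picture of the information such an answer bundles together, here is a minimal in-memory sketch. It is not DataHub's data model or API: the `TableEntry` class, the helper functions, the owners, the descriptions, and every table name other than townlake.dim.accounts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    name: str
    description: str
    owner: str
    columns: dict                                  # column name -> column description
    upstream: list = field(default_factory=list)   # tables that feed this one


def downstream_of(catalog, name):
    """Tables that consume `name`, derived by inverting the upstream edges."""
    return sorted(t.name for t in catalog.values() if name in t.upstream)


def describe(catalog, name):
    """Assemble the answer to "what's in <table>?" from the catalog entries."""
    entry = catalog[name]
    return {
        "description": entry.description,
        "columns": entry.columns,
        "owner": entry.owner,
        "upstream": entry.upstream,
        "downstream": downstream_of(catalog, name),
    }


entries = [
    TableEntry("townlake.raw.signups", "Hypothetical raw signup events.", "ingest-team",
               {"account_id": "Account identifier"}),
    TableEntry("townlake.dim.accounts", "Hypothetical: one row per account.", "data-platform",
               {"account_id": "Account identifier", "plan": "Current plan"},
               upstream=["townlake.raw.signups"]),
    TableEntry("townlake.reports.revenue", "Hypothetical revenue report.", "finance-data",
               {"account_id": "Account identifier", "revenue_usd": "Revenue in USD"},
               upstream=["townlake.dim.accounts"]),
]
catalog = {e.name: e for e in entries}
answer = describe(catalog, "townlake.dim.accounts")
```

Only upstream edges are stored here. Downstream consumers are derived by scanning for tables that name the entry as an upstream, so the two directions of lineage cannot drift apart.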

Originally published at blog.cloudflare.com
