Security Insights provides actionable security recommendations for every Cloudflare account. To find these insights, we perform regular scans for all accounts, zones, and DNS records, looking for potential security risks and misconfigurations. However, two key issues emerged.
First, our scans were too infrequent. Scans were only being performed every week or two, and therefore newly introduced security risks could remain undetected for up to two weeks. Second, automatic scanning was opt-in for many free plan accounts – meaning lots of accounts weren’t being scanned at all.
The risks of infrequent or nonexistent scans are rising: as automated attacks accelerate, the window for detecting security misconfigurations is shrinking. Making sure that we’re finding these issues for all of our customers is crucial to our aim of building a better Internet for everyone. We calculated that to increase our scanning frequencies and enable automatic scanning for all accounts, we would need to increase our scanning throughput by around 10x on average – from 10 scans per second to 100 per second.
But our system was already struggling with its load: millions of events were filling up our backlog waiting to be processed; our API was frequently timing out; our processes were crashing. We needed to fix our system, and we needed to make it scale . This is the story of how we increased scanning throughput for Security Insights by more than 10x, enabled security insights for millions of customers, and doubled our scanning frequency for all customers.
Read on to find out how we achieved these improvements. How we scan for security insights At a high level, our automatic security scans are triggered by a scheduler. When an account or zone is due for a scan, the scheduler publishes a message (or messages) to Apache Kafka , an open-source distributed event streaming platform.
These messages fan out to a number of checkers: specialized Go microservices that scan specific assets or configurations. For every message, each checker sends its results (the security insights that it found) to our internal API, which then persists these in a Postgres database. Making it scale Scaling Kafka Apache Kafka is not strictly a queue : it is a partitioned event stream (though recently gained queue semantics ).
Within a partition, messages must be consumed and processed in order. This differs from typical queues where messages may be consumed in order but are processed out-of-order. As a result, we can only have one active consumer per partition within a consumer group .
This has two consequences for us: Messages that are slow to process block the consumer from progressing to the next message For each checker, we can only have as many consumers as there are partitions (each checker has its own consumer group) We could have tried to scale by adding more partitions. However, this would have increased resource usage for the Kafka broker itself, which is shared by many other services.
We reserved this as a last resort, aiming to improve our code and architecture first. Introducing parallel processing Although we can only consume messages in order, there is nothing stopping us from consuming multiple messages at once. We changed our checkers to consume messages in batches , processing each message in a separate goroutine.
The trade-offs are that we’d have more work to re-do if our process crashed midway through a batch, and our memory usage would be slightly increased. In our case, these were both acceptable. Avoiding head-of-line blocking Some messages processed by a few of our checkers take much longer to process than others.
For example, one account/zone may have far more assets than another. In the worst case, these messages can take minutes or hours to process compared to the average case of seconds or milliseconds. We opted for a very simple approach: splitting our consumer groups and checkers in two – the ‘slow lane’ and the ‘fast lane’.
We could determine quickly whether a message would be slow or fast to process. If the ‘fast lane’ checker encounters a slow message, it skips it. This solved the problem: slow messages had the dedicated resources and time to be processed with minimal delay, and fast messages were able to proceed at their regular fast pace.
Optimizing our database queries Every insight we find gets written to our Postgres database. This is handled by a single API endpoint that our checkers invoke with a list of insights. The implementation looked like this: for _, issue := range issues { _, err = tx.
Exec(ctx, `INSERT INTO table ... VALUES ($1, $2, ...) ON CONFLICT DO UPDATE ...
`, ...) if err ! = nil { return err } } The astute reader will notice that for large sets of insights, this code makes a round trip to the database per insight.
With a maximum observed size of 500,000, this was half a million round trips, queries, and transactions in a single API call. We initially tried the gold standard for bulk inserts in Postgres: COPY into a temporary table. However, we found that this approach led to bloat in the Postgres system tables.
We settled on a hybrid approach: Using UNNEST when the number of issues was below a threshold Using COPY when the number of issues exceeded this threshold This provided the best of both worlds: reasonably fast inserts for huge sets of insights (seconds), and even faster inserts (milliseconds) for small sets of insights. Investigating our API timeouts We noticed several strange behaviours in our internal API as we tried to scale: A large number of requests were triggering client-side timeouts Many checkers were spending 20-90% of their processing time on a single API call When triggering a large volume of scans, our throughput would start high and deteriorate All of these problems had the same root cause: latency .
Our primary database is located in Portland, Oregon. Our API, however, was running active-active in both Portland and Amsterdam.
Originally published at blog.cloudflare.com


