Utopia Tech
Engineering · 4 min read

Proving application resilience on Azure with Chaos Studio


Utopia Tech

July 2, 2026 · 4 min read


Takeaway: Azure Chaos Studio helps organizations validate application resilience by simulating outages, failovers, network disruptions, and infrastructure failures before they impact production. You can’t be certain your application is resilient until that resilience is tested. It is better to find the gaps by deliberately breaking the application in a test environment and watching how it reacts than through a failure in production.

Azure Chaos Studio is our managed service for doing exactly that, safely and on purpose. Today, Azure Chaos Studio Workspaces is in public preview: a scenario-focused approach that lets you test the failure modes Azure customers actually see in production. We’ve been hard at work making Workspaces easy to use, with broad fault support and named scenarios that mirror real outages, instead of isolated faults.

Why designing for resilience isn’t enough

Azure customers have invested in resilient design: multi-zone deployments, geo-redundant storage, automatic database failover, retry logic, load-balanced front ends. The real question comes up when an incident begins: when the failure arrives, do those mechanisms recover your application in the time you assumed they would?

Real outages don’t read the architecture diagram. A zone-redundant deployment can fail because a health probe was misconfigured years ago. A database with automatic failover can leave the application down because a connection string is hard-coded to a single region.

Geo-redundant storage can briefly produce stale reads the application code never expected. These mistakes are common, and they only show up when the failure happens.

Reliability and resiliency on Azure are a shared responsibility. Microsoft is responsible for the platform and the resilience built into Azure services. Customers are responsible for configuring that resilience and the code that uses it. No layer makes up for a gap in another.
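On the customer side of that split, "the code that uses it" usually includes retry logic that has to survive a failover. The following is a minimal sketch of one common approach, capped exponential backoff with jitter, and not a prescribed Azure pattern. `TransientError` and `call_with_retry` are hypothetical names introduced for illustration; a real application would map its driver's connection errors onto whatever it treats as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a dropped connection."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation(), retrying only TransientError with capped backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Backoff ceiling grows as base, 2*base, 4*base, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from reconnecting in lockstep.
            sleep(random.uniform(0, ceiling))
```

Any other exception type propagates immediately, so a malformed query is not retried as if it were an outage. The `sleep` parameter exists only so the delay can be stubbed out in tests.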

The only way to know whether your architecture, configuration, and application logic will hold up in production is to prove they hold under failure before an outage tests them for you.

How Chaos Studio Workspaces changes resilience testing

Chaos Studio is Azure’s managed chaos engineering service for validating how applications behave under failure. By simulating controlled disruptions across infrastructure, networking, databases, and application dependencies, it helps teams uncover resilience gaps before customers experience them.

Chaos Studio Workspaces focuses on scenarios that match what happens in production, so you start from a real outage pattern instead of assembling individual faults. You begin with a named scenario like Zone Down, DNS Outage, or SQL failover, already sequenced against the resources in a Workspace.

Most outages exercise two layers at once. There’s the platform layer: did the service come back, did failover complete within your Recovery Time Objective, did traffic reroute. And there’s the application layer: did your code maintain data integrity, pick up in-flight transactions, retry the right things, degrade gracefully. A chaos test that only stops a Virtual Machine (VM) tells you about the platform layer.
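The platform-layer question of whether recovery fit inside the Recovery Time Objective can be answered from health-probe samples collected while a scenario runs. The sketch below assumes you already record such samples with your own monitoring; `Probe`, `recovery_seconds`, and `meets_rto` are hypothetical names and not part of Chaos Studio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    t: float        # seconds since the fault was injected
    healthy: bool   # outcome of one health check against the application


def recovery_seconds(probes: List[Probe]) -> Optional[float]:
    """Seconds from the first failed probe to the first healthy probe after the last failure.

    Returns 0.0 if no probe failed, and None if the application was still
    unhealthy at the end of the observation window.
    """
    ordered = sorted(probes, key=lambda p: p.t)
    failures = [p for p in ordered if not p.healthy]
    if not failures:
        return 0.0
    first_bad = failures[0].t
    last_bad = failures[-1].t
    later = [p for p in ordered if p.t > last_bad]
    if not later:
        return None  # never observed a recovery
    return later[0].t - first_bad


def meets_rto(probes: List[Probe], rto_seconds: float) -> bool:
    """True only if recovery was observed and took no longer than the RTO."""
    recovered = recovery_seconds(probes)
    return recovered is not None and recovered <= rto_seconds
```

A brief healthy probe in the middle of an outage does not count as recovery here, because the measurement runs to the first healthy probe after the last failure. The result is only as precise as the probe interval, and it says nothing about the application layer: data integrity and in-flight transactions need their own checks.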

The scenarios in Chaos Studio Workspaces are designed to validate the entire stack.

Workspaces reduce the burden of getting started. The most common reason resilience testing stalls is that teams don’t know where to start. The Workspace is the new top-level resource: you point it at a subscription or resource group, and its managed identity discovers what’s in scope and recommends the scenarios that apply. Those scenarios show up inside the Workspace, ready to configure and run, and a refresh updates the recommendations whenever your infrastructure changes.

A library of real outage scenarios. Chaos Studio Workspaces ships with curated scenarios informed by patterns observed in real Azure incidents, so the patterns you test against are the patterns customers actually experience. Think of these as resilience templates: a fast path to the failure modes most teams need to test. When you need something different, design your own from the same fault library.

Available today:

- Availability Zone Down: Virtual Machine Scale Sets (VMSS) shutdown with per-zone targeting to validate cross-zone routing and recovery.
- Availability Zone Down and Database failover: Compute Zone Down composed with Azure Database for PostgreSQL (Flexible Server) failover, to observe failover behavior against your configured recovery objectives and application-side connection handling.
- DNS Outage: a full DNS resolution outage via NSG rules that block resolver traffic, to validate how your application behaves when name resolution fails.
- Microsoft Entra ID Outage: identity-provider failure that exercises authentication retry, token caching, and fallback paths.
- Cache Stampede: Redis flush combined with database restart and an App Service process crash, to validate behavior under a cache-miss storm and the resulting database surge. The App Service process-crash variant currently supports Windows App Service plans.
- Event-Driven Messaging Disruption: Azure Service Bus and Event Hubs disable, to validate dead-letter handling and backpressure.

Behind every scenario are granular API-level actions built for Workspaces:

- Zonal VMSS shutdown
- App Service process kill
- Force-failover for Azure Database for PostgreSQL (Flexible Server)
- Azure Managed Redis flush
- Network Security Group (NSG)-based network controls

Each scenario composes the right faults automatically.
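The Cache Stampede scenario listed above empties the cache while the database is restarting, so many requests can miss on the same key at once. One common application-side defense is to let only one caller per key reload the value while the others wait for it. The sketch below is a minimal in-process illustration of that idea; `SingleFlightCache` is a hypothetical class, and a cache shared across instances would need cross-process coordination that this example does not attempt.

```python
import threading


class SingleFlightCache:
    """In-process cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader            # function(key) -> value, e.g. a database read
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects the two dictionaries above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            # One lock per key, so a slow load of one key does not block others.
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check: another thread may have loaded the value while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)    # only one thread per key reaches this line
            with self._guard:
                self._values[key] = value
            return value

    def flush(self):
        """Drop all cached values, similar in effect to a cache flush fault."""
        with self._guard:
            self._values.clear()
```

If the loader raises, nothing is cached and the next waiting caller tries again, so this pattern pairs naturally with a retry limit. Running the scenario is what shows whether the database surge stays within what the restarted server can absorb.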

When a curated scenario doesn’t match your workload, you can build your own. The new Scenario Designer is a drag-and-drop experience in the Azure portal for composing any of these faults into a custom scenario, arranging steps, branches, and faults with the same flexibility as classic Chaos Studio experiments, now available directly inside Workspaces.
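To make the steps, branches, and faults structure concrete, here is a small data-model sketch. It is an illustration only, not the Scenario Designer's schema or any Chaos Studio API. It assumes the ordering used by classic experiments, where steps run one after another, branches within a step run in parallel, and faults within a branch run in sequence. All class names, fault labels, and durations are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fault:
    name: str          # illustrative label, not a real fault identifier
    duration_s: int    # how long the fault is held, in seconds


@dataclass
class Branch:
    faults: List[Fault] = field(default_factory=list)

    def total_seconds(self) -> int:
        # Faults in one branch run back to back, so their durations add up.
        return sum(f.duration_s for f in self.faults)


@dataclass
class Step:
    branches: List[Branch] = field(default_factory=list)

    def total_seconds(self) -> int:
        # Branches run in parallel, so the slowest branch sets the step length.
        return max((b.total_seconds() for b in self.branches), default=0)


@dataclass
class Scenario:
    name: str
    steps: List[Step] = field(default_factory=list)

    def total_seconds(self) -> int:
        # Steps run sequentially.
        return sum(s.total_seconds() for s in self.steps)


custom = Scenario("cache-and-db-disruption", steps=[
    Step(branches=[
        Branch(faults=[Fault("cache flush", 30)]),
        Branch(faults=[Fault("database restart", 120)]),
    ]),
    Step(branches=[Branch(faults=[Fault("app process kill", 60)])]),
])
```

Under those assumptions the first step lasts as long as its slowest branch, so `custom.total_seconds()` comes to 120 + 60 = 180 seconds. Working that number out before a run helps size the observation window for a custom scenario.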

Originally published at azure.microsoft.com
