Introducing the next generation of AWS Resilience Hub for generative AI-based SRE resilience journey

AI summary · key takeaways

• New modular resilience policy framework allows organizations to define composable requirements including SLOs, multi-AZ/multi-Region disaster recovery, and data recovery objectives rather than rigid policy types
• Generative AI-powered failure mode assessments analyze services against resilience policies, AWS Well-Architected best practices, and the AWS Resilience Analysis Framework to identify potential failures and provide actionable recommendations
• Automated dependency discovery uses DNS query log analysis to identify unknown dependencies including unexpected cross-Region calls and critical third-party endpoints
• AWS Organizations integration enables centralized resilience management and reporting across enterprise accounts from a single delegated administrator account
• Business-level application modeling maps critical end-user paths to business outcomes through systems, user journeys, and services with automatic topology discovery

Today, we’re announcing the next generation of AWS Resilience Hub with a significantly expanded experience that brings together a new application model, dependency discovery assessment, generative AI-powered failure mode analysis, modular resilience policies, and organization-wide reporting. Organizations running hundreds of applications share a common challenge: availability is a top concern, yet there is no consistent way to set resilience goals, measure progress, or prove compliance across a portfolio.

Teams set different standards, use different tools, and struggle to exchange information about whether applications actually meet expectations. The next generation of AWS Resilience Hub changes this by giving Site Reliability Engineers (SREs) and development teams a structured way to align on resilience policy expectations, help application teams achieve them, and demonstrate compliance through testing.

With integration into AWS Organizations, teams can now evaluate resilience at scale, identify failure modes, discover hidden dependencies, and report on progress across the enterprise. The next generation of Resilience Hub guides you through your resilience journey with the following built-in concepts:

• Resilience policy: You can define your resilience expectations through modular, composable requirements. Rather than choosing a single rigid policy type, you construct policies by selecting the requirements that matter to your application, such as service level objective (SLO), multi-AZ and multi-Region disaster recovery, and data recovery requirements.
• Business-level understanding: You can use new application modeling through critical end-user paths that map directly to business outcomes. Systems represent a business application, user journeys describe critical business paths, and services are the deployable units comprising AWS resources, code, and observability. Resilience Hub automatically discovers and maps them into a topology showing how resources connect.
• AI failure mode assessments: You can run generative AI-powered assessments that analyze your services against your defined resilience policies, AWS Well-Architected best practices, and the AWS Resilience Analysis Framework. These assessments identify potential failure modes and provide actionable recommendations.
• Dependency discovery assessment: You can automatically discover AWS services, internal endpoints, and third-party endpoints that your services depend on. This dependency assessment uses DNS query log analysis to identify dependencies you may not know about, including unexpected cross-Region calls or critical third-party dependencies.
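To illustrate the idea behind DNS-based dependency discovery, here is a minimal Python sketch that sorts queried domain names into AWS, cross-Region AWS, internal, and third-party buckets. It is not how Resilience Hub implements the feature. The record shape, the `HOME_REGION` value, the example domain names, and the internal suffix list are all assumptions made for illustration, and the Region detection is a simple heuristic that does not cover every AWS endpoint naming pattern.

```python
import re
from collections import Counter

# Assumption: the Region where the service is expected to run.
HOME_REGION = "us-east-1"

# Assumption: suffixes of your own private DNS zones.
INTERNAL_SUFFIXES = (".internal.example.net",)

# Heuristic pattern for Region labels such as us-east-1 or us-gov-west-1.
REGION_LABEL = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")

# Hypothetical, simplified DNS query log records (real logs carry more fields).
records = [
    {"query_name": "dynamodb.us-east-1.amazonaws.com."},
    {"query_name": "sqs.us-west-2.amazonaws.com."},
    {"query_name": "api.payments.example.com."},
    {"query_name": "orders.internal.example.net."},
    {"query_name": "dynamodb.us-east-1.amazonaws.com."},
]


def normalize(query_name: str) -> str:
    """Strip the trailing root dot and lowercase the name."""
    return query_name.rstrip(".").lower()


def classify(query_name: str, home_region: str = HOME_REGION) -> str:
    """Return a coarse dependency category for one queried name."""
    name = normalize(query_name)
    if name.endswith(".amazonaws.com"):
        labels = name.split(".")
        # For names like sqs.us-west-2.amazonaws.com the Region is the
        # third label from the end; shorter names have no Region label.
        region = labels[-3] if len(labels) >= 4 else ""
        if REGION_LABEL.match(region) and region != home_region:
            return "aws-cross-region"
        return "aws"
    if name.endswith(INTERNAL_SUFFIXES):
        return "internal"
    return "third-party"


# Count how often each (category, name) pair appears in the logs.
summary = Counter(
    (classify(r["query_name"]), normalize(r["query_name"])) for r in records
)

for (category, name), count in sorted(summary.items()):
    print(f"{category:18} {name:40} {count}")
```

In this sample data, the `sqs.us-west-2` lookup is the kind of unexpected cross-Region call the assessment is meant to surface, and `api.payments.example.com` stands in for a critical third-party endpoint.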

The next generation of AWS Resilience Hub in action

To get started, you configure a resilience policy, set up your first system and service, run a failure mode assessment, review the results, and implement the findings. Before you begin, you should set up the invoker IAM role, which grants Resilience Hub read-only access to your AWS resources, cross-account roles (if not using AWS Organizations), or service-linked roles (SLRs) with AWS Organizations.

Resilience Hub also integrates with AWS Organizations to enable organization-wide resilience management from a single delegated administrator account. This eliminates the need to log in to individual accounts to assess resilience posture across your enterprise. To learn more, see the prerequisite details in the AWS Resilience Hub User Guide.

To configure a resilience policy, choose Create policy in the Policies menu of the AWS Resilience Hub console. Enter a policy name and description, and choose the resilience requirements. For example, you can create a reusable policy for multi-Region disaster recovery used in financial applications, including a 99.95% availability SLO, a 15-minute recovery time objective (RTO) and a 5-minute recovery point objective (RPO) for multi-Region disaster recovery, and a disaster recovery approach that aligns with your RTO and RPO requirements. If you choose data recovery requirements, you can define the data recovery time objective for restoring from backups for each service associated with this policy.

To create your first system representing your business application, choose Create a system in the Systems menu.
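To put numbers on the example policy above, the following Python sketch models a composable policy and converts the availability SLO into a downtime budget. The class names, field names, and policy name are hypothetical and are not the Resilience Hub API; only the 99.95% SLO, 15-minute RTO, and 5-minute RPO come from the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class DisasterRecoveryRequirement:
    """Hypothetical container for one disaster recovery requirement."""
    rto_minutes: int
    rpo_minutes: int


@dataclass(frozen=True)
class ResiliencePolicy:
    """Hypothetical composable policy: each requirement is optional."""
    name: str
    availability_slo_percent: Optional[Decimal] = None
    multi_region_dr: Optional[DisasterRecoveryRequirement] = None


def downtime_budget_minutes(slo_percent: Decimal, window_days: int) -> Decimal:
    """Minutes of downtime an availability SLO allows over a window."""
    window_minutes = Decimal(window_days * 24 * 60)
    return (Decimal(100) - slo_percent) / Decimal(100) * window_minutes


policy = ResiliencePolicy(
    name="financial-multi-region-dr",  # illustrative name
    availability_slo_percent=Decimal("99.95"),
    multi_region_dr=DisasterRecoveryRequirement(rto_minutes=15, rpo_minutes=5),
)

budget_30d = downtime_budget_minutes(policy.availability_slo_percent, 30)
budget_365d = downtime_budget_minutes(policy.availability_slo_percent, 365)

print(f"30-day downtime budget:  {budget_30d} minutes")
print(f"365-day downtime budget: {budget_365d} minutes")
```

A 99.95% SLO allows about 21.6 minutes of downtime in a 30-day window, so a single failover that uses the full 15-minute RTO would consume roughly two-thirds of that budget. This is one reason to choose the SLO and the RTO together rather than independently.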

Optionally, you can enable AWS Organizations account access for this system. Next, create a service that represents a deployable unit, such as one of your microservices, associate it with your system, and tell Resilience Hub where to find your resources. Enter a service name (for example, stock-exchange-service), then choose your resilience policy and the invoker IAM role name.

You can choose the service Regions and identify the service resources through resource tags, an AWS CloudFormation stack, a Terraform state file location, or an Amazon EKS cluster and namespace. When you enable dependency discovery for this service, Resilience Hub examines the DNS query logs for the VPCs associated with the resources in your service. You can disable this feature anytime from the dependency discovery settings on the service details page.

With the service created and a policy applied, you can run your first assessment. Choose Run failure mode assessment on your service page and wait for the assessment to complete. During the assessment, Resilience Hub assumes your invoker role, reads resources from your configured input sources, identifies parent-child relationships, queries the application topology service to map connections between resources, and builds a topology showing data flow, containment, and permissions.
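As a rough mental model of the topology step, the Python sketch below builds a small graph from a resource inventory with parent-child (containment) links and a list of connections, then serializes it to JSON. The resource IDs, field names, and output shape are invented for illustration and do not reflect the format Resilience Hub produces.

```python
import json
from collections import defaultdict

# Hypothetical resource inventory with containment (parent) references.
resources = [
    {"id": "vpc-1", "type": "vpc", "parent": None},
    {"id": "subnet-a", "type": "subnet", "parent": "vpc-1"},
    {"id": "subnet-b", "type": "subnet", "parent": "vpc-1"},
    {"id": "api-fn", "type": "function", "parent": "subnet-a"},
    {"id": "orders-db", "type": "database", "parent": "subnet-b"},
]

# Hypothetical connections between resources: (source, target, kind).
connections = [
    ("api-fn", "orders-db", "data-flow"),
    ("api-fn", "unknown-queue", "data-flow"),  # target not in the inventory
]


def build_topology(resources, connections):
    """Combine containment and connection data into one graph structure."""
    children = defaultdict(list)
    for resource in resources:
        if resource["parent"] is not None:
            children[resource["parent"]].append(resource["id"])

    known_ids = {resource["id"] for resource in resources}
    nodes = [
        {
            "id": resource["id"],
            "type": resource["type"],
            "children": sorted(children.get(resource["id"], [])),
        }
        for resource in resources
    ]
    # Keep only edges whose endpoints both exist in the inventory.
    edges = [
        {"source": source, "target": target, "kind": kind}
        for source, target, kind in connections
        if source in known_ids and target in known_ids
    ]
    return {"nodes": nodes, "edges": edges}


topology = build_topology(resources, connections)
print(json.dumps(topology, indent=2))
```

Keeping containment (the `children` lists) separate from connections (the `edges`) mirrors the distinction the assessment draws between parent-child relationships and data flow between resources.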

By choosing Service topology, you can see service resources grouped by service functions in graph, table, or JSON format. By choosing Failure mode guidance, you can add assertions used to guide the agents while performing the failure mode assessment. Assertions are either generated by the agent or added by users. You can update them to improve assessment accuracy.

Originally published at aws.amazon.com

