Beyond Backups: Architecture That Doesn’t Blink
During the October 20th AWS outage, not one Control Plane customer went down. Here’s the architecture that made that possible, and how to assess your own.
[Image: a tree still growing after it was cut]
At 3:11 a.m. ET on October 20, 2025, a DNS race condition inside AWS’s DynamoDB management system deleted a regional endpoint record. Not a cyberattack. Not a hardware failure. A timing issue between two automated components overwrote a newer DNS configuration with a stale one, and then the cleanup automation deleted it entirely. Within minutes, anything in US-East-1 that depended on DynamoDB couldn’t find it. That meant EC2. That meant Lambda. That meant load balancers trying to route around the problem. In all, 113 AWS services were affected over more than 15 hours.
Snapchat endured nearly 12 hours of global login failures. Signal went dark. Disney+, Fortnite, Coinbase, Duolingo, Pinterest, and Alexa followed. For the companies behind those names, the outage stopped being a technology problem by sunrise. It became a revenue problem, a customer-trust problem, and, in some cases, a board-level conversation nobody wanted to have.
What made it worse: services with multi-cloud failover, like Google’s apps and Meta’s platforms, remained largely unaffected. The outage didn’t hit everyone equally. It hit the unprepared.
At a median cost of $1.4 million per hour of downtime, a 12-hour outage isn’t a line item; it’s a material hit to the business. And the failure mode wasn’t exotic: a race condition cascading through tightly coupled dependencies is a known architectural risk. The trigger was ordinary. The architecture made it catastrophic.
High Availability, Disaster Recovery, and Business Continuity are often treated as checkboxes. In practice, they are the decisions that determine whether your business absorbs a moment like October 20th, or becomes part of the outage headline.
Resilience Is Not One Thing
HA, DR, and BC are often used interchangeably. That’s an architectural mistake. They address fundamentally different failure modes and operate on different timescales, and conflating them leaves organizations with systems that are expensive to run but brittle under real pressure.
High Availability (HA): Absorbing Failure in Real Time
High Availability is about keeping systems running while something is actively going wrong. A node dies, a zone degrades, a service becomes unresponsive; HA architecture absorbs those events without users noticing. The key word is absorbing. HA isn’t about preventing failure; it’s about designing systems where failure is a normal operating condition, not an emergency.
The most common HA mistake is confusing redundancy with resilience. Adding a standby node in the same availability zone doesn’t make you highly available; it gives you two points of failure in the same blast radius. True HA requires distributing workloads across independent failure domains and choosing active-active over active-passive wherever possible. In an active-passive setup, your standby infrastructure sits idle until needed, and when it’s needed most, during a failure, it has to cold-start under pressure. Active-active means every node is already carrying live traffic. Failure doesn’t trigger a switchover; it’s already absorbed before the pager goes off.
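To make the distinction concrete, here is a minimal Python sketch of the active-active pattern. The regional endpoints and the `/healthz` path are hypothetical stand-ins for your own service discovery and health checks, not a production router.

```python
# Minimal active-active sketch: every region serves live traffic, and an
# unhealthy region is simply skipped on the next request. Endpoints and the
# /healthz path are hypothetical placeholders.
import itertools
import urllib.request

REGION_ENDPOINTS = [
    "https://us-east-1.app.example.internal",
    "https://us-west-2.app.example.internal",
    "https://eu-west-1.app.example.internal",
]

_rotation = itertools.cycle(REGION_ENDPOINTS)


def is_healthy(endpoint: str, timeout: float = 0.5) -> bool:
    """Cheap liveness probe; real systems layer deeper readiness checks on top."""
    try:
        with urllib.request.urlopen(f"{endpoint}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def pick_endpoint() -> str:
    """Round-robin across the regions that are currently answering.

    Every region already carries live traffic, so losing one just shrinks the
    rotation. There is no standby to warm up and no switchover step to run.
    """
    for _ in range(len(REGION_ENDPOINTS)):
        candidate = next(_rotation)
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy region available")
```

The point of the sketch is the branch that isn’t there: no failover procedure, no promotion of a standby, just a smaller pool until the failed region returns.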
During the October 20th AWS outage, this distinction played out in real time. Services running active-active across multiple regions kept serving traffic. Services relying on a warm standby in a secondary region found that spinning it up under a cascading failure was slower, messier, and less reliable than their runbooks had assumed.
Disaster Recovery (DR): Coming Back From Something You Couldn’t Absorb
Some failures exceed what HA can mask. Silent data corruption that propagates across replicas before anyone notices. Ransomware that encrypts not just your data, but your most recent backups. Disaster Recovery exists for exactly these scenarios, and its quality is measured in two numbers: how much data you can afford to lose (RPO – Recovery Point Objective) and how long you can afford to be down (RTO – Recovery Time Objective).
The architecture decisions that determine those numbers are made long before any incident. Continuous replication keeps RPO tight by mirroring writes across locations in near real time rather than relying on periodic snapshots alone. Immutable, versioned snapshots are what make ransomware recovery viable; without them, you risk restoring a clean backup of already-compromised data. And critically, recovery paths need to be tested under realistic conditions, not just documented.
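As a sketch of what testing those numbers can look like, the drill harness below measures achieved RPO from replication lag and achieved RTO from an actual timed restore. The timestamp sources and the restore call are stubs for whatever your replication and backup tooling exposes; only the measurement pattern is the point.

```python
# Hedged DR-drill sketch: measure RPO and RTO instead of assuming them.
# last_primary_write, last_replica_write, and restore_from_snapshot are
# placeholders for your real replication and restore tooling.
import time
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(seconds=30)
RTO_TARGET = timedelta(minutes=15)


def last_primary_write() -> datetime:
    """Timestamp of the most recent committed write on the primary (stub)."""
    return datetime.now(timezone.utc)


def last_replica_write() -> datetime:
    """Timestamp of the most recent write visible on the replica (stub)."""
    return datetime.now(timezone.utc) - timedelta(seconds=4)


def restore_from_snapshot() -> None:
    """Stand-in for the real restore procedure being exercised."""
    time.sleep(1)


def run_dr_drill() -> None:
    # Achieved RPO: how far the replica lags the primary at the failover moment.
    achieved_rpo = last_primary_write() - last_replica_write()

    # Achieved RTO: how long the restore path actually takes, wall clock.
    started = time.monotonic()
    restore_from_snapshot()
    achieved_rto = timedelta(seconds=time.monotonic() - started)

    print(f"achieved RPO: {achieved_rpo} (target {RPO_TARGET})")
    print(f"achieved RTO: {achieved_rto} (target {RTO_TARGET})")
    assert achieved_rpo <= RPO_TARGET, "data-loss window exceeds target"
    assert achieved_rto <= RTO_TARGET, "recovery time exceeds target"


if __name__ == "__main__":
    run_dr_drill()
```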
Most organizations discover their real RPO and RTO during an incident, not before it. The gap between the numbers in the DR plan and the numbers on the incident timeline is where businesses get hurt.
Business Continuity (BC): Keeping the Organization Running
Business Continuity is broader than infrastructure uptime. It’s the question of whether the organization can still function — financially, operationally, legally — during a prolonged disruption. A company can have excellent HA and DR and still fail at BC if its payment systems, customer communications, or compliance obligations depend on infrastructure that isn’t covered by the same resilience model.
The architecture principle here is eliminating provider-level single points of failure. Multi-cloud distribution across independent providers, combined with global load balancing that routes users to the nearest operational environment, means a systemic failure at one provider doesn’t become a systemic failure for your business. It’s redundancy at the business level, not just the infrastructure level.
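A rough sketch of that routing decision follows; the per-provider health endpoints are hypothetical, and a real deployment would make this decision in DNS or a global load balancer rather than in application code.

```python
# Sketch of provider-independent routing: probe each provider's entry point
# and send traffic to the nearest one that is actually answering. The
# endpoints below are hypothetical placeholders.
import time
import urllib.request

PROVIDER_ENDPOINTS = {
    "aws": "https://aws.app.example.com/healthz",
    "gcp": "https://gcp.app.example.com/healthz",
    "azure": "https://azure.app.example.com/healthz",
}


def probe(url: str, timeout: float = 0.5) -> float | None:
    """Return round-trip time in seconds, or None if the probe fails."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
        return time.monotonic() - started
    except Exception:
        return None


def nearest_operational_provider() -> str:
    """Pick the lowest-latency provider that passed its health probe.

    A systemic failure at one provider simply drops it from the candidate
    set; traffic keeps flowing to whichever providers remain healthy.
    """
    rtts = {name: probe(url) for name, url in PROVIDER_ENDPOINTS.items()}
    healthy = {name: rtt for name, rtt in rtts.items() if rtt is not None}
    if not healthy:
        raise RuntimeError("no provider is currently operational")
    return min(healthy, key=healthy.get)
```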
In December 2021, the municipality of Kalix, Sweden, was hit by a significant ransomware attack. Essential services were restored quickly, and operational disruption was limited. That wasn’t due to an unusually fast response team, but because their recovery mechanisms were decentralized and regularly tested. Systems could be restored without rebuilding infrastructure from scratch. Recovery was procedural, not heroic.
That principle scales: the organizations that fared best on October 20th weren’t the ones with the best incident runbooks. They were the ones whose architectures had already assumed that the failure would happen.
The Resilience Architecture Scorecard
Use this as a self-assessment. For each capability, ask honestly whether your stack has it today, not in the roadmap, not partially, but actually in production and tested.
High Availability
(the failure you absorb before anyone notices)
| Capability | What it means |
| --- | --- |
| Active-active multi-zone deployment | All nodes carry live traffic across zones simultaneously. No idle standbys. Failure is absorbed without a switchover. |
| No single point of failure in the critical path | Every component (compute, storage, DNS, load balancers) has been audited and has a redundant counterpart in a different failure domain. |
| Automated self-healing with no human dependency | Node failure triggers automatic replacement and traffic rerouting. Recovery doesn’t wait for an engineer (a minimal control-loop sketch follows this table). |
| Multi-region traffic distribution | A full regional failure results in automatic rerouting to another region. No manual intervention required. |
| Defined and tested availability SLOs | Uptime targets are architecturally enforced and tested — not aspirational numbers on a slide. |
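As referenced in the self-healing row above, the control loop below is a minimal sketch of “recovery doesn’t wait for an engineer.” The node inventory, reroute, and provisioning calls are stubs for what an orchestrator such as Kubernetes or an autoscaling group actually provides.

```python
# Hedged self-healing sketch: a reconciliation loop that reroutes traffic away
# from unhealthy nodes and asks for replacements, with no human in the loop.
# list_nodes, drain_and_reroute, and provision_replacement are stubs.
import time

DESIRED_NODES = 3


def list_nodes() -> list[dict]:
    """Current node inventory (stub). Each entry: {'name': str, 'healthy': bool}."""
    return [{"name": f"node-{i}", "healthy": True} for i in range(DESIRED_NODES)]


def drain_and_reroute(node_name: str) -> None:
    """Shift the failed node's traffic onto its healthy peers (stub)."""
    print(f"rerouting traffic away from {node_name}")


def provision_replacement() -> None:
    """Request a fresh node, respecting the failure-domain policy (stub)."""
    print("provisioning replacement node")


def reconcile_once() -> None:
    nodes = list_nodes()
    unhealthy = [n for n in nodes if not n["healthy"]]
    for node in unhealthy:
        # Recovery is triggered by the loop, not by a paged engineer.
        drain_and_reroute(node["name"])
    healthy_count = len(nodes) - len(unhealthy)
    for _ in range(DESIRED_NODES - healthy_count):
        provision_replacement()


if __name__ == "__main__":
    while True:
        reconcile_once()
        time.sleep(15)  # control-loop interval; real systems also react to events
```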
Disaster Recovery
(the failure HA couldn’t absorb)
| Capability | What it means |
| --- | --- |
| Continuous replication with a defined RPO | Data is mirrored in near real time. Your RPO is measured under load, not assumed from vendor documentation. |
| Immutable, versioned snapshots | Backups cannot be encrypted or deleted by ransomware. Multiple restore points exist. Recovery from a clean state is always possible. |
| Tested RTO, not estimated | Recovery time has been validated through an actual DR exercise in the last 12 months, not calculated theoretically. |
| Ransomware-aware recovery path | Before restoring, integrity is verified against a known-clean baseline. The restore process assumes recent backups may be compromised (a minimal verification sketch follows this table). |
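The verification step referenced in the last row might look like the sketch below: before any restore, each artifact in the candidate snapshot is hashed and compared against a manifest recorded from a known-clean state. The paths and manifest format are illustrative assumptions, and the baseline should only cover artifacts that are not expected to change between snapshots (binaries, configuration, schema definitions).

```python
# Hedged sketch of a ransomware-aware restore gate: refuse to restore a
# snapshot whose contents have drifted from a known-clean baseline.
# The manifest path, snapshot path, and manifest format are illustrative.
import hashlib
import json
from pathlib import Path

BASELINE_MANIFEST = Path("/restore/known_clean_manifest.json")  # {"relative/path": "sha256hex"}
SNAPSHOT_ROOT = Path("/restore/candidate_snapshot")


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drifted_files() -> list[str]:
    """Return files whose hashes no longer match the known-clean baseline."""
    baseline = json.loads(BASELINE_MANIFEST.read_text())
    suspect = []
    for relative, expected in baseline.items():
        candidate = SNAPSHOT_ROOT / relative
        if not candidate.exists() or sha256_of(candidate) != expected:
            suspect.append(relative)
    return suspect


if __name__ == "__main__":
    drifted = drifted_files()
    if drifted:
        # Do not restore automatically; fall back to an older immutable snapshot.
        raise SystemExit(f"integrity check failed for {len(drifted)} files: {drifted[:5]}")
    print("snapshot matches known-clean baseline; safe to proceed with restore")
```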
Business Continuity
(the organization stays operational, not just the infrastructure)
| Capability | What it means |
| --- | --- |
| Multi-cloud or provider-independent architecture | Critical workloads are not fully dependent on a single provider. A systemic AWS, Azure, or GCP failure does not bring the business down. |
| Predictive capacity management | Scaling decisions are made ahead of demand spikes, not in response to degradation. Capacity is ready before traffic surges. |
| Zero-trust security integrated into availability | Security is treated as an availability concern. A breach is a downtime event. Anomaly detection can isolate compromised nodes automatically. |
| Documented resilience evidence | Architecture diagrams, DR exercise logs, and recovery outcomes exist and are current. Auditors and insurers can verify resilience on request, and increasingly, cyber insurers are requiring it as a condition of coverage. |
A useful benchmark: if you can check every item in the HA column, you likely would have stayed online on October 20th. If you can check DR and BC as well, you’re in the category of organizations that treat outages as a normal operating condition, not an emergency.
How Control Plane’s Architecture Checks Every Box
Every capability in the scorecard above represents an architectural decision that takes time, expertise, and ongoing maintenance to get right. Active-active deployments need to be designed from the ground up; retrofitting them onto an existing architecture is expensive and error-prone. Immutable snapshots need to be isolated from the credentials that ransomware would compromise first. Multi-cloud failover needs to be tested under realistic conditions, not just documented in a runbook.
Most engineering teams build these capabilities incrementally, under pressure, after something has already gone wrong. Control Plane was built with all of them as defaults.
This held up in practice. During the October 20th AWS outage (15 hours, 113 services, one of the most disruptive cloud incidents in recent memory), not a single Control Plane customer experienced downtime. No incident bridge. No emergency scaling. No frantic runbook execution. The architecture absorbed it, the way it was designed to.
20% of our customers had AWS US-East-1 locations within their Global Virtual Cloud, Control Plane’s multi-region deployment environment. When the platform detected the outage, failover was automatic: workloads were routed to the nearest healthy region and were active and running again within 10 seconds.
HA is the baseline, not an upgrade. Control Plane’s Kubernetes foundation runs active-active across availability zones and regions from day one. No idle standby nodes waiting to cold-start during a failure. Every node carries live traffic, which means failure is absorbed continuously rather than triggering a recovery sequence. The October 20th scenario, a cascading failure through tightly coupled dependencies in a single region, is exactly what this architecture is designed to make irrelevant.
DR is built into how data moves, not bolted on after the fact. Writes are mirrored across locations in near real time, keeping RPO tight without requiring a separate replication layer to manage. Snapshots are immutable and versioned, so a ransomware event that compromises production has no path to the recovery points. Integrity checks are part of the restore process, not an afterthought that engineers add after realizing they’ve restored a clean copy of compromised data.
BC is an architectural property, not a contingency plan. Control Plane runs natively across AWS, Azure, and GCP. Global load balancing routes traffic to the nearest operational environment automatically. When one provider has a bad day, your users don’t notice. That’s not a failover feature; it’s the default behavior of the platform.
The result is that the scorecard above isn’t a gap analysis for Control Plane customers. It’s a description of what’s already running.
Architecting for the Inevitable
Outages are no longer exceptional events. They are a normal operating condition of modern, distributed systems, and the organizations that understand this are the ones building the most durable businesses.
The shift in mindset is subtle but consequential. Resilience used to be framed as insurance: money spent to protect against something unlikely. That framing made it easy to defer, easy to underinvest, easy to treat as someone else’s problem until it wasn’t. What October 20th made visible, again, is that the question was never whether an outage would happen. It was whether your architecture would notice.
The companies that stayed online that morning didn’t just avoid downtime. They absorbed market share from competitors who didn’t. Their customers experienced nothing. Their boards asked no difficult questions. Their engineering teams woke up to a normal Monday. While others were triaging cascading failures and drafting incident communications, they were already ahead.
That is what resilience looks like when it’s designed in rather than bolted on. Not a heroic recovery. Not a well-executed runbook. Just a system doing what it was built to do: running, quietly and continuously, while everything around it was on fire.
The organizations worth building are the ones that never make the outage headline. Not because they were lucky. Because they were ready.
Want to run the scorecard against your own architecture with a Control Plane engineer? We’ll show you where the gaps are and what it takes to close them.