Cloudflare's 'Fail Small' Initiative: A Stronger, More Resilient Network

Introduction

After more than two quarters of intensive engineering work, Cloudflare has completed a major initiative internally known as "Code Orange: Fail Small". This project focused on making the company's infrastructure more resilient, secure, and reliable for every customer. While improving resiliency is never truly finished—it remains a top priority throughout the development lifecycle—the team has now shipped the improvements that would have prevented the global outages on November 18, 2025 and December 5, 2025. The work spanned several critical areas: safer configuration changes, reducing the impact of failure, overhauling "break glass" procedures, refining incident management, preventing drift and regressions over time, and strengthening customer communications during outages. Below, we explain in depth what was shipped and what it means for you.

Source: blog.cloudflare.com

Safer Configuration Changes

The most significant change involves how internal configuration updates are deployed. Instead of pushing changes across the entire network instantly, Cloudflare now uses a progressive rollout with real-time health monitoring. This allows observability tools to catch potential problems and automatically revert changes before they can affect your traffic. To achieve this, the team identified high-risk configuration pipelines and built new tools to manage those changes more effectively. For all products that run on the network and process customer traffic, configuration deployments now follow a "health-mediated deployment" methodology—the same approach already used for software releases. This applies not only to the teams directly involved in the November and December incidents but across the board.
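The general shape of a health-mediated rollout can be sketched in a few lines of Python. This is a minimal illustration, not Cloudflare's implementation: the function name `progressive_rollout`, the stage layout, and the `apply`, `revert`, and `is_healthy` callbacks are all assumptions made for the example.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Apply a change one stage at a time, gated on health checks.

    Hypothetical sketch: each stage is a list of target names (for example
    a canary, then a few regions, then everything else). If any target in
    the current stage reports unhealthy, every target touched so far is
    reverted in reverse order and the rollout stops.
    """
    applied = []
    for stage in stages:
        for target in stage:
            apply(target)
            applied.append(target)
        # Health gate: do not widen the rollout unless this stage is healthy.
        if not all(is_healthy(target) for target in stage):
            for target in reversed(applied):
                revert(target)
            return False
    return True


if __name__ == "__main__":
    deployed = {}
    stages = [["canary"], ["region-a", "region-b"], ["global"]]
    ok = progressive_rollout(
        stages,
        apply=lambda t: deployed.__setitem__(t, "v2"),
        revert=lambda t: deployed.pop(t, None),
        is_healthy=lambda t: t != "region-b",  # simulate one bad region
    )
    print("rollout succeeded:", ok)
    print("targets still on v2:", sorted(deployed))
```

In this sketch the simulated failure in `region-b` stops the rollout before the `global` stage is ever touched, which is the property that keeps a bad change small.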

The Snapstone Component

Central to this new process is an internal component called Snapstone. Snapstone bundles configuration changes into discrete packages and then releases them gradually, applying health mediation principles at every step. Before Snapstone, using health-mediated deployment for configuration was technically possible but required significant per-team effort and was not consistently applied across the network. Snapstone closes that gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default. Snapstone's main strength is its flexibility: rather than being a point fix for specific past failures, it lets teams dynamically define any unit of configuration that needs health mediation, whether a data file (like the one that caused the November 18 outage) or a control flag in the global configuration system (like the one involved in the December 5 outage). Teams can create these configuration units on demand, which extends the same protection to new kinds of configuration as they appear.
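Snapstone's interface is internal and not described in this post, so the following Python sketch only illustrates the idea of packaging arbitrary configuration units into a discrete, versioned bundle. Every name here (`ConfigUnit`, `ConfigPackage`, `bundle`, and the example unit names) is hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigUnit:
    """Hypothetical unit of configuration that should be health-mediated."""
    name: str
    kind: str      # free-form label, e.g. "data_file" or "control_flag"
    payload: dict  # the configuration content itself


@dataclass(frozen=True)
class ConfigPackage:
    """Hypothetical discrete package that is released as one change."""
    version: str
    units: tuple


def bundle(units):
    """Bundle units into a package whose version is derived from its content.

    Sorting by name and serializing with sort_keys makes the version
    independent of the order in which units were supplied.
    """
    ordered = sorted(units, key=lambda u: u.name)
    canonical = json.dumps(
        [{"name": u.name, "kind": u.kind, "payload": u.payload} for u in ordered],
        sort_keys=True,
    )
    version = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return ConfigPackage(version=version, units=tuple(ordered))


if __name__ == "__main__":
    feature_file = ConfigUnit("example-feature-file", "data_file", {"rows": 60})
    kill_switch = ConfigUnit("example-kill-switch", "control_flag", {"enabled": False})
    package = bundle([feature_file, kill_switch])
    print(package.version, [u.name for u in package.units])
```

A content-derived version is one plausible design choice: identical configuration always yields the same package identifier, which makes it easy to tell whether a rollback restored the exact previous state.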

Reducing the Impact of Failure

Another key area was reducing the blast radius of any single failure. Cloudflare has introduced architectural changes that isolate failures to smaller parts of the network, preventing a minor issue from cascading into a global outage. This includes segmenting configuration pipelines so that a problem in one product area does not affect unrelated services. Additionally, new circuit breakers and rate limiters have been added to critical services, ensuring that if something does go wrong, the impact remains contained.
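For readers unfamiliar with the pattern, a circuit breaker can be sketched as follows. This is a generic textbook version in Python, not the mechanism Cloudflare deployed; the class name, thresholds, and injectable clock are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, for illustration only).

    closed    -> calls pass through; consecutive failures are counted
    open      -> calls are rejected immediately, protecting the dependency
    half-open -> after `reset_after` seconds one trial call is allowed;
                 success closes the circuit, failure re-opens it
    """

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the breaker to the closed state.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping calls to a dependency in `breaker.call(...)` means that once the dependency starts failing, callers fail fast instead of piling up retries, which is one way a local fault is kept from spreading.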

Revised Break Glass and Incident Management

The team also completely revised their "break glass" procedures—the emergency protocols used when normal safety checks must be bypassed to restore service quickly. The new procedures require tighter approval chains and automated logging, reducing the chance of human error during a crisis. Incident management has been improved with clearer escalation paths, better communication tools, and post‑mortem processes that focus on systemic fixes rather than blaming individuals.

Preventing Drift and Regression

To ensure that improvements last, Cloudflare introduced measures to prevent drift—where configurations slowly deviate from intended settings—and regressions over time. Automated verification checks now run continuously against the network's configuration, flagging any unauthorized changes. Regular audits and training sessions reinforce the new best practices, making resilience a permanent part of the engineering culture.
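At its simplest, a drift check compares the intended configuration with what is actually running. The Python sketch below assumes both can be represented as flat dictionaries; the function name `detect_drift` and the example keys are hypothetical and not taken from Cloudflare's tooling.

```python
def detect_drift(intended: dict, actual: dict) -> dict:
    """Compare intended settings with observed settings.

    Hypothetical sketch: returns the keys that are missing from the
    running configuration, present but never declared, or present with a
    different value. Empty lists everywhere means no drift.
    """
    shared = intended.keys() & actual.keys()
    return {
        "missing": sorted(intended.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - intended.keys()),
        "changed": sorted(k for k in shared if intended[k] != actual[k]),
    }


if __name__ == "__main__":
    intended = {"rollout_mode": "staged", "max_file_rows": 200, "auto_rollback": True}
    actual = {"rollout_mode": "global", "max_file_rows": 200, "debug_bypass": True}
    print(detect_drift(intended, actual))
```

Run on a schedule, a check like this turns silent divergence into an explicit report that can be alerted on or reconciled automatically.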

Enhanced Customer Communication

Finally, Cloudflare has strengthened how it communicates with customers during outages. Real‑time status dashboards now provide more granular information about the scope and expected resolution time of incidents. After an outage, detailed post‑mortem reports are published more quickly, with clear explanations of root causes and the specific engineering changes that will prevent recurrence. This transparency helps customers understand exactly what happened and what steps have been taken to protect their traffic.

Conclusion

The completion of "Code Orange: Fail Small" marks a major milestone in Cloudflare's journey toward a more reliable network. By investing in smarter deployment processes like Snapstone, reducing failure impact, revamping emergency procedures, preventing drift, and improving communication, the team has built a stronger foundation for future growth. While the work will never be done—resilience is an ongoing effort—the network is now far better equipped to handle the unexpected, giving customers greater confidence in the services they rely on every day.