> work has already begun on how we will harden them against failures like this i...

nikcub · 2025-11-19T00:21:48 1763511708

They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a since instance first would have seen the panic and stopped the deploy.

I wonder why clickhouse is used to store the feature flags here, as it has it's own duplication footguns[0] which could have also easily lead to a query blowing up 2/3x in size. oltp/sqlite seems more suited, but i'm sure they have their reasons

[0] https://clickhouse.com/docs/guides/developer/deduplication

HumanOstrich · 2025-11-19T00:48:11 1763513291

I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.

Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.

hedora · 2025-11-19T03:02:32 1763521352

I think the idea is to ship the sqlite database around.

It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.

HumanOstrich · 2025-11-19T03:16:54 1763522214

I think you're oversimplifying the problem they had, and I would encourage you to dive in to the details in the article. There wasn't a problem with the database, it was with the query used to generate the configs. So if an analogous issue arose with a query against one of many ad-hoc replicated sqlite databases, you'd still have the failure.

I love sqlite for some things, but it's not The One True Database Solution.

Scaevolus · 2025-11-19T00:40:18 1763512818

Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.

In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.

Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.

mewpmewp2 · 2025-11-19T00:22:18 1763511738

It seems they had this continous rollout for the config service, but the services consuming this were affected even by small percentage of these config providers being faulty, since they were auto updating every few minutes their configs. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.

otterley · 2025-11-19T00:28:04 1763512084

It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.

tptacek · 2025-11-19T00:40:21 1763512821

This isn't really "configuration" so much as it is "durable state" within the context of this system.

otterley · 2025-11-19T00:46:37 1763513197

In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.

People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.

HumanOstrich · 2025-11-19T00:53:44 1763513624

They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it.

otterley · 2025-11-19T00:58:49 1763513929

A dormant bug in the code is usually a condition precedent to incidents like these. Later, when a bad input is given, the bug then surfaces. The bug could have laid dormant for years or decades, if it ever surfaced at all.

The point here remains: consider every change to involve risk, and architect defensively.

tptacek · 2025-11-19T01:04:12 1763514252

They made the classic distributed systems mistake and actually did something. Never leap to thing-doing!

otterley · 2025-11-19T01:10:12 1763514612

If they're going to yeet configs into production, they ought to at least have plenty of mitigation mechanisms, including canary deployments and fault isolation boundaries. This was my primary point at the root of this thread.

And I hope fly.io has these mechanisms as well :-)

tptacek · 2025-11-19T01:13:32 1763514812

We've written at long, tedious length about how hard this problem is.

otterley · 2025-11-19T01:14:32 1763514872

Have a link?

tptacek · 2025-11-19T01:15:20 1763514920

Most recently, a few weeks ago (but you'll find more just a page or two into the blog):

https://fly.io/blog/corrosion/

otterley · 2025-11-19T01:17:37 1763515057

It's great that you're working on regionalization. Yes, it is hard, but 100x harder if you don't start with cellular design in mind. And as I said in the root of the thread, this is a sign that CloudFlare needs to invest in it just like you have been.

tptacek · 2025-11-19T01:22:26 1763515346

I recoil from that last statement not because I have a rooting interest in Cloudflare but because the last several years of working at Fly.io have drilled Richard Cook's "How Complex Systems Fail"† deep into my brain, and what you said runs aground of Cook #18: Failure free operations require experience with failure.

If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.

† https://how.complexsystems.fail/

otterley · 2025-11-19T01:30:29 1763515829

Suppose they did have the cellular architecture today, but every other fact was identical. They'd still have suffered the failure! But it would have been contained, and the damage would have been far less.

Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).

Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.

As Cook points out, "Safety is a characteristic of systems and not of their components."

HumanOstrich · 2025-11-19T02:05:20 1763517920

What variant of cellular architecture are you referring to? Can you give me a link or few? I'm fascinated by it and I've led a team to break up a monolithic solution running on AWS to a cellular architecture. The results were good, but not magic. The process of learning from failures did not stop, but it did change (for the better).

No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.

otterley · 2025-11-19T02:10:55 1763518255

If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required. Did your service ever fail in multiple regions simultaneously?

Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:

https://docs.aws.amazon.com/wellarchitected/latest/reducing-...

HumanOstrich · 2025-11-19T02:21:56 1763518916

You didn't really put any thought into what I said. Thanks for the links.

otterley · 2025-11-19T02:40:17 1763520017

It wasn't worth thinking about. I'm not going to defend myself against arguments and absolute claims I didn't make. The key word here is mitigation, not perfection.

hedora · 2025-11-19T03:09:11 1763521751

> If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required

Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe whatever you are proposing solves that exact problem by relying on multi-regions.

Come to think of it, Cloudflare’s outage today is another good counterexample.

otterley · 2025-11-19T03:53:21 1763524401

It has been a very, very long time since AWS had a simultaneous failure across multiple regions. Even customers impacted by the loss of Route 53 control plane functionality in last month’s us-east-1 were able to gracefully fail over to a backup region if they configured failover records in advance, had Application Recovery Controller set up, or fronted their APIs or websites with Global Accelerator.

Customers survive incidents on a daily basis by failing over across regions (even in the absence of an AWS regional failure, they can fail due to a bad deployment or other cause). The reason you don’t hear about it is because it works.

tptacek · 2025-11-19T01:34:13 1763516053

Pretty sure he's making my point (or, rather, me his) there. (I'm never going to turn down an opportunity to nerd out about Cookism).

asa400 · 2025-11-19T16:32:19 1763569939

Thank you for saying it. I’m getting exasperated at how many people in the comments are making some variant of the “lazy programmer wrote code that took a shortcut” argument.

Complex system failures are not monocausal! Complex systems are in a continuous state of partial failure!

Ekaros · 2025-11-19T06:14:12 1763532852

Sounds like lack of good testing. Too many items in any input should be a boundary case you will get to eventually.

tptacek · 2025-11-19T01:06:01 1763514361

Reframe this problem: instead of bot rules being propagated, it's the enrollment of a new customer or a service at an existing customer --- something that must happen at Cloudflare several times a second. Does it still make sense to you to think about that in terms of "pushing new configuration to prod"?

otterley · 2025-11-19T01:11:32 1763514692

Those aren't the facts before us. Also, CRUD operations relating to a specific customer or user tend not to cause the sort of widespread incidents we saw today.

tptacek · 2025-11-19T01:13:07 1763514787

They're not, they're a response to your claim that "state" and "configuration" are indifferentiable.

Yokohiii · 2025-11-20T10:44:11 1763635451

I think global kill switches are just an last resort machanism, to bypass identified faulty subsystems. Even if there is a risk with it, in this instance the risk was zero, because CF was dead already. This wont change the blast radius, but it's duration and proliferation.

In reference to fault isolation boundaries: I am not familiar with their CI/CD, in theory the error could have been caught/prevented there, but that comes with a lot of depends or it's tricky. But it looks like they didn't go the extra mile to care about safety sensitive areas. So euphemistic speaking, they are now recalibrating balance of safety measures.

ants_everywhere · 2025-11-19T01:49:17 1763516957

it's always a config push. people rollout code slowly but don't have the same mechanisms for configs. But configs are code, and this is a blind spot that causes an outsized percentage of these big outages.

Buttons840 · 2025-11-19T08:42:39 1763541759

When a failsafe system fails, it fails by failing to fail safely.