We unplugged a data center to test our disaster readiness (dropbox.tech)
259 points by ianrahman on April 26, 2022 | hide | past | favorite | 57 comments


I forget the name, but I interviewed at a medical technology company a while back. They said they failed over a data center every single day and left it offline for a while. They needed their redundancy to be extremely reliable, so failing over daily ensured everything stayed engineered robustly.


It'd be funny if they ended up with bugs in their systems that only show up if the systems aren't turned off often enough.


This reminds me of the time I "temporarily" fixed a memory leak in our servers by writing a cron job that rebooted them every 20 minutes. And then I forgot about it.
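For the curious, a "temporary" fix like that is a one-liner. A hypothetical /etc/crontab entry (schedule and paths assumed, not from the original story) could look like:

```shell
# Hypothetical /etc/crontab entry: reboot every 20 minutes.
# Fields: minute hour day-of-month month day-of-week user command
*/20 * * * * root /sbin/shutdown -r now
```

The danger is exactly as described: nothing about the entry announces itself later, so it's easy to forget it exists.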


Ah, memories. We have a web service in IIS which is used to generate some HTML and take a screen grab of it, producing images of some data for display.

The library was created for .NET Framework 2.0 and is no longer available or maintained. It's been on the replacement list for 8 years…

It has a memory leak which causes IIS to consume gigs of memory over time, eventually forcing IIS to restart itself completely and taking all hosted sites offline for 30s. So I set the IIS app pool to recycle every hour, and it hasn't caused any issues in ~5 years. Because of that, I feel like it's dropped even lower in priority. :(
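For reference, an hourly recycle like that can be set from the command line. A sketch using IIS's appcmd (the pool name "MyAppPool" is a placeholder, and the exact property path is from memory, so verify against your IIS version):

```shell
%windir%\system32\inetsrv\appcmd.exe set apppool "MyAppPool" ^
    /recycling.periodicRestart.time:01:00:00
```

Setting it via config rather than a scheduled task means IIS can overlap the old and new worker processes, which is why the recycle is invisible to users.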


That's the dirty pragmatic side of engineering. Technically it is fixed.


Hah, yes, I've unironically used https://github.com/kzk/unicorn-worker-killer in production before. ~900 stars on Github...
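For anyone unfamiliar, that gem restarts unicorn workers once they cross a request-count or memory threshold. Usage along the lines of its README (the thresholds here are illustrative) goes in config.ru:

```ruby
require 'unicorn/worker_killer'

# Restart a worker after it serves between 3072 and 4096 requests
# (randomized per worker so they don't all die at once).
use Unicorn::WorkerKiller::MaxRequests, 3072, 4096

# Restart a worker once its RSS crosses roughly 192-256 MB.
use Unicorn::WorkerKiller::Oom, (192 * (1024**2)), (256 * (1024**2))
```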


This is very likely. I worked for a company where we started deploying code less frequently and discovered a few uptime-related issues.


I've seen this on services that do CI/CD, but institute a freeze for peak traffic events. In theory, the freeze serves the purpose of increasing capacity, since all services are serving traffic rather than updating software, and increasing reliability, since software isn't changing. However, it can then expose latent memory leaks that weren't visible when the services were cycling all the time.
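A toy model (all numbers hypothetical) of why a deploy freeze surfaces leaks: frequent restarts cap how long a leak can grow, while a freeze lets it compound for the whole window:

```python
def peak_memory_mb(leak_mb_per_hour, window_hours, restart_every_hours=None):
    """Peak memory of a leaky service over a window, with optional scheduled restarts."""
    base = 100  # hypothetical baseline footprint in MB
    if restart_every_hours is None:
        growth_hours = window_hours
    else:
        growth_hours = min(window_hours, restart_every_hours)
    return base + leak_mb_per_hour * growth_hours

# Daily deploys cap a 10 MB/h leak at one day's growth...
assert peak_memory_mb(10, 24 * 14, restart_every_hours=24) == 340
# ...but a two-week freeze lets the same leak grow unchecked.
assert peak_memory_mb(10, 24 * 14) == 3460
```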


Heroku used to cycle every dyno approximately once a day (for all I know, it still does…). Don’t ask me how I know my apps were free of memory leaks.


They still do that.



It's imperfect code, but I actually wish full reboots were more common for critical tech like this.

How many times have we rebooted our machines this week to fix some weird issue? (or maybe my OS install is buggy, but still :P)


Lemme check:

  $ uptime
  19:14 up 42 days, 6:49, 11 users, load averages: 7.85 8.24 8.07

Zero times this week (or month).


That would probably be a good thing, means you can't break procedure easily!


I know this can be a good idea because it helps you catch stuff, but man, how could you convince patients you're a reliable company if you did lose something by deliberately turning things off and on again? I wonder if you could argue it's negligence. Doing it every day seems like inviting a mistake at some point!


Doesn't routine make you better at things?

(And since everybody is now going to share war stories... When I worked at Con-way, now XPO Logistics, they had the pop-tart incident, where a kitchen mishap triggered fire regulations that mandated the data center be taken down, and since they had absolutely *zero* routine in restarting their data center, folks got to debug the backup plans in real-time, until two weeks later all service had been restored. Kind of. Me suggesting we should test this more often was only perceived as a joke in bad taste. I thought it was a swell idea, since our Sundays were scheduled downtime anyway. Yep, the whole day. Every one or two weeks. Suffice it to say, it wasn't a very long stint.)


People improve with routine when the environment is static, or within certain ranges. Routine degrades performance when previous assumptions shift (or are changed by dynamic agents).


The gamble is to have small, contained issues you can deal with in a timely manner, vs. full-scale propagated failures you'd have to deal with at the worst time ever.

It's like accidents during fire drills: they happen, yet it's worth doing, all things considered.


That's the whole point. Every feature they build absolutely has to take failure into account so they do. Whereas if you only failed in emergencies, you would find stuff that didn't handle the failure well.


Sure, I get this in principle, but if you keep rebooting something, can you instigate errors that are normally very unlikely to occur, just by increasing the probability? Feels like you're going to trip an edge case, maybe even lower in the hardware stack than your software, if you do this so often.


I mean, restarting hardware once a day isn't going to hurt the hardware (indeed, it's the norm in some places to shut down many racks after business hours), and software will probably only see fewer bugs.


You could say the same thing about never testing your DR/backup capabilities. That's way more negligent.


You never want to use the parachute, but should always know it works.


Does their redundancy extend to two failures at the same time? If not, then having one intentional failure per day is risky.


Oh come on, what are the odds a second system fails right when the first one is taken offline? /s


100% :-)


This is pretty much my infrastructure ideology now: everything, everything within my control, has a well-defined, short lifetime (ideally less than a day).


I organised one of these as a junior dev at my first job! Glad to see I was ahead of the curve. cough.


Similar to Facebook's Storm initiative: https://www.forbes.com/sites/roberthof/2016/09/11/interview-...

These exercises happen several times a year.


I used to be heavily involved in those, it was a process where we’d take a week to prepare for it, do weeks of post-mortems, and print a run of t-shirts for everyone involved to celebrate pulling one off successfully.

These days the team running them announces that it’s happening in an opt-in announcement group at 8am, pulls the plug at 9am, and barely anyone even notices because the automation handles it so gracefully.

Mostly I just miss the t-shirts, as the <datacenter>-storm events got the coolest graphical designs...


Oh hey there! Can confirm that Storms got very very smooth over the course of a few years - it was really incredible to see it go from something that we planned on our roadmap to getting into work and realizing an entire datacenter was disconnected from the backbone while everything ran mostly smooth.


Google does this yearly, with different scenarios - it is called DiRT and you can read a little here: https://cloud.google.com/blog/products/management-tools/shri...


We do this yearly.

The first time we did it was painful. It took all day, probably around 40 people involved total. We found all sorts of problems, and we actually only failed over the production site, nothing else.

We'll be doing it for I think the fifth time in a few months. Mostly automated now, last year was pretty smooth, aside from a couple bad assumptions that crept back in to some code.

It is like anything else, repetition makes perfect.


> Not Found

> The requested URL /infrastructure/disaster-readiness-test-failover-blackhole-sjc was not found on this server.

> Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

I see which data center was hosting the article, then.

Internet Archive to the rescue: https://web.archive.org/web/20220426191128/https://dropbox.t...


How big is metaserver these days?

I might run three deployments at each data center: the primary, and secondaries for two other regions. Replicate between them at the block-device level, bypassing the MySQL replication situation entirely (except for on-disk consistency requirements, of course).

Of course this comes with a 3x increase in service infrastructure costs because of the two backups in each data center that are idle waiting for load.


And much higher write latency, assuming write consistency is important to you.


Ah metaserver! I wonder how much of my code is still there (and if it is still Py2). I read the Atlas blog post, which implied that metaserver was rearchitected, but still 3M SLOC of Python.

https://dropbox.tech/infrastructure/atlas--our-journey-from-...


> This complex ownership model made it hard to move just a subset of user data to another region.

Welcome to basically any large-scale enterprise.

I have come to learn that the active-passive strategy is the best option if the business can tolerate the necessary human delays. You get to sidestep so much complicated bullshit with this path.

Trying to force active-active instant failover magic is sometimes not possible, profitable or even desirable. I can come up with a few scenarios where I would absolutely insist that a human go and check a few control points before letting another exact copy of the same system start up on its own, even if it would be possible in theory to have automatic failover work reliably 99.999% of the time.


My fear with any sort of passive standby approach is that when the disaster comes, that standby won't work, or the mechanism used to fail over to it won't work. I prefer schemes where the "failover" is happening all the time hence I can be confident it works.


A solid active/standby design should regularly flop. Every 2 weeks seems to be a sweet spot. This also balances wear across consumable hardware like disks.

If your failover is happening "all the time" you basically just have a single system with failures.


Except for this:

> However, because of that choice, replication between regions is asynchronous—meaning the remote replicas are always some number of transactions behind the primary region.

Switching the active is a data loss event.

Or, you can choose to drain the load and wait for the replication to catch up, but that's downtime and not a test of the real failure mechanism.
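A sketch of that trade-off (the numbers are invented): with async replication, the replica trails the primary, so an unplanned switch drops the unreplicated tail, while draining first trades that loss for downtime:

```python
def lost_on_failover(primary_committed, replica_applied):
    """Transactions acknowledged by the primary but never applied on the replica."""
    return primary_committed - replica_applied

# Yank the cable mid-flight: the replication lag becomes lost data.
assert lost_on_failover(1_000_000, 999_850) == 150

# Drain traffic and let replication catch up first: no loss, but downtime.
assert lost_on_failover(1_000_000, 1_000_000) == 0
```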


Having a passive strategy doesn't mean you don't test it, and you can even perform actual failovers once in a while to validate everything.

Active-active is also valid, but the point is that it comes with a huge amount of increased complexity. At some point you need to make a value calculation to decide if you want to focus on that, or on building the product.


>Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline.

>Given this context, we structured our RTO—and more broadly, our disaster readiness plans—around imminent failures where our primary region is still up, but may not be for long.

IMO if the big one hits San Andreas, the SJC facilities will likely go down with ~0 warning. Certainly not enough time to drain user traffic.

It's interesting to note that Dropbox realistically can probably tolerate the loss of a few seconds to minutes of user data in a major earthquake, but cannot tolerate the same losses to perform realistic tests (just yank the cable, no warning).

If the earthquake hits at 3am in SF, it'll likely take both the metro and a significant number of the DR team out of the picture for at least a period of time. Surviving that kind of blow in the short term with 0 downtime is a very hard goal.


A random (large) earthquake along the San Andreas fault is not the same as all of western USA ripping apart. A much more likely scenario is that the power grid goes down and the data center stays reasonably intact for a while on emergency power.


I admire the forward thinking and application of common sense here. There's literally no better way of testing the system than this. It seems that a lot of big tech companies would never have the balls to do this themselves.


Netflix has Chaos Gorilla, Facebook has Storms(?), Google has DiRT. Everyone does this type of testing.


That is different to literally turning a data center off


I am curious why you think that isn't the case. I've done it multiple times - simulating mains power failures, total power failure, split brains, and fiber cuts to the building.


They seem to say that there is exactly one optical fiber going to exactly one device, which constitutes the entire connection for each of their data centers.

Is that correct? That seems very single-point-of-failureish to me.


I think it's just a simplification. For the purposes of the article, it doesn't matter if it's 1 or 100. They completely unplugged the network as a whole.

Later in the article they do say 'began to reconnect the network fiber', which implies multiple connections to me. They also took reference photos and ordered backup hardware. So I definitely don't think it was a single connection.


I wonder where that cable comes from and who maintains it.


This read to me as much like a history of technical debt as an article about current efforts.


DiRT


Take that, ChaosMonkey! This is King Kong


Chaos Gorilla has already existed for over a decade


Given that the blog doesn't load for me, I guess the datacenter remains unplugged?


gotta prepare for the climate crisis, as Douglas Rushkoff discovered, rich capitalists "are plotting to leave us behind"

https://onezero.medium.com/survival-of-the-richest-9ef6cddd0...

unpaywalled: https://archive.ph/AABsP



