We unplugged a data center to test our disaster readiness (dropbox.tech)
259 points by ianrahman on April 26, 2022 | hide | past | favorite | 57 comments


I forget the name, but I interviewed at a medical technology company a while back. They said they failed over a data center every single day and left it offline for a while. They needed their redundancy to be extremely reliable, so failing over daily ensured everything stayed engineered robustly.


It'd be funny if they ended up with bugs in their systems that only show up if the systems aren't turned off often enough.


This reminds me of the time I "temporarily" fixed a memory leak in our servers by writing a cron job that rebooted them every 20 minutes. And then I forgot about it.
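For the curious, a "temporary" fix like that is a one-liner. A hypothetical /etc/crontab entry (schedule and paths assumed, not from the original story) could look like:

```shell
# Hypothetical /etc/crontab entry: reboot every 20 minutes.
# Fields: minute hour day-of-month month day-of-week user command
*/20 * * * * root /sbin/shutdown -r now
```

The danger is exactly as described: nothing about the entry announces itself later, so it's easy to forget it exists.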


Ah, memories. We have a web service in IIS which is used to generate some HTML and take a screen grab of it, producing images of some data for display.

The library was created for .NET Framework 2.0 and is no longer available or maintained. It's been on the replacement list for 8 years…

It has a memory leak which causes IIS to consume gigs of memory over time, eventually forcing IIS to restart itself completely and taking all hosted sites offline for 30s. So I set the IIS app pool to recycle every hour, and it hasn't caused any issues in ~5 years. Because of that, I feel like it's dropped even lower in priority. :(
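For reference, an hourly recycle like that can be set from the command line. A sketch using IIS's appcmd (the pool name "MyAppPool" is a placeholder, and the exact property path is from memory, so verify against your IIS version):

```shell
%windir%\system32\inetsrv\appcmd.exe set apppool "MyAppPool" ^
    /recycling.periodicRestart.time:01:00:00
```

Setting it via config rather than a scheduled task means IIS can overlap the old and new worker processes, which is why the recycle is invisible to users.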


That's the dirty pragmatic side of engineering. Technically it is fixed.


Hah, yes, I've unironically used https://github.com/kzk/unicorn-worker-killer in production before. ~900 stars on Github...
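For anyone unfamiliar, that gem restarts unicorn workers once they cross a request-count or memory threshold. Usage along the lines of its README (the thresholds here are illustrative) goes in config.ru:

```ruby
require 'unicorn/worker_killer'

# Restart a worker after it serves between 3072 and 4096 requests
# (randomized per worker so they don't all die at once).
use Unicorn::WorkerKiller::MaxRequests, 3072, 4096

# Restart a worker once its RSS crosses roughly 192-256 MB.
use Unicorn::WorkerKiller::Oom, (192 * (1024**2)), (256 * (1024**2))
```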


This is very likely. I worked for a company where we started deploying code less frequently and discovered a few uptime-related issues.


I've seen this on services that do CI/CD, but institute a freeze for peak traffic events. In theory, the freeze serves the purpose of increasing capacity, since all services are serving traffic rather than updating software, and increasing reliability, since software isn't changing. However, it can then expose latent memory leaks that weren't visible when the services were cycling all the time.
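A toy model (all numbers hypothetical) of why a deploy freeze surfaces leaks: frequent restarts cap how long a leak can grow, while a freeze lets it compound for the whole window:

```python
def peak_memory_mb(leak_mb_per_hour, window_hours, restart_every_hours=None):
    """Peak memory of a leaky service over a window, with optional scheduled restarts."""
    base = 100  # hypothetical baseline footprint in MB
    if restart_every_hours is None:
        growth_hours = window_hours
    else:
        growth_hours = min(window_hours, restart_every_hours)
    return base + leak_mb_per_hour * growth_hours

# Daily deploys cap a 10 MB/h leak at one day's growth...
assert peak_memory_mb(10, 24 * 14, restart_every_hours=24) == 340
# ...but a two-week freeze lets the same leak grow unchecked.
assert peak_memory_mb(10, 24 * 14) == 3460
```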


Heroku used to cycle every dyno approximately once a day (for all I know, it still does…). Don’t ask me how I know my apps were free of memory leaks.


They still do that.



It's imperfect code, but I actually wish full reboots were more common for critical tech like this.

How many times have we rebooted our machines this week to fix some weird issue? (or maybe my OS install is buggy, but still :P)


Lemme check:

  $ uptime
  19:14 up 42 days, 6:49, 11 users, load averages: 7.85 8.24 8.07

Zero times this week (or month).


That would probably be a good thing, means you can't break procedure easily!


I know this can be a good idea because it helps you catch stuff, but man, how could you convince patients you're a reliable company if you did lose something by deliberately turning things off and on again? I wonder if you could argue it's negligence. Doing it every day seems like inviting a mistake at some point!


Doesn't routine make you better at things?

(And since everybody is now going to share war stories... When I worked at Con-way, now XPO Logistics, they had the pop-tart incident, where a kitchen mishap triggered fire regulations that mandated the data center be taken down, and since they had absolutely *zero* routine in restarting their data center, folks got to debug the backup plans in real-time, until two weeks later all service had been restored. Kind of. Me suggesting we should test this more often was only perceived as a joke in bad taste. I thought it was a swell idea, since our Sundays were scheduled downtime anyway. Yep, the whole day. Every one or two weeks. Suffice it to say, it wasn't a very long stint.)


People improve with routine when the environment is static, or within certain ranges. Routine degrades performance when previous assumptions shift (or are changed by dynamic agents).


The gamble is to have small, contained issues you can deal with in a timely manner, vs. full-scale propagated failures you'd have to deal with at the worst time ever.

It's like accidents during fire drills: they happen, yet it's worth doing, all things considered.


That's the whole point. Every feature they build absolutely has to take failure into account so they do. Whereas if you only failed in emergencies, you would find stuff that didn't handle the failure well.


Sure, I get this in principle, but if you keep rebooting something, can you instigate errors that are normally very unlikely to occur, just by increasing the probability? Feels like you're going to trip an edge case, maybe even lower in the hardware stack than your software, if you do this so often.


I mean, restarting hardware once a day isn't going to hurt the hardware (indeed, it's the norm in some places to shut down many racks after business hours), and software will probably only see fewer bugs.


You could say the same thing about never testing your DR/backup capabilities. That's way more negligent.


You never want to use the parachute, but should always know it works.


Does their redundancy extend to two failures at the same time? If not, then having one intentional failure per day is risky.


Oh come on, what are the odds a second system fails right when the first one is taken offline? /s


100% :-)


This is pretty much my infrastructure ideology now: everything, everything within my control, has a well-defined, short lifetime (ideally less than a day).


I organised one of these as a junior dev at my first job! Glad to see I was ahead of the curve. cough.


Similar to Facebook's Storm initiative: https://www.forbes.com/sites/roberthof/2016/09/11/interview-...

These exercises happen several times a year.


I used to be heavily involved in those, it was a process where we’d take a week to prepare for it, do weeks of post-mortems, and print a run of t-shirts for everyone involved to celebrate pulling one off successfully.

These days the team running them announces that it’s happening in an opt-in announcement group at 8am, pulls the plug at 9am, and barely anyone even notices because the automation handles it so gracefully.

Mostly I just miss the t-shirts, as the <datacenter>-storm events got the coolest graphical designs...


Oh hey there! Can confirm that Storms got very very smooth over the course of a few years - it was really incredible to see it go from something that we planned on our roadmap to getting into work and realizing an entire datacenter was disconnected from the backbone while everything ran mostly smooth.


Google does this yearly, with different scenarios - it is called DiRT and you can read a little here: https://cloud.google.com/blog/products/management-tools/shri...


We do this yearly.

The first time we did it was painful. It took all day, probably around 40 people involved total. We found all sorts of problems, and we actually only failed over the production site, nothing else.

We'll be doing it for I think the fifth time in a few months. Mostly automated now, last year was pretty smooth, aside from a couple bad assumptions that crept back in to some code.

It is like anything else, repetition makes perfect.


> Not Found

> The requested URL /infrastructure/disaster-readiness-test-failover-blackhole-sjc was not found on this server.

> Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

I see which data center was hosting the article, then.

Internet Archive to the rescue: https://web.archive.org/web/20220426191128/https://dropbox.t...


How big is metaserver these days?

I might run three deployments at each data center: the primary, and secondaries for two other regions. Replicate between them at the block-device level, bypassing the MySQL replication situation entirely (except for on-disk consistency requirements, of course).

Of course this comes with a 3x increase in service infrastructure costs because of the two backups in each data center that are idle waiting for load.


And much higher write latency, assuming write consistency is important to you.


Ah metaserver! I wonder how much of my code is still there (and if it is still Py2). I read the Atlas blog post, which implied that metaserver was rearchitected, but still 3M SLOC of Python.

https://dropbox.tech/infrastructure/atlas--our-journey-from-...


> This complex ownership model made it hard to move just a subset of user data to another region.

Welcome to basically any large-scale enterprise.

I have come to learn that the active-passive strategy is the best option if the business can tolerate the necessary human delays. You get to sidestep so much complicated bullshit with this path.

Trying to force active-active instant failover magic is sometimes not possible, profitable or even desirable. I can come up with a few scenarios where I would absolutely insist that a human go and check a few control points before letting another exact copy of the same system start up on its own, even if it would be possible in theory to have automatic failover work reliably 99.999% of the time.


My fear with any sort of passive standby approach is that when the disaster comes, that standby won't work, or the mechanism used to fail over to it won't work. I prefer schemes where the "failover" is happening all the time hence I can be confident it works.


A solid active/standby design should regularly flop. Every 2 weeks seems to be a sweet spot. This also balances wear across consumable hardware like disks.

If your failover is happening "all the time" you basically just have a single system with failures.


Except for this:

> However, because of that choice, replication between regions is asynchronous—meaning the remote replicas are always some number of transactions behind the primary region.

Switching the active is a data loss event.

Or, you can choose to drain the load and wait for the replication to catch up, but that's downtime and not a test of the real failure mechanism.
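A sketch of that trade-off (the numbers are invented): with async replication, the replica trails the primary, so an unplanned switch drops the unreplicated tail, while draining first trades that loss for downtime:

```python
def lost_on_failover(primary_committed, replica_applied):
    """Transactions acknowledged by the primary but never applied on the replica."""
    return primary_committed - replica_applied

# Yank the cable mid-flight: the replication lag becomes lost data.
assert lost_on_failover(1_000_000, 999_850) == 150

# Drain traffic and let replication catch up first: no loss, but downtime.
assert lost_on_failover(1_000_000, 1_000_000) == 0
```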


Having a passive strategy doesn't mean you don't test it, and you can even perform actual failovers once in a while to validate everything.

Active-active is also valid, but the point is that it comes with a huge amount of increased complexity. At some point you need to make a value calculation to decide if you want to focus on that, or on building the product.


>Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline.

>Given this context, we structured our RTO—and more broadly, our disaster readiness plans—around imminent failures where our primary region is still up, but may not be for long.

IMO if the big one hits San Andreas, the SJC facilities will likely go down with ~0 warning. Certainly not enough time to drain user traffic.

It's interesting to note that Dropbox realistically can probably tolerate the loss of a few seconds to minutes of user data in a major earthquake, but cannot tolerate the same losses to perform realistic tests (just yank the cable, no warning).

If the earthquake hits at 3am in SF, it'll likely take both the metro and a significant number of the DR team out of the picture for at least a period of time. Surviving that kind of blow in the short term with 0 downtime is a very hard goal.


A random (large) earthquake along the San Andreas fault is not the same as all of western USA ripping apart. A much more likely scenario is that the power grid goes down and the data center stays reasonably intact for a while on emergency power.


I admire the forward thinking and application of common sense here. There's literally no better way of testing the system than this. It seems that a lot of big tech companies would never have the balls to do this themselves.


Netflix has Chaos Gorilla, Facebook has Storms(?), Google has DiRT. Everyone does this type of testing.


That is different to literally turning a data center off


I am curious why you think that isn't the case. I've done it multiple times - simulating mains power failures, total power failure, split brains, and fiber cuts to the building.


They seem to say that there is exactly one optical fiber going to exactly one device, which constitutes the entire connection for each of their data centers.

Is that correct? That seems very single-point-of-failureish to me.


I think it's just a simplification. For the purposes of the article, it doesn't matter if it's 1 or 100. They completely unplugged the network as a whole.

Later in the article they do say 'began to reconnect the network fiber', which implies multiple connections to me. They also took reference photos and ordered backup hardware. So I definitely don't think it was a single connection.


I wonder where that cable comes from and who maintains it.


This read to me as much like a history of technical debt as an article about current efforts.


DiRT


Take that, ChaosMonkey! This is King Kong


Chaos Gorilla has already existed for over a decade


Given that the blog doesn't load for me, I guess the datacenter remains unplugged?


gotta prepare for the climate crisis, as Douglas Rushkoff discovered, rich capitalists "are plotting to leave us behind"

https://onezero.medium.com/survival-of-the-richest-9ef6cddd0...

unpaywalled: https://archive.ph/AABsP



