Sounds like cloud is getting ever closer to dedicated servers. One would hope th...

neo01124 · on Aug 24, 2020

Hi!

Author of the article here.

The core concern is not about the capabilities of the compute abstraction being used (bare metal, containers or functions) or testing OS capabilities. The aim is to validate the mitigations that are in place to counter turbulent scenarios (for example, a massive spike in traffic, a network outage, a dependency being down, etc.). These scenarios generally originate outside the given system.

These kinds of questions should be asked and systematically validated (quoting the article):

* Have you tested how the system behaves when the underlying instances have a sustained CPU spike?

* Is the system behavior understood under different stress?

* Is there sufficient monitoring?

* Have the alarms been validated?

* Are there any countermeasures implemented? For example, is auto-scaling set up, and does it behave as expected? Are timeouts and retries appropriate?

fxtentacle · on Aug 24, 2020

I believe we just have a rather different approach here.

"Have you tested how the system behaves when the underlying instances have a sustained CPU spike?"

Since dedicated boxes are cheap, I'd just buy 5x the CPU resources that I reasonably need and call it a day. If there ever is a more than 5x traffic spike, then Docker will prevent it from being a noisy neighbor, so the affected services will just become slower than usual. But even a 10x traffic multiplier would just produce a 2x slowdown, which should be tolerable for most users.

I agree that on clouds you want to save costs by only booking what you need. But on bare metal, you can usually afford to keep spare capacity around all the time.

As such, I wouldn't plan for the system to behave well under stress. I'd try to always have enough resources around so that stress never happens. At the end of the day, this seems like a developer time vs. resource costs trade-off, and for most companies developers are scarce and resources are plentiful, so they'll have a very different trade-off from big FAANG companies.
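To make the arithmetic explicit, here is a minimal sketch of the model I have in mind. It assumes slowdown is 1x while traffic fits inside the provisioned capacity and grows proportionally once it doesn't; that linearity is a simplifying assumption, and `slowdown` is just an illustrative name.

```python
def slowdown(traffic_x: float, capacity_x: float) -> float:
    """Toy linear model: no slowdown within capacity, proportional beyond it.

    traffic_x  -- traffic as a multiple of the normal baseline
    capacity_x -- provisioned CPU as a multiple of what the baseline needs
    """
    if capacity_x <= 0:
        raise ValueError("capacity_x must be positive")
    return max(1.0, traffic_x / capacity_x)


if __name__ == "__main__":
    # With 5x over-provisioning, only traffic beyond 5x slows anything down.
    for traffic_x in (1, 5, 10):
        print(f"{traffic_x}x traffic -> {slowdown(traffic_x, capacity_x=5)}x slowdown")
```

Under this model a 10x spike on 5x capacity gives a 2x slowdown. Whether a real system degrades this gracefully is exactly the assumption being debated.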

"For example, is auto-scaling set up, and does it behave as expected?"

If your system is usually 90% idle, I wonder if you'll ever need that auto-scaling. Also, I'd say my customers can endure it if page load time goes up from 100ms to 200ms. So in my opinion, there is little need for auto-scaling for most companies.

ses1984 · on Aug 24, 2020

>"Have you tested how the system behaves when the underlying instances have a sustained CPU spike?"

You didn't really address this question, you addressed a different question, which is a traffic spike.

>Also, I'd say my customers can endure it if page load time goes up from 100ms to 200ms. So in my opinion, there is little need for auto-scaling for most companies.

100ms to 200ms average? What about the tail? Your app might go from a P99 of 500ms to a P95 that hits the timeout. That's when you'll lose customers.
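A rough sketch of what I mean, with made-up numbers. The latency samples and the 1000ms timeout are assumptions chosen for illustration, and `percentile` is a plain nearest-rank implementation rather than anything from a monitoring tool.

```python
import statistics

TIMEOUT_MS = 1000  # hypothetical client-side timeout


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


# Made-up latencies in milliseconds, 100 requests each.
baseline = [80] * 98 + [500] * 2          # normal day
loaded = [150] * 94 + [TIMEOUT_MS] * 6    # under stress, 6% hit the timeout

for name, samples in (("baseline", baseline), ("loaded", loaded)):
    print(
        name,
        "mean:", statistics.mean(samples),
        "P95:", percentile(samples, 95),
        "P99:", percentile(samples, 99),
    )
```

In this toy data the mean only goes from about 88ms to about 201ms, which looks like a harmless 2x, while the P95 moves all the way to the timeout.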

fxtentacle · on Aug 24, 2020

If the underlying hardware is a bare metal server, it won't magically turn slow and have a CPU spike. That problem is caused by noisy neighbors and is kind of exclusive to clouds.

Well, with the 2x example, my app might go from a 1s P99 to a 2s P99, which feels slow but is still doable. Again, those timeouts are usually introduced by cloud infrastructure. For example, if you use nginx outside of Heroku, it won't have a 30s timeout for file downloads.

ses1984 · on Aug 24, 2020

Your own instances can have an unexpected CPU spike.

Even if you're running on bare metal, I find it hard to believe you don't have a layer with short timeouts between your front end and back end.

fxtentacle · on Aug 24, 2020

Why would I? I have redundant 1GBit LAN cables between front end, back end, and database servers.

ses1984 · on Aug 24, 2020

Because it's bad UX for your users to see a spinning loading icon forever.

fxtentacle · on Aug 24, 2020

And a timeout error would be better?

ses1984 · on Aug 25, 2020

In my experience, yes, a lot better.

neo01124 · on Aug 24, 2020

No, there is no different approach. You are misunderstanding what is being addressed in the article. This is not about bare metal vs cloud or autoscaling vs no-autoscaling/overscaling or developer time vs resource costs.

The article talks about injecting failures at various points in the system, understanding how the system behaves under this stress, putting in counter-measures for the resulting problems, and eventually re-running this to validate those counter-measures.
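As a minimal sketch of that loop (inject, observe, mitigate, re-run): everything here is hypothetical, including `FlakyDependency`, the retry count, and the choice of `ConnectionError` as the injected fault. It is not the tooling from the article.

```python
class FlakyDependency:
    """Hypothetical downstream call that fails its first `failures` invocations."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected failure")
        return "ok"


def call_without_mitigation(dep):
    """Client as first written: one attempt, no countermeasure."""
    return dep()


def call_with_retries(dep, attempts=3):
    """Client after adding a countermeasure: bounded retries."""
    last_error = None
    for _ in range(attempts):
        try:
            return dep()
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(client, failures):
    """Inject `failures` consecutive faults; report whether the client survived."""
    dep = FlakyDependency(failures)
    try:
        client(dep)
        return True
    except ConnectionError:
        return False


if __name__ == "__main__":
    print("no mitigation, 2 faults:", run_experiment(call_without_mitigation, 2))
    print("with retries,  2 faults:", run_experiment(call_with_retries, 2))
    print("with retries,  3 faults:", run_experiment(call_with_retries, 3))
```

The last run is the interesting one: re-running the experiment with a harsher fault shows where the countermeasure itself stops working.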

fxtentacle · on Aug 24, 2020

I did understand what the article is about. But you only need to worry about failure under stress if you are driving your system close to the hardware's limit.

In the virtual cloud world, that is common, because you rent the cheapest instance that will be big enough. In the bare metal world, that is rarely the case, because you usually get a Ryzen with 16+ cores and 128+ GB of RAM. In that case, there's no point in checking what will happen to your 200 MB web app if there's a CPU spike. It'll be just fine because the hardware can handle 10x the load without a hiccup.

Similarly, if your page load time is dominated by internet latency, it doesn't matter if your CPU needs a few more ms to spit out the page HTML. So there, a 2x CPU usage increase will be barely noticeable to the user.

bpicolo · on Aug 24, 2020

> Since dedicated boxes are cheap, I'd just buy 5x the CPU resources that I reasonably need and call it a day

This isn't the case when your baseline is 6-7+ figures' worth of machines.

fxtentacle · on Aug 24, 2020

I fully agree. I'm usually working with companies in the $10M to $100M ARR range. So obviously my way of doing things will fail at Amazon's scale. But let's face it, most developers are not at FAANG but at normal mid-sized companies.

strgcmc · on Aug 24, 2020

Let's try to tie together what you're talking about (auto-scaling/capacity) with what the OP and blog post were mainly about (chaos engineering/engineering-for-failure). Imagine:

- You operate a service with significant traffic, and through empirical experience, you have a good handle on what 1x traffic looks like, and have even seen spikes to 2x traffic on rare occasions, which your overall system handled just fine. Applying your overall philosophy, you set up your system to allow for 5x the CPU resources you need, and call it a day, nothing to see here.

- But guess what? Unbeknownst to you, your system has some critical bottleneck that would only surface at 3x your usual traffic. It could be anything: hitting some misconfigured max limits on your load balancer, exhausting all your database connections, running out of threads or inodes on your server hosts, triggering a kind of retry storm/brownout due to slowly increasing latency in one of your service calls that only explodes past a certain limit (due to some unintended interaction with your core timeout/retry logic), or any number of latent potential bottlenecks that you never knew about because, as long as your system stayed under the critical limit, they were completely invisible to you. In other words, these are non-linear failures that you cannot simply solve by extrapolating out with "1x traffic = 1x # of servers, 5x traffic = 5x # of servers".

- As a result, not only do you not have nearly as much headroom for scaling up as you think you do, but ALSO when you do encounter such a failure, you cannot easily just "scale out" horizontally, because the failure mode itself is only exacerbated by horizontal scaling. When you encounter such failures that break some axiomatic assumptions you have about your system, it can be incredibly difficult/painful to reconcile, especially if you had no plans and no knowledge about these invisible/latent aspects of your system ahead of time.
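A toy sketch of the scenario above, to make the non-linearity concrete. All the numbers, the resource names, and the retry model are assumptions invented for illustration; real retry storms depend on timeouts, backoff, and queueing behavior that this ignores.

```python
def effective_headroom(limits):
    """The system saturates at its tightest limit, whatever the CPU headroom is.

    limits maps a resource name to the traffic multiple at which it runs out.
    """
    name = min(limits, key=limits.get)
    return name, limits[name]


def offered_load(traffic_x, bottleneck_x, retries):
    """Toy retry model: every request beyond the bottleneck fails and is
    retried `retries` times, so failures add load instead of shedding it."""
    if traffic_x <= bottleneck_x:
        return traffic_x
    failed = traffic_x - bottleneck_x
    return traffic_x + failed * retries


# Hypothetical limits, expressed as multiples of normal traffic.
limits = {"cpu": 5.0, "db_connections": 3.0, "lb_max_conns": 4.0}

if __name__ == "__main__":
    print(effective_headroom(limits))            # tightest limit wins
    # Doubling the app servers raises the CPU limit only.
    print(effective_headroom({**limits, "cpu": 10.0}))
    # Just past the bottleneck, retries amplify the load the system sees.
    print(offered_load(3.5, bottleneck_x=3.0, retries=3))
```

With these numbers the real headroom is 3x, not 5x, adding servers doesn't move it, and 3.5x of real traffic turns into 5x of offered load once retries kick in.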

Chaos engineering isn't about scaling at all, not really. It's about finding latent defects in your system, by actively probing your assumptions and seeing if your system behaves as you would expect. Using traffic to generate stress on the system is just one way to introduce some "chaos", but there are many other ways too (as covered in the article).

Of course, it's also true that systems need to reach a certain minimum level of complexity, before the ROI of introducing chaos engineering becomes really worth it. You need to have a complex-enough set of services, dependencies, or interconnected components that are likely enough to behave in non-obvious ways, that you have to do independent chaos engineering to test them effectively, rather than simply reasoning about their properties directly.

fxtentacle · on Aug 24, 2020

I wholeheartedly agree with your last paragraph.

My experience is that I have yet to work with a company where this level of failure-proofing makes financial sense. Purchasing more hardware than needed is relatively cheap for most medium-sized companies, and it provides a fair level of protection against outlier accidents.

I'm aware that many people using cloud also subscribe to the 100% uptime mentality, but for most companies that is simply not needed. I mean, even for Netflix or Amazon Prime Video, I wonder if 2 hours of unexpected downtime per year would really be enough to make anyone cancel their service. I myself at least have spent much more time than that trying to get HDCP graphics card drivers, HDMI cables, and the stars to align so that the Netflix app will work with 4K HDR playback on my TV.

So yes, (your 2nd paragraph) I would knowingly accept that there are critical bottlenecks that are unknown and that could be triggered by severe traffic spikes. And most of my customers would be happy to accept that risk in exchange for the cost savings of not proactively fixing the issue.

And if you look at the overall state of software, it looks like pretty much every company is happy to trade reliability/resilience for cost savings these days. That's why I applaud the efforts in the original article, but the pragmatic way seems to be to just skip the whole thing.