Designing Resilient Systems

Posted by Daniel Lockhart

I recently returned from Velocity: New York, and I have been thinking about how many of the talks seemed eerily related to what I have been focused on at work here at Verizon Digital Media Services: reliability and how to prevent system failures. I am sure a lot of that was something like the Baader-Meinhof Phenomenon at work (where you keep noticing something soon after you first learn about it), but there is still something comforting about knowing others are facing similar problems. I have reliability on my mind, and I am seeing it everywhere.

Building an Antifragile System

There were two talks in particular that planted ideas that have been bouncing around in my head ever since I got back to California. The first was titled “Resilience in Complex Adaptive Systems: Operating at the Edge of Failure” by Richard Cook. The second was “Conditions of Failure: Building Antifragile Systems and Organizations” by Dave Zwieback. While each talk approached the topic from a different angle, they both addressed reliability in the way we face it here at EdgeCast.

When first building out an infrastructure, the initial focus is redundancy: having two database servers for failover, for example, or maybe even a second set of servers ready to go in a different datacenter. This way, if a hard drive on a server fails or a datacenter loses connectivity, you have a backup ready to take over. However, downtime from a lack of redundancy is something we rarely face at EdgeCast. We have thousands of servers all over the world, and hard drive failures and connectivity issues are problems we solve every day.

What keeps us up at night is the unknown risk of systemic failure.

The Black Swan, or: the risk of systemic failure

One such failure Dave Zwieback mentioned in his talk was the leap second bug that hit many online services last year. He categorizes this as a Black Swan Event, a phrase popularized by Nassim Nicholas Taleb to describe a (practically) unpredictable and rare event that has an outsize impact. When these sorts of events occur, our natural reaction is to try to convince ourselves that the event COULD have been predicted and prevented.

We think that if only we controlled our environment more rigidly, we could have prevented the outage. We just need more process, more structure, and more safeguards against unexpected changes; then our environment will stay in the same known-good state and we will never have an outage.

Of course, taking these sorts of actions will never actually prevent outages from occurring. In fact, as Zwieback explained in his talk, attempting to prevent unexpected changes from occurring in your environment by creating a more and more restrictive change process will actually grind your company to a halt. This is because, as he puts it, “The changeability of a system is the root cause of both all functioning systems and all malfunctioning systems.”

Making systems more tolerant of unplanned change

We need change in our systems for them to work, both in the pedantic sense of the constant change of 1s and 0s in the CPU and in the practical sense of allowing new features and improvements to be deployed. Fear of change slows down the rate of feature deployments, which in turn makes the changes that do occur riskier, because the scope of each change necessarily increases as the rate of change decreases. As each change becomes riskier, even more process is added and the rate of change decreases further, leading to a downward spiral until nothing is accomplished and the business stagnates.

What is actually needed is to work towards making your system MORE tolerant of unplanned and unpredictable change rather than trying to prevent it completely. Richard Cook calls this sort of system “Resilient” while Dave Zwieback calls it “Antifragile.” Whatever you call it, you accomplish this by increasing your rate of expected change, which in turn will require your systems to be designed to handle rapid change.

Handling the unexpected variety

If your system can handle rapid EXPECTED change, it will be much better at handling the unexpected variety. The system will necessarily have to make fewer assumptions about its current state, and you will have to design it so that it can recover when the assumptions it does make turn out to be wrong.
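One way to picture "fewer assumptions about the current state" is a routine that checks what is actually there on every pass instead of trusting a record of what it did last time. The following is only a minimal sketch of that idea in Python, not something described in either talk or a description of how EdgeCast works; the `Node` class, the `reconcile` function, and the version strings are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a server; `version` is what it actually runs."""
    name: str
    version: str
    healthy: bool = True


def reconcile(nodes, desired_version):
    """Move every reachable node to desired_version.

    Nothing is assumed about where each node currently is: the observed
    state is compared with the desired state on every pass, so running
    this again after a surprise is always safe.
    Returns the names of the nodes that were changed on this pass.
    """
    changed = []
    for node in nodes:
        if not node.healthy:
            # Unreachable right now; a later pass picks it up if it recovers.
            continue
        if node.version != desired_version:
            node.version = desired_version
            changed.append(node.name)
    return changed


fleet = [Node("a", "1.0"), Node("b", "1.2"), Node("c", "0.9", healthy=False)]
first_pass = reconcile(fleet, "1.2")   # only "a" needs work
fleet[2].healthy = True                # "c" returns in a state nobody planned for
second_pass = reconcile(fleet, "1.2")  # the same code brings it in line
third_pass = reconcile(fleet, "1.2")   # nothing left to do
```

Because each pass starts from observation, a node that comes back in an unplanned state is handled by the same path as a routine rollout, which is the property the paragraph above is arguing for.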

This is, of course, not an easy task, especially in large-scale environments where the cost of any failure is huge. It is much easier to continue down the path of minimizing changes to your system and exerting tighter control. In the short term, this might actually reduce problems; however, it only increases the risk of major catastrophic failures down the road.

Walking the narrow path between preventing the smaller, short-term failures and avoiding the unpredictable, long-term ones is certainly a challenging proposition. Of course, challenging problems are why we look forward to coming to work every day.