Reliability on the cloud

When a grandmother’s foodblog has better availability than that expensive business website, CIOs feel worse than foolish — they feel incompetent. Better to join the bandwagon of 99.99999% reliability.

Image by DALL-E-2 “grandmother blog sunrise sketch style”

I’m a declared cloud enthusiast but I’ve spent enough time with complex setups on and off the cloud to have seen the warts on both sides. I mentioned in a previous post that there were ONLY five reasons to move to the cloud (though they are very compelling reasons); this post is all about the second of those.

Back to those trailing nines that make CIOs feel incompetent. Much noise is made about how the cloud is so much more reliable than on-premise, and in practice, it does indeed seem to pan out. However, like most things in life, there are some nuances.

About that 99.99999%. It’s nonsense.

Ok, let’s add the nuance. The seven nines are for AWS or GCP as a whole, not for your personal setup or your grandmother’s blog. Your personal setup is what is called a “single instance”, and here what the clouds offer as an SLA is no different from what (say) Dell offers on-premise: a rather unexciting 99.5%. One can be down for nearly four hours in a month before alarms go off in the SLA world. Indeed, for a private setup with two servers load balanced, GCP offers an SLA of 99.99%: a little better, but hardly earth-shaking.
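To make the "nearly four hours" concrete, here is a minimal sketch of the arithmetic. The function name and the 30-day month are my own assumptions for illustration; real SLAs define their measurement periods in their own fine print.

```python
def allowed_downtime_hours(sla_percent: float, days: int = 30) -> float:
    """Hours of downtime an SLA tolerates over a period of `days` (assumed 30-day month)."""
    return (1 - sla_percent / 100) * days * 24


# Compare the SLA levels mentioned in this post.
for sla in (99.5, 99.95, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per month")
```

At 99.5% the budget works out to about 3.6 hours a month; at 99.99% it shrinks to a little over four minutes.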

So what’s the fuss all about? Why am I saying reliability is a good reason to move to the cloud?

I’m going to go three ways in understanding reliability on the cloud. First I’ll look at what instance-reliability really means, then explore application resilience and finally look at some unique cloud propositions that provide rocket boosts to reliability.

Instance Reliability

Back to my grandmother. In practice, her cloud-hosted blog is going to be way more reliable than the SLA of 99.5% seems to suggest; indeed I doubt she would see any downtime at all. Remember, there’s no particular magic underlying the cloud — in the end they’re just servers in data centers with lots of virtualization layered on. On-premise, your support provider will offer 99.5%, so why does it surprise you that cloud offers the same? AWS, in most cases, is designed (not SLA) with 99.95% availability — nicer but still not great.

Here comes the load balancer, which turns 99.95% into 99.999975%. Of course, the load balancer itself has an availability (99.99% in AWS) but on the cloud, life is a bit easier — there are multiple zones and ways to reroute traffic past a failed load balancer. Indeed, it’s pretty easy to set up a configuration of servers and load balancers that auto-recover from failure and thus have those magical seven nines of reliability. No different, really, from on-premise (it would just be a lot more work there and you have to first purchase all the pieces, zones and data centers).
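The jump from 99.95% to 99.999975% is just the arithmetic of redundancy. Here is a rough sketch, assuming the two servers fail independently of each other (a simplification, since real failures are often correlated); the function name is mine.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas when any one of them is enough."""
    # The cluster is down only if every replica is down at the same time.
    return 1 - (1 - single) ** copies


pair = parallel_availability(0.9995, 2)
print(f"two servers:            {pair:.6%}")

# A single load balancer in front sits in series with the pair,
# so its own availability (99.99% here) caps the result.
behind_lb = 0.9999 * pair
print(f"behind one load balancer: {behind_lb:.6%}")
```

The second figure shows why the load balancer itself needs a fallback: without one, the whole setup is only about as good as the load balancer.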

This is the crux of the matter. Today, the cloud allows you to create huge clusters where individual components may fail but the cluster keeps going, and web-based applications are particularly suited to easy clustering. That’s how the 99.99999% becomes a reality for you.

I must also mention that 3-tier apps have a slightly lower availability than single-tier apps since all three tiers must be available for a 3-tier one to work. I’ll discuss this later in the security post, but you’re better off merging all the layers (or at least the web and app layers).
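The reason is that tiers in a chain multiply. A quick sketch, assuming each tier is independently 99.95% available (an illustrative figure, not a measured one):

```python
from math import prod


def series_availability(tiers):
    """Availability of a chain in which every tier must be up at once."""
    return prod(tiers)


single_tier = series_availability([0.9995])
three_tier = series_availability([0.9995, 0.9995, 0.9995])
print(f"single tier: {single_tier:.4%}")
print(f"three tiers: {three_tier:.4%}")
```

Three tiers at 99.95% each come out at roughly 99.85%, which is about three times the downtime of a single tier.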

Application Reliability

If you were a little disappointed by the reliability of servers on the cloud relative to your data center, let me cheer you up by talking application. Now applications don’t fail the same way as hardware — they crash, hang, choke or become unresponsive — and can often be restored with a restart. If so designed, applications can be horizontally scaled and health-check services can just shut down or restart failed instances of an application while the other instances continue. Tied to an autoscale architecture, the whole thing can become self-healing.
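As a toy illustration of that self-healing idea, here is a sketch of one health-check pass over a small fleet. Everything in it (the `Instance` class, the `heal` and `serving` helpers, the instance names) is hypothetical; a real setup would lean on the cloud's own health checks and autoscaling rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True
    restarts: int = 0


def serving(instances):
    """Names of the instances that can currently take traffic."""
    return [inst.name for inst in instances if inst.healthy]


def heal(instances):
    """One health-check pass: restart every unhealthy instance, return their names."""
    restarted = []
    for inst in instances:
        if not inst.healthy:
            inst.healthy = True  # stand-in for a real restart or replacement
            inst.restarts += 1
            restarted.append(inst.name)
    return restarted


fleet = [Instance("web-1"), Instance("web-2"), Instance("web-3")]
fleet[1].healthy = False  # simulate a crash or hang on one instance

print("still serving:", serving(fleet))  # the other instances carry on
print("restarted:", heal(fleet))
print("serving after heal:", serving(fleet))
```

The point is that the failed instance never takes the application down: traffic goes to the healthy ones while the broken one is restarted.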

The cloud makes this particularly easy with all the configuration flexibility it offers. Autoscale, availability zones, routing choices, continuous health monitoring: all these services are readily available and easy to stitch together into an application configuration that is pretty bulletproof against both code issues and load problems. Databases can play a bit of a spoilsport here (they’re hard to scale beyond a point), but the cloud conveniently offers caching services that make life easier.

Then there are CI/CD services that make things like rolling application patching and code deployment a breeze, thus getting rid of those pesky “planned” downtimes as well.

The real magic of the cloud in ensuring application availability, really, is that it puts all these tools on offer for you to take advantage of. I can tell you from long experience: all this is possible on-premise as well, but assembling all these pieces usually proves incredibly hard and takes forever.

Service Reliability

Modern applications aren’t applications in the old sense; they’re more like conductors of an orchestra of services. And this is where the cloud shines: there are a lot of services to orchestrate. Instead of the traditional application running on a server (or, if you’re a bit more modern, in a container) doing all kinds of complicated things, you now have a collection of “micro” services doing everything from computing to getting coffee.

The thing about the cloud is — many of these services come ready to be consumed. From what I have seen of the world, there are two camps — the modern camp that consumes what is already built and the traditional camp that wants to do everything themselves (including their own versions of those services). Take even simple things like sending an email — most companies take the effort to set up their own SMTP servers, calling it a microservice and feeling quite pleased when it works. And they do this even on the cloud.

You want reliability? Use the standard services. If you can avoid running it, avoid running it (and you almost always can). Emails, SMS, queues, caches, health checks, serverless functions: the more your application architecture consumes but does not run, the more reliable things will be. Since the individual services provided by the cloud tend to be highly reliable (>99.99% and with auto-recovery built in), everything is less likely to fail.

Take storage, for instance. If you’re using S3 instead of EBS you will get a huge boost in durability. General Purpose EBS storage is familiar to server and application architects (who are often unnecessarily obsessed with IOPS) but is only 99.8% durable (you can get much more but at a cost). S3, on the other hand, automatically keeps multiple copies of each object, so it is designed for a durability (the ability to ensure an object is not destroyed) of 99.999999999% even at the base offering. And there’s also this rather useful ability of object storage to act as an extremely reliable static web server.

One thing I highly recommend — refactoring your application to make much greater use of lambda functions. It is one of the most under-used things in the cloud arsenal (and quite inexpensive to boot). Many things that people program behind an app server can become lambda functions rather easily while dramatically boosting app performance and reliability.
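To give a feel for how little code a lambda function needs, here is a minimal sketch in Python. The greeting logic and the `name` parameter are made up for illustration, and I am assuming the function sits behind an HTTP front end that passes query parameters in `queryStringParameters` and expects a `statusCode` and `body` back.

```python
import json


def lambda_handler(event, context):
    """Hypothetical handler: greet the caller named in the query string."""
    # The query parameters may be missing entirely, so fall back to an empty dict.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}"}),
    }


# Local smoke test; on the cloud the platform supplies event and context.
print(lambda_handler({"queryStringParameters": {"name": "grandma"}}, None))
```

There is no server to patch, scale or restart here: the platform runs the function on demand, which is exactly where the reliability boost comes from.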

One thing I don’t recommend: relational database services. Sure, they’re very reliable, but RDBMS applications tend to be quite noisy, so the meter ticks fast and furiously. Use them in a pinch, but be very careful.

Summary

If you want to toss nines into the face of your CEO, the cloud really does offer a lot. Don’t repeat what you learned a decade ago and expect magical answers, but with a bit of work and some rethinking, my grandmother’s foodblog is not out of reach. What the cloud allows you to do much better than on-premise is to create an application that scales horizontally, auto-responds to demands on capacity and auto-recovers from failure. Add to this a touch of automated CI/CD and a dollop of automated testing and you’ve cooked up a failproof app.

One caveat, though. My grandmother (rest her soul) died long before blogs were invented and any resemblance to any other grandma is purely accidental.