Availability in an unreliable cloud

One of the first problems to solve when moving to the cloud is availability, or just plain uptime. It may seem strange that the cloud should not be stable, but this constraint is an enabler for solid solutions. By accepting unexpected service interruptions as a design requirement, the whole solution benefits. You have to design for failure and outages. Think Netflix and their Simian Army.

Outages in the cloud fall into two categories, failures and maintenance, and maintenance can in turn be planned or unplanned. To deal with this, and also to meet the SLA requirements, Microsoft has divided Azure into update domains and fault domains. Update domains are related to restarts due to maintenance and upgrades, while fault domains are closely related to hardware, such as a rack or power unit (https://azure.microsoft.com/nb-no/documentation/articles/virtual-machines-manage-availability/). It is therefore important to have multiple instances, at least one in each fault domain, preferably also one in each update domain. With instances spread across domains like this, a single hardware failure or maintenance restart will not take down the whole service.
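To make the placement rule concrete, here is a minimal sketch in Python. It is not an Azure API: the `Instance` type, the instance names and the domain numbers are made up for illustration. It only checks that the instances span at least two fault domains and at least two update domains, so that losing any single domain still leaves something running.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    """Hypothetical description of one running instance (not an Azure type)."""
    name: str
    fault_domain: int
    update_domain: int


def survives_single_domain_outage(instances: List[Instance]) -> bool:
    """Return True if at least one instance stays up when any single
    fault domain OR any single update domain goes down."""
    if not instances:
        return False
    fault_domains = {i.fault_domain for i in instances}
    update_domains = {i.update_domain for i in instances}
    # With two or more distinct domains of each kind, taking one domain
    # away always leaves at least one instance in another domain.
    return len(fault_domains) >= 2 and len(update_domains) >= 2


if __name__ == "__main__":
    spread = [Instance("web-0", 0, 0), Instance("web-1", 1, 1)]
    same_rack = [Instance("web-0", 0, 0), Instance("web-1", 0, 1)]
    print(survives_single_domain_outage(spread))
    print(survives_single_domain_outage(same_rack))
```

The check deliberately ignores the case where a fault and a maintenance restart hit at the same time. Covering that would need more instances than the minimum discussed here.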

The next step on the availability list is to survive datacenter outages. To deal with this, one can start using multiple datacenters, potentially from different cloud providers, e.g. one Azure datacenter and one Amazon datacenter. Your ability to move between cloud vendors will normally be limited by the choice of technology and implementation. Architecting for multi-vendor support will require a lot more work when it comes to operations and management. Any SaaS platform you depend on must have an equivalent offering from both vendors, and you also have to deal with their different performance characteristics.
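As a rough illustration of what using multiple datacenters means on the calling side, the sketch below tries a list of backends in order and returns the first answer it gets. Everything in it is an assumption made for the example: the backend names, the `call_with_failover` helper and the idea that each datacenter is wrapped in a plain Python callable. A real setup would more likely put this logic in DNS or a traffic manager.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllDatacentersFailed(Exception):
    """Raised when every configured backend has failed."""


def call_with_failover(
    backends: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (name, call) pair in order and return the name of the
    backend that answered together with its result."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call()
        except Exception as exc:  # treat any error as "this datacenter is down"
            errors.append(f"{name}: {exc}")
    raise AllDatacentersFailed("; ".join(errors))


if __name__ == "__main__":
    def azure_primary() -> str:
        # Simulated outage in the first datacenter.
        raise ConnectionError("datacenter unreachable")

    def amazon_secondary() -> str:
        return "ok"

    print(call_with_failover([
        ("azure-primary", azure_primary),
        ("amazon-secondary", amazon_secondary),
    ]))
```

This only covers reads or idempotent calls. It says nothing about keeping the data behind the two backends in sync, which is the hard part of a multi-datacenter design.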

Service overview

Here's a short list of some of the Azure services and how replication works.

Service	Comments
Azure storage	Azure storage comes in four different redundancy flavors, described at https://azure.microsoft.com/nb-no/documentation/articles/storage-redundancy/. Note that zone-redundant storage is only available for block blobs.
Azure SQL	By default SQL is replicated to 3 servers in each datacenter. At the premium level a SQL database can be set up with active geo-replication. The secondary SQL servers can also be used as a read source, but this may hold locks and cause transactions to fail on the primary. https://azure.microsoft.com/en-us/documentation/articles/sql-database-geo-replication-overview/#active-geo-replication-capabilities
DocumentDb	Each collection in DocumentDb resides on multiple servers, but there is no built-in geo-replication.
Redis	The basic Redis offering is not replicated. At the standard level there is a master and a slave. The recently announced Premium tier adds support for Redis clustering, sharding data over multiple nodes. https://azure.microsoft.com/en-us/documentation/articles/cache-how-to-premium-clustering/
Search	Search is divided into replicas and partitions. For read-write and indexing availability, at least 3 replicas are needed. Partitions relate to the storage volume.
Cloud services	The number of instances determines how well you can handle outages. You need at least one in each fault domain, preferably more.
Cloud services/Virtual machines	The number of instances determines how well you can handle outages. You need at least two in an availability set, with one in each fault domain.

As we see, all of these services provide some sort of replication. Some offer geo-redundancy, but in many cases you will not be able to control this in detail yourself. From a performance perspective, the landscape changes with each service. As an example, storage can be sharded manually, while sharding is an integral part of DocumentDb.
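Manual sharding of storage can be as simple as hashing a key to pick one of several storage accounts. The sketch below shows that idea under stated assumptions: the account names are invented, `pick_shard` is a hypothetical helper, and plain modulo hashing is used only because it is the shortest thing that works.

```python
import hashlib
from typing import Sequence


def pick_shard(key: str, shards: Sequence[str]) -> str:
    """Map a key to one of the configured shards using a stable hash."""
    if not shards:
        raise ValueError("at least one shard is required")
    # hashlib gives the same digest in every process, unlike the built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    # Hypothetical storage account names.
    accounts = ["storageacct0", "storageacct1", "storageacct2"]
    for customer in ("customer-17", "customer-18", "customer-19"):
        print(customer, "->", pick_shard(customer, accounts))
```

With modulo hashing, adding or removing an account moves most keys to a different shard. If the number of accounts is expected to change, a consistent hashing scheme or a lookup table is usually a better fit.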

Also, while a few of these services provide geo-replication, supporting writes is a completely different story. Especially if we go down the SaaS path, handling writes during outages is not trivial. It may often seem easier to use a PaaS and run software designed for multi-datacenter usage on top of that instead. I've started writing a post where we'll go through a sample setup showing how to build a solution on Azure SaaS and still survive a datacenter outage.

Harald Schult Ulriksen