Patterns and Practices for Cloud development

Earlier this month, the developer team at Microsoft Norway managed to hijack Masashi Narumoto and Christopher Bennage from the Patterns and Practices team while they were in the UK. Their short stop in Oslo was highly appreciated and drew a packed theater.

Now, we might all have our opinions on the Enterprise Library, whose mere name can invoke horror and shrugs. However, when it comes to the cloud, the team's approach seems more aligned with modern development. They have two major artifacts to start from. One is Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications, a book available online. Besides the book, they also publish code and other artifacts on GitHub, including work in progress.

Although it was perhaps the weakest part of the presentation in Oslo, the discussion of constraints and problem areas when moving to the cloud is quite important. You can find a list of these at https://msdn.microsoft.com/en-us/library/dn589772.aspx. We have experienced several of them ourselves and have found solutions for quite a few. A short introduction to some of them is in the table below.

Problem	Why	What we've done
Availability	Things die, lose connections, or reboot during updates.	We run multiple instances in each datacenter, and we try to use at least four instances for every end-user-facing service. It must also be possible to remove and add instances without affecting user data such as state, which is easier with a slim user-facing layer than with a monolith. Because a whole datacenter can go down as well, our solution is active in multiple datacenters at a time.
Data consistency	To survive multiple datacenter outages, data must be available in many places. How do you keep it in sync?	To keep data in sync, and to allow writes during an outage followed by a re-sync, we've decided to use topics and queues. We're also positioned to do a complete re-sync of internal data to the cloud. Idempotent operations are part of this solution: we can always replay a message and arrive at the same final result.
Messaging	Splitting a monolith across process and application boundaries. What do you do with asynchronous operations?	Async messaging means we do a minimal amount of work on each request, which gives fast responses. A fast response also means that each server handles fewer simultaneous requests, or at least keeps the request queue short. On every write operation we do a quick validation of the request before passing a command onto the message bus and returning a 200 or 204 response. This lets us take the database offline for upgrades, the queue protects our database and underlying systems from overload, and we can trigger multiple actions on each message.
Management and monitoring	Splitting your applications into smaller parts, or microservices, means there are more knobs and gauges.	We log a lot. Because Azure Application Insights was not available when we started out, we collect most of our data in Elasticsearch with Kibana on top. We have created several custom dashboards that monitor queues, response times, etc. Finally, most of our services expose a health check endpoint that delivers red/green status information on underlying services.
Performance	There is no reason to pay for fixed capacity when you can scale with usage.	We have a fairly fixed usage pattern. It starts early in the morning with radio streaming, and the kids also watch a bit of TV before kindergarten. During the daytime our radio users switch from mobile to desktop listening. After school our peak period starts, and it lasts until close to midnight. Running at full throttle during the night and the daytime does not make much sense. Time-based scaling, combined with some extra power during elections and similar events, makes the cloud a nice place to be. However, measuring is very important.
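
To make the idempotency point in the data consistency row a bit more concrete, here is a minimal sketch in Python. It is not our production code: the Message shape, the in-memory Store and the replay helper are all hypothetical names invented for illustration. The idea it shows is that each message carries a unique id and the full new state of an entity, so applying the same log twice leaves the data unchanged.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape: a unique id plus the full new state of one entity.
    message_id: str
    entity_id: str
    payload: dict


@dataclass
class Store:
    # In-memory stand-in for a real data store (an assumption for this sketch).
    entities: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)

    def apply(self, msg: Message) -> bool:
        """Apply a message. Returns False if it had already been applied."""
        if msg.message_id in self.seen:
            return False  # duplicate delivery or replay: nothing to do
        # Upsert the full state, so the write itself is also safe to repeat.
        self.entities[msg.entity_id] = dict(msg.payload)
        self.seen.add(msg.message_id)
        return True


def replay(store: Store, messages: list) -> int:
    """Replay a message log in order and return how many messages changed the store."""
    return sum(1 for m in messages if store.apply(m))
```

Replaying the same log into a store that has already seen it applies nothing new, and replaying it into an empty store rebuilds the same final state. That property is what makes a full re-sync safe.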

The focus on problem areas during the Patterns and Practices summit in Oslo served as a primer and background for the subsequent discussion of specific patterns. Some of them are covered in Release It! by Michael T. Nygard as well. The topics discussed during the conference were circuit breaker, messaging, polyglot storage, CDN and caching. I will write about all of these and how we have applied them.
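
As a small preview of the messaging topic, the write path from the table (quick validation, command onto the bus, immediate response) can be sketched roughly like this in Python. Everything here is an assumption for illustration: the standard library queue stands in for a real message bus, the request fields are made up, and the 400 and 503 responses are my own choice for the failure cases. Only the 204 on success comes from what we actually do.

```python
import queue


def handle_write(bus: queue.Queue, request: dict) -> int:
    """Validate quickly, enqueue a command, and return an HTTP status code."""
    # Cheap validation only: no database access on the request path.
    if not isinstance(request.get("user_id"), str) or "action" not in request:
        return 400  # assumed response for an invalid request
    command = {"user_id": request["user_id"], "action": request["action"]}
    try:
        bus.put_nowait(command)
    except queue.Full:
        return 503  # assumed response when the bounded queue is full
    return 204  # accepted; the real work happens later


def drain(bus: queue.Queue, handler) -> int:
    """Process queued commands, for example once the database is back online."""
    handled = 0
    while True:
        try:
            command = bus.get_nowait()
        except queue.Empty:
            return handled
        handler(command)
        handled += 1
```

The bounded queue is what shields the systems behind it: the web layer keeps answering quickly, and the consumer drains commands at whatever pace the database can handle.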

My next post will deal with outages and availability and should be ready tomorrow.

Harald Schult Ulriksen