Monday, March 31, 2008

Getting 9s behind the decimal point - a guide to building on flaky web-services

Working at Amazon.com, we obsess over SLAs for our services. However, the truth is that Service Oriented Architectures are much more susceptible to downtime than you might think. Each critical service that a system depends on is one more opportunity for failure of the whole. Coupled with the inherent unreliability of networks and hardware in general, this is a recipe for disaster. In this article, I will describe some non-intuitive techniques you can use to increase your uptime.

Problem Statement

Assume you are building a simple application that depends on 10 critical web services to complete a cash transaction. Each web service guarantees an uptime of 99.99% (an acceptable standard for most web applications today). Assuming statistical independence between the web services, your application will have an uptime of 99.99%^10 ≈ 99.9%. That means 0.1% of your customers will get turned away. Demanding higher SLAs from your dependent web services is possible, but you will be hard pressed to find a network + load balancer + hardware + OS combination with an uptime greater than 99.999%. That means if you plan to have an uptime of more than 99.99%, there is NO way you can depend on more than 10 other web services. In a real situation, where you may depend on 100 or more services, your uptime suddenly drops to 99%, which is absolutely unacceptable for businesses.
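The compounding above is easy to check for yourself. A quick sketch (assuming statistically independent services, as in the example):

```python
# Compound availability: with n independent critical dependencies,
# each available a fraction p of the time, the whole application is
# only up when ALL of them are up: p ** n.
def compound_availability(p: float, n: int) -> float:
    return p ** n

ten_deps = compound_availability(0.9999, 10)      # ~99.9%
hundred_deps = compound_availability(0.9999, 100)  # ~99.0%
```

Note how quickly the 9s erode: the per-service SLA stays at 99.99%, yet the application's availability is dominated purely by the number of dependencies.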

How do I build a highly available application if the web services I depend on are flaky?


4 Non-intuitive ways to increase uptime:


1. Client-Side Caching 

What? Caching? Many web services are read-only services that are cacheable on the client side. Fewer calls to the service decreases the chance of running into random problems such as hosts rebooting, network drops, etc. In the toy example of your application calling 10 other services, assume that half of the services can have caching turned on with a 70% cache hit rate; your application's uptime would increase from 99.9% to 99.935%, a 35% reduction in downtime. In reality, most modern applications can achieve much higher hit rates due to larger caches.
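The 99.935% figure can be verified with a little arithmetic, assuming independence and that a cached call only fails to the user when it is a cache miss AND the service is down at that moment:

```python
# Effective availability of one dependency when a cache absorbs a
# fraction of the calls: a user-visible failure requires both a
# cache miss (probability 1 - hit_rate) and an outage (1 - p).
def effective_availability(p: float, hit_rate: float) -> float:
    return 1 - (1 - p) * (1 - hit_rate)

uncached = 0.9999 ** 10                                        # ~99.900%
cached = 0.9999 ** 5 * effective_availability(0.9999, 0.70) ** 5  # ~99.935%
```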

Client-side caching also improves latency, and latency is one of the main reasons why applications fail under load. When an application starts getting lots of hits, it is more likely to suffer from situations such as thread starvation. By reducing latency, you free up your threads to accept more requests. Furthermore, "surges" of internet traffic are usually caused by many people trying to "do the same thing", so your cache hit rate increases too.

Moral of the story - investing in caching makes sense. 
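For the curious, a minimal sketch of what such a client-side cache might look like. The fetch function and its key are hypothetical stand-ins for a real web-service call:

```python
import time

# Minimal client-side TTL cache sketch. On a hit, no network call is
# made at all, which is where both the availability and latency wins
# come from.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, fetch_fn):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[0] > now:
            return entry[1]           # cache hit: skip the service entirely
        value = fetch_fn(key)         # cache miss: call the web service
        self.store[key] = (now + self.ttl, value)
        return value
```

Usage would look something like `cache.get("sku-123", lookup_price)`, where `lookup_price` is the actual remote call.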

2.  Fallback logic

Face it: it's better to have "dumbed-down" behavior than a user experience that results in an error page. In the case that a service fails, it's better to "fall back" to some simpler logic built into the client application. If fallback logic was used to calculate things that involve money, your application can note that a fallback was used and re-invoke the failed operation at a later time. If a correction needs to be made, you can notify the user retroactively and ask them to re-confirm their decisions. Although shipping "client-side logic" to your clients defies conventional wisdom about web services, clients that are truly interested in high availability will see the value of being able to run in a "dumbed-down" way.
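As a sketch of the idea (the tax calculation, the flat local rate, and the function names are all hypothetical examples, not a real API):

```python
# Fallback sketch: try the remote service; on failure, fall back to
# dumbed-down client-side logic and record the order for later
# reconciliation, so the real calculation can be re-invoked.
pending_reconciliation = []

def get_tax(order_id, amount, tax_service_call, local_rate=0.08):
    try:
        return tax_service_call(order_id, amount)
    except Exception:
        # Dumbed-down behavior: a flat local estimate. The order is
        # noted so the failed operation can be retried retroactively.
        pending_reconciliation.append(order_id)
        return amount * local_rate
```

A background job could later drain `pending_reconciliation`, re-invoke the real service, and notify affected users if a correction is needed.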

3. Avoid Enterprise Service Buses - Stick with HTTP + DNS + Hardware Load-balancer

Despite the fancy presentations by ESB providers, ESBs simply have not been proven to scale. If the web service you depend on insists on using an ESB, avoid it like the plague. Furthermore, sticking with proven standards means you get a good choice of well-vetted technology built on top of them (Apache, Jetty, Tomcat, data-center failover).

4. Use DNS Round Robin

The most likely single point of hardware failure is the load balancer. I repeat: the most likely single point of hardware failure is the hardware load balancer. Unless your company has a load of money to spend on software load balancers or on client-side load-balancing research (clients collectively deciding how to route requests to separate boxes), a web service is probably better off having redundant load balancers with DNS entries pointing to both. Web-service clients should set their connection timeouts low (e.g. 50 ms) and re-use connections. In the event of a load-balancer failure, clients time out trying to connect to the failed load balancer and succeed in making a connection to the second. Because connections are re-used, subsequent calls will use the working load balancer. Insisting that your web-service provider support DNS round-robin and redundant load balancers is usually a minor investment they are willing to make.
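The client side of this technique can be sketched in a few lines (the endpoints are placeholders; a real client would resolve them from the round-robin DNS name):

```python
import socket

# Failover sketch: try each load-balancer endpoint with a short
# connect timeout; remember which one worked so that subsequent
# calls go straight to the healthy load balancer.
class FailoverConnector:
    def __init__(self, endpoints, timeout=0.05):  # 50 ms connect timeout
        self.endpoints = list(endpoints)          # [(host, port), ...]
        self.timeout = timeout

    def connect(self):
        for i, (host, port) in enumerate(self.endpoints):
            try:
                sock = socket.create_connection((host, port),
                                                timeout=self.timeout)
                # Promote the working endpoint to the front so later
                # calls (and re-used connections) prefer it.
                self.endpoints.insert(0, self.endpoints.pop(i))
                return sock
            except OSError:
                continue  # timed out or refused: try the next balancer
        raise ConnectionError("all load balancers unreachable")
```

The short timeout is what makes this cheap: a dead load balancer costs one quick failed connect on the first call, and nothing afterwards.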


Conclusion

If you think a little harder, there are ways to drastically improve the availability of your application even when you are not in control of the underlying web services.
