Monday, March 31, 2008

A guide to getting 9s behind the decimal point using flaky web services

Working at Amazon.com, we obsess over SLAs for our services. However, the truth is that service-oriented architectures are much more susceptible to downtime than you think. Each critical service that a system depends on means more opportunity for failure as a whole. Coupled with the inherent unreliability of networks and hardware in general, this is a recipe for disaster. In this article, I will describe some non-intuitive techniques you can use to increase your uptime.

Problem Statement

Assume you are building a simple application that depends on 10 critical web services to complete a cash transaction. Each web service guarantees an uptime of 99.99% (an acceptable standard for most web applications today). Assuming statistical independence between the web services, your application will have an uptime of 99.99%^10 ~ 99.9%. That means 0.1% of your customers will get turned away. Demanding higher SLAs from your dependent web services is possible, but you will be hard pressed to find a network + load balancer + hardware + OS combination with an uptime greater than 99.999%. That means if you plan to have an uptime of more than 99.99%, there is NO way you can depend on more than 10 other web services. In a real situation where you may depend on 100 or more services, your uptime suddenly drops to 99%, which is absolutely unacceptable for businesses.
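The arithmetic above is easy to check. A quick sketch in Python, just to illustrate how fast the compounding eats your 9s:

```python
# Sketch of the compound-availability math from the problem statement:
# if an app needs all N independent dependencies up, and each has
# availability `a`, the best overall availability you can get is a**N.

def compound_availability(per_service: float, n_services: int) -> float:
    """Best-case availability of an app that needs all n services up."""
    return per_service ** n_services

for n in (10, 100):
    print(f"{n} dependencies at 99.99% each -> "
          f"{compound_availability(0.9999, n):.4%} overall")
```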

How do I build a highly available application if the web services I depend on are flaky?


4 Non-intuitive ways to increase uptime:


1. Client-Side Caching 

What? Caching? Many web services are read-only services that are cacheable on the client side. Fewer calls to the service decrease the possibility of running into random problems such as hosts rebooting, network drops, etc. In the toy example of your application calling 10 other services, assume that half of the services can have caching turned on with a 70% cache hit rate; your application's uptime would increase from 99.9% to 99.935%, a 35% reduction in downtime. In reality, most modern applications can achieve much higher hit rates due to larger caches.
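To see where 99.935% comes from: a cache hit never touches the service, so a cached dependency's effective unavailability is its miss rate times the service's unavailability. A quick sketch of that arithmetic (the numbers are the toy example's, not measurements):

```python
# Cache-hit arithmetic for the toy example: 5 services uncached at
# 99.99%, 5 services cached with a 70% hit rate (only the 30% of
# calls that miss are exposed to the service's 0.01% failure chance).

def effective_availability(service_availability: float, hit_rate: float) -> float:
    miss_rate = 1.0 - hit_rate
    return 1.0 - miss_rate * (1.0 - service_availability)

uncached = 0.9999 ** 5
cached = effective_availability(0.9999, 0.70) ** 5
print(f"overall uptime: {uncached * cached:.4%}")
```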

Client-side caching also improves latency, and high latency is one of the main reasons applications fail under load. When an application starts getting lots of hits, it is more likely to suffer situations such as thread starvation. By reducing latency, you free up your threads to accept more requests. Furthermore, "surges" of internet traffic are usually caused by many people trying to "do the same thing", so your cache hit rate increases too.

Moral of the story - investing in caching makes sense. 
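A client-side cache can start out as small as a dictionary with expiry times. A minimal sketch (the class and method names are mine, not from any particular library):

```python
# Minimal client-side TTL cache sketch: read-only service responses
# are kept for `ttl_seconds`, so repeat calls during a traffic surge
# never touch the flaky service.

import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        """Return a cached value, calling `fetch(key)` only on a miss."""
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch(key)  # the only line that can hit the remote service
        self._entries[key] = (now + self.ttl, value)
        return value
```

Entries older than the TTL get refetched; everything newer is served locally.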

2.  Fallback logic

Face it, it's better to have "dumbed-down" behavior than a user experience that ends in an error page. If a service fails, it's better to "fall back" to some simpler logic built into the client application. If fallback logic is used to calculate things that involve money, your application can note that a fallback was used and re-invoke the failed operation at a later time. If a correction needs to be made, you can notify the user retroactively and ask them to re-confirm their decisions. Although offering "client-side logic" to your clients defies the usual wisdom about web services, clients that are truly interested in high availability will see the value of being able to run in a "dumbed-down" way.
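A sketch of this fallback pattern: try the real service, and on failure fall back to a simpler local estimate while recording that a correction pass is needed later (the flat tax rate is a made-up placeholder, not a real rate):

```python
# Fallback sketch: if the tax service fails, return a dumbed-down
# flat-rate estimate and remember the order so it can be re-invoked
# (and the user re-confirmed) once the service recovers.

needs_reconciliation = []  # order ids to re-run against the real service

def calculate_tax(order_id, price, call_service):
    try:
        return call_service(price)
    except Exception:
        needs_reconciliation.append(order_id)  # correct retroactively later
        return round(price * 0.08, 2)          # made-up flat-rate fallback
```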

3. Avoid Enterprise Service Buses - Stick with HTTP + DNS + Hardware Load-balancer

Despite the fancy presentations by ESB providers, ESBs simply have not been proven to scale. If the web service you depend on insists on using an ESB, avoid it like the plague. Furthermore, proven standards mean you get a good choice of well-vetted technology that leverages them (Apache / Jetty / Tomcat / data center failover).

4. Use DNS Round Robin

The most likely single point of hardware failure is the load balancer. I repeat, the most likely single point of hardware failure is the hardware load balancer. Unless your company has a load of money to spend on software load balancers or client-side load-balancing research (clients collectively deciding how to route requests to separate boxes), a web service is probably better off having redundant load balancers with DNS entries pointing to both. Web-service clients should set their connection timeouts low (e.g. 50 ms) and re-use connections. In the event of a load-balancer failure, clients would time out trying to connect to the failed load balancer and succeed in making a connection to the second. Because connections are re-used, subsequent calls will use the working load balancer. Insisting that your web-service provider support DNS round-robin and redundant load balancers is usually a minor investment they are willing to make.
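The client-side behavior described here - a short connect timeout, trying each address behind the service's DNS name in turn, and keeping the first connection that succeeds - can be sketched with just the standard library (hosts and ports are placeholders):

```python
# Failover sketch: DNS round-robin gives the client several addresses;
# try each with a short connect timeout and return the first socket
# that connects. Re-using that socket keeps later calls on the
# load balancer that is actually up.

import socket

def connect_with_failover(addresses, timeout_seconds=0.05):
    """Return a socket to the first reachable (host, port) pair."""
    last_error = None
    for host, port in addresses:
        try:
            return socket.create_connection((host, port), timeout=timeout_seconds)
        except OSError as err:  # timed out or refused: try the next balancer
            last_error = err
    raise last_error
```

A real client would resolve the DNS name itself (e.g. via `socket.getaddrinfo`) rather than hard-coding the address list.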


Conclusion

By thinking a little harder, there are ways to drastically improve the availability of your application even if you are not in control of the underlying web-services.

Wednesday, March 19, 2008

Replacing data warehouses with grid-enabled scripts

I really don't like my data warehouse - it's slow, relies on way too much SQL voodoo, and is overburdened. Like many teams at Amazon, almost everyone has their own (Perl/Ruby) scripts that munge and spit out data. However, these scripts really only scale to about 1-2 gigs of data. Any more than that, and it's time to hit the data warehouse. After teams get fed up with their data warehouse, they build their own not-so-good data warehouse.

What we really need is the ability for a scripting language to process massive datasets on massive clusters that just "works" out of the box. Here are the requirements:
  • Easy to use - since most of the data munging is given to the intern or the developer who can't code but management doesn't have the guts to fire, it's gotta have a low ramp-up, maybe 1 or 2 days.
  • Cheap, reliable, redundant storage - you can't afford to lose data.
  • Scalable - CPU & storage - you don't want to invest more than 10 minutes of time to process 2x the amount of data.
  • Shared resource - your data is going to be reused by many people, so expect that the cluster is shared and your data is shared.
The Design:

I'm thinking of using S3 + Hadoop on EC2 + Groovy to accomplish this goal.
  • Map reduce + Groovy scripts - easy to use
  • S3 - reliable, redundant, cheap storage. It also supports on-demand replication, so if your dataset is a hot dataset, it will automatically be replicated to ensure availability
  • EC2 + Hadoop - it's free, 'nuff said
The idea is that all data always lives on S3. You write a Groovy script that does the data munging. When launching the script, it gets copied to the master of the cluster to be executed. Instead of reading/writing from local disk, you initially copy your data to your Hadoop cluster, churn away at the data, and put the results back into another S3 bucket. If your script needs to do multiple map-reduces (it probably will), the data remains in HDFS until the final map-reduce.
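The munging step itself boils down to a map and a reduce over records. A toy sketch in Python - Hadoop Streaming pipes records through any script's stdin/stdout, so the same shape would work from a Groovy script too (the function names are mine, and a real job would run these across the cluster rather than in one process):

```python
# Word-count as the classic map-reduce munging example: the mapper
# emits (key, 1) pairs per record, the reducer sums counts per key.

from collections import defaultdict

def map_words(line):
    """Mapper: one input record in, (word, 1) pairs out."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)
```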

S3 is infinitely scalable, so no worries about S3. For EC2, if your scripts are taking a long time to run, you just start up a few more instances of EC2.

If everything is "in the cloud", it offers exciting opportunities to outsource these types of projects too, or to get paid for creating useful datasets. DevPay has quite a few exciting opportunities.

Hadoop Groovy XML Example

I've been following the Hadoop project closely. Hadoop is an open source implementation of MapReduce. However, it's been frustrating me that Hadoop doesn't make it easy to "run a script" on large sets of XML.

So by somewhat popular demand - example of how to use groovy with Hadoop and XML.

This example does two things:
- shows how to use MapReduce with Groovy
- shows how to parse XML input sources correctly using a StAX parser

Download the example :
http://s3.amazonaws.com/groovybucket/GroovyHadoopExample.tar.gz

Monday, March 17, 2008

Resources can be functions too

When implementing REST-based systems, REST purists balk at the idea of sending data through a POST and getting a response. They say "THAT'S NOT REST - GO USE SOAP INSTEAD".

My response is that resources don't have to be static pieces of data, just as variables can hold closures. Take for example the resource "http://myresource/tax-calulator/calculate". This resource accepts requests containing both the price and the destination state, and returns the calculated tax amount.

By embracing the fact that functions can be resources, you can do really interesting things. For example, let's say you want to create a resource that calculates taxes in NJ. You could simply create a new resource "http://myresource/tax-calulator/NJ/calculate" that always passes NJ as the destination to "http://myresource/tax-calulator/calculate". This paradigm does not violate REST at all.
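A sketch of the idea in Python (the function names, the URL-to-function mapping, and the tax rates are all illustrative, not a real API): the NJ resource simply fixes the state parameter and delegates to the general one.

```python
# "Resources as functions": each function stands in for the handler
# behind a URL. The NJ resource partially applies the general one.

RATES = {"NJ": 0.07, "CA": 0.08}  # made-up rates, for illustration only

def calculate(price, state):
    """Handler for POST http://myresource/tax-calulator/calculate"""
    return {"tax": round(price * RATES[state], 2)}

def calculate_nj(price):
    """Handler for POST http://myresource/tax-calulator/NJ/calculate"""
    return calculate(price, "NJ")
```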

I think once the REST community wraps its mind around the fact that resources can be functions too, REST can effectively replace other frameworks such as SOAP or XML-RPC.

Sunday, March 16, 2008

Rethinking Digital Security

This entry offers an alternative to how we think about security. The idea actually originated from Christine Komatsu - a design student working on a project to rethink how we perceive the basic notions of locks and keys.

Digital Security today and why it is flawed:


There are 3 components that all computer systems rely on to enforce security:
  1. Authentication - this is usually the first step: a computer needs to know that you are you.
  2. Authorization - OK, I've verified you are you. Computer systems now need to control what resources a user can or cannot access. Users must also be able to "delegate" access to resources they own to others.
  3. Private communication - in order not to compromise authentication or authorization, computers must have a secure and anonymous mechanism to communicate with users.
However, this model is nothing like how we secure real-world systems such as your house. A house has no concept of your identity or authorization; it only knows about locks and keys. If you are afraid you might lose your key, you duplicate it and give the copy to a friend. If you need more security, you buy a bigger and stronger lock. If your house has been compromised, you change your locks and get new keys.

The reasons we might want to think about security this way:
  1. People who want to compromise your resource will focus on stealing your key, not your identity. If your resource is compromised, you don't lose your identity.
  2. Keys can be replicated and distributed - identity cannot. Even if you delegate resource access to someone else, if your identity has been compromised, your friend cannot re-delegate control of the resource back to you.
  3. Many locks of different strengths are much more desirable than one key that can open all doors.
  4. It's hard for users to think about having multiple identities - it's easier to think in terms of owning multiple keys.
  5. Delegation models are extremely difficult for the average person to think about.
  6. People are leery of single sign-on - if digital security is based on identity, and your identity is compromised, all your resources are open to thieves.
I believe that if we rethink digital security in terms of locks and keys, users can strike a better balance between accessibility, manageability, and security.


How such a system would work:


In this new way of thinking about security, there are four main activities:
  1. Creating new keys
  2. Creating new locks
  3. Putting locks on resources
  4. Opening & Closing a lock
Creating new keys

The idea is that every person would have one or more keys. Keys come in many different forms - they can be a password, an RFID tag attached to your shoe, a physical device with an encrypted password (e.g. an ID card), or a rotating token key. Keys can thus be created on a computer (e.g. a password), bought in a store (e.g. RFIDs), and so on.

Creating new locks

Users should be able to create digital locks with the following properties:
  1. They can require more than one key to be opened - to enhance security, a user may want to require more than one key to open a lock. E.g. a user may want the lock on their door to be protected by both a key card and the RFID embedded in their shoe. Since most people in North America wear shoes, this ensures that a thief cannot enter your house by stealing just your wallet - they would have to steal your shoe as well.
  2. They can be opened with multiple combinations of keys - to enable recovery from lost keys, a user can specify that a different combination of keys (e.g. 2 different key cards given to two different friends) can open the lock. This way, if the user loses their own keys, they can retrieve the 2 key cards from their friends to open the door. Neither friend can open the door alone, so a friend losing a key is not a concern.
  3. Locks can be created anywhere - to improve security, resources should be able to create locks themselves. This requires that the keys be available to the resource at the time the lock is created. Instead of having a central "lock factory" that could be compromised - keys stolen, locks duplicated, etc. - having the lock created on the resource itself increases security. For example, a user may want the lock to be created on the safe itself, not somewhere on the internet.
  4. Locks know the IDs of the keys that can open them
  5. Locks have a unique, publicly visible ID - all locks should be identifiable so that the user can remember which keys open them.
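Properties 2, 4, and 5 above can be sketched in a few lines (the class and field names are mine, not from the post): a lock stores several acceptable key combinations and opens when any one combination is fully presented.

```python
# Lock-model sketch: a lock has a public id, knows the ids of keys
# that can open it, and accepts any one of several key combinations.

class Lock:
    def __init__(self, lock_id, combinations):
        self.lock_id = lock_id  # publicly visible id (property 5)
        self.combinations = [frozenset(c) for c in combinations]

    def opens_with(self, presented_key_ids):
        """True if any acceptable combination is fully presented."""
        presented = set(presented_key_ids)
        return any(combo <= presented for combo in self.combinations)

front_door = Lock("lock-1", [
    {"KC_ID 1", "RF_ID 1"},   # e.g. family key card + shoe RFID
    {"KC_ID 2", "KC_ID 3"},   # e.g. recovery: two friends' cards together
])
```
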
Adding/removing locks from resources

Putting locks on resources is like putting locks on doors, except that digital locks can be put on multiple resources. Resources must abide by the following rules:

1. Resources can require multiple locks for access
2. The number of locks required to access a resource must be set while the resource is open.
3. Resources can copy locks from other resources
4. Resources can have locks added / removed only while the resource is open.
5. Resources can further require that locks be added / removed only with a certain key combination.

Opening and closing locks:

At the time a lock is opened:
  1. The keys and the lock must have a secure mechanism to communicate with each other. This is necessary to prevent possible duplication of keys.
  2. Opening the lock is user-initiated, to prevent accidental opening.
Some use cases:

1. Securing the house for a family of 4
  • Mom goes out to purchase 6 key cards and 4 RFIDs, and creates 1 password key
    • 4 of the 6 key cards are identical (KC_ID 1)
    • 2 of the key cards are unique (KC_ID 2 & 3)
    • 4 RFIDs are identical (RF_ID 1)
    • Password key (PW_ID 1)
  • Mom creates a lock on her iPhone that can be opened with 3 key combinations
    • KC_ID 1 & RF_ID 1
    • PW_ID 1 & RF_ID 1
    • KC_ID 2 & KC_ID 3
  • Mom hands out the keys
    • Each family member gets a KC_ID 1 card and places an RFID in their shoe
    • Each family member is told the password and told to not tell anyone else.
    • KC_ID 2 & 3 are handed out to good friends, who are told never to admit to having the key to other friends unless the entire family dies
  • Mom copies the digital lock from her iPhone to all 4 doors of the house
  • Mom removes the digital lock from her iPhone
  • Pros:
    • Family members will be able to get into the house, even if they forget their key card. If they are really stupid and forget their shoe or password, the family member can just wait for another family member
    • Family members can contact friends to get KC_ID 2 & 3 in the event that they cannot wait for another family member.
    • Family members will know really soon if their house is compromised - there is very little chance that a robber will steal both the key cards and the shoes without them knowing it.
2. Securing the Internet in public and private places.
  • Sally has a key card (KC_ID 1) and a secret password for herself (PW_ID 1). Sally has an RFID (RF_ID 1) in her shoe. Sally's monitor at home has a key card built in (KC_ID 2).
  • Sally creates a lock that can be opened with 2 key combinations:
    • KC_ID 1 & RF_ID 1 & PW_ID 1 - public internet access
    • KC_ID 2 & PW_ID 1 - private internet access (upstairs bedroom)
  • Sally puts the lock on all websites where she has private information
  • In public, Sally usually has her wallet (with KC_ID 1), shoes, and usually remembers her password. Sally is able to access the internet outside.
  • In private, Sally doesn't wear her shoes and frequently leaves her wallet downstairs. Because KC_ID 2 is built right into her monitor, she can still surf the internet at home.
  • Pros:
    • The password isn't the only thing protecting you - even if your password is compromised outside (e.g. by a keyboard logger), the thief still needs Sally's shoes and wallet
    • Sally's home computer is protected by a door. Even if a thief breaks into the house, they need to know Sally's password to access her internet account.
Conclusion:

By rethinking digital security using locks and keys, I believe that we can simultaneously improve usability, manageability, and security in this digital age.