Wednesday, March 19, 2008

Replacing data warehouses with grid-enabled scripts

I really don't like my data warehouse - it's slow, relies on way too much SQL voodoo, and is overburdened. Like many teams at Amazon, almost everyone has their own (Perl/Ruby) scripts that munge data and spit out results. However, these scripts really only scale to about 1~2 gigs of data. Any more than that, and it's time to hit the data warehouse. After teams get fed up with their data warehouse, they build their own not-so-good data warehouse.

What we really need is the ability for a scripting language to process massive datasets on massive clusters that just "works" out of the box. Here are the requirements:
  • Easy to use - since most of the data munging is handed to the intern, or to the developer who can't code but management doesn't have the guts to fire, it's gotta have a low ramp-up, maybe 1 or 2 days.
  • Cheap, reliable, redundant storage - you can't afford to lose data.
  • Scalable - CPU & storage - you shouldn't have to invest more than 10 minutes of effort to process 2x the data.
  • Shared resource - your data is going to be reused by many people, so expect that the cluster is shared and your data is shared.
The Design:

I'm thinking of using S3 + Hadoop on EC2 + Groovy to accomplish this goal.
  • Map-reduce + Groovy scripts - easy to use
  • S3 - reliable, redundant, cheap storage. It also supports on-demand replication, so if your dataset is hot, it will automatically be replicated to ensure availability
  • EC2 + Hadoop - it's free, 'nuff said
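The "easy to use" claim is easiest to see with a concrete example. Here is a word count written as plain Java that runs the classic map / shuffle / reduce phases in memory; the class and method names are my own invention, not Hadoop's API, but this is the same shape of logic you would drop into a Hadoop job (or the Groovy script wrapping one).

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Reduce phase: collapse all the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: the "shuffle" groups map output by key, then each group
    // is handed to reduce. On a cluster this grouping spans machines.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }
}
```

The `run()` method plays the role the framework would play: everything you actually write is the two small functions at the top, which is what keeps the ramp-up short.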
The idea is that all data always lives on S3. You write a Groovy script that does the data munging. When you launch the script, it gets copied to the master of the cluster to be executed. Instead of reading/writing from local disk, you initially copy your data onto your Hadoop cluster, churn away at it, and put the results back into another S3 bucket. If your script needs to do multiple map-reduces (it probably will), the intermediate data remains in HDFS until the final map-reduce.
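That workflow can be sketched end to end. In the snippet below, S3 and HDFS are stood in for by in-memory maps and every bucket and path name is made up; the point is only the shape of the pipeline: copy in from S3 once, chain passes whose intermediate output stays on the cluster, and copy just the final result back out.

```java
import java.util.*;
import java.util.function.Function;

// Sketch of the proposed S3 -> HDFS -> S3 workflow. Storage systems are
// simulated as in-memory maps (location -> dataset); all names are hypothetical.
public class PipelineSketch {
    static Map<String, List<String>> s3 = new HashMap<>();
    static Map<String, List<String>> hdfs = new HashMap<>();

    static void copyS3ToHdfs(String bucket, String path) {
        hdfs.put(path, new ArrayList<>(s3.get(bucket)));
    }

    static void copyHdfsToS3(String path, String bucket) {
        s3.put(bucket, new ArrayList<>(hdfs.get(path)));
    }

    // One "map-reduce" pass: reads one HDFS dataset, writes another.
    static void mapReducePass(String in, String out,
                              Function<List<String>, List<String>> job) {
        hdfs.put(out, job.apply(hdfs.get(in)));
    }

    public static void main(String[] args) {
        s3.put("input-bucket", Arrays.asList("b", "a", "b"));
        copyS3ToHdfs("input-bucket", "/data/raw");
        // First pass: sort and dedupe; output stays in HDFS.
        mapReducePass("/data/raw", "/data/unique",
                xs -> new ArrayList<>(new TreeSet<>(xs)));
        // Second pass: transform; only this final output returns to S3.
        mapReducePass("/data/unique", "/data/final",
                xs -> { List<String> r = new ArrayList<>();
                        for (String x : xs) r.add(x.toUpperCase());
                        return r; });
        copyHdfsToS3("/data/final", "output-bucket");
    }
}
```

Note that the dataset crosses the S3/cluster boundary exactly twice no matter how many passes you chain, which is what makes multi-step jobs cheap.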

S3 is effectively infinitely scalable, so no worries there. For EC2, if your scripts are taking too long to run, you just start up a few more instances.
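To illustrate why starting more instances helps, here is a toy scale-out where each "instance" is a thread and the input splits evenly across however many you start; a real cluster adds network and shuffle overhead, but the principle is the same. All names here are my own.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy scale-out: each "instance" is a thread; in the real design each
// would be another EC2 node running the same script.
public class ScaleOutSketch {
    static long countWords(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Split the input into one contiguous slice per worker.
            int chunk = (lines.size() + workers - 1) / workers;
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < lines.size(); i += chunk) {
                List<String> slice =
                        lines.subList(i, Math.min(i + chunk, lines.size()));
                parts.add(pool.submit(() -> {
                    long n = 0;
                    for (String line : slice)
                        n += line.trim().split("\\s+").length;
                    return n;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Doubling the workers roughly halves the wall-clock time as long as the work splits cleanly, which is why "10 minutes of effort for 2x the data" is a plausible target.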

If everything is "in the cloud", it also offers exciting opportunities to outsource these types of projects, or to get paid for creating useful datasets. DevPay has quite a few exciting possibilities.
