What we really need is a scripting language that can process massive datasets on massive clusters and just "work" out of the box. Here are the requirements:
- Easy to use - since most of the data munging gets handed to the intern or to the developer who can't code but management doesn't have the guts to fire, it's got to have a low ramp-up, maybe one or two days.
- Cheap, reliable, redundant storage - you can't afford to lose data.
- Scalable - both CPU and storage - you don't want to invest more than 10 minutes of effort to process twice the data.
- Shared resource - your data will be reused by many people, so expect both the cluster and the data to be shared.
I'm thinking of using S3 + Hadoop on EC2 + Groovy to accomplish this goal.
- MapReduce + Groovy scripts - easy to use.
- S3 - reliable, redundant, cheap storage. It also replicates on demand, so if your dataset is hot, it gets replicated automatically to keep it available.
- EC2 + Hadoop - Hadoop is free and open source, and EC2 is cheap pay-as-you-go compute, 'nuff said.
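To make the "easy to use" claim concrete, here is a minimal single-process sketch of the MapReduce model in Java. This is not Hadoop's actual API; the class and method names are illustrative, and Hadoop would run the same map and reduce logic in parallel across the cluster with the framework doing the shuffle for you.

```java
import java.util.*;

// Word count, the canonical MapReduce example, simulated in one process.
public class WordCountSketch {

    // "map" phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "reduce" phase: sum all the counts emitted for one key.
    static int reduce(String key, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "the quick brown fox",
            "the lazy dog");

        // "shuffle" phase: group mapped pairs by key, as the framework
        // would do between the map and reduce stages.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        // Print each word with its total count, one per line.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

The appeal of the model is that you only write `map` and `reduce`; partitioning the input, moving intermediate data, and retrying failed tasks are the framework's problem, which is exactly why a scripting-friendly front end like Groovy on top of it keeps the ramp-up short.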
S3 scales effectively without limit, so storage isn't a worry. For compute, if your scripts are taking too long to run, you just start up a few more EC2 instances.
If everything lives "in the cloud", it opens exciting opportunities to outsource these kinds of projects, or to get paid for creating useful datasets. Amazon DevPay offers quite a few possibilities here.