I've been following the Hadoop project closely. Hadoop is an open source implementation of MapReduce. However, it's been frustrating me that Hadoop doesn't make it easy to "run a script" on large sets of XML.
So, by somewhat popular demand, here is an example of how to use Groovy with Hadoop and XML.
This example does two things:
- shows how to use MapReduce with Groovy
- shows how to parse XML input sources correctly using a StAX parser
Download the example:
http://s3.amazonaws.com/groovybucket/GroovyHadoopExample.tar.gz
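For readers unfamiliar with StAX: it's a pull parser, which makes it a much better fit than DOM for the large XML inputs Hadoop typically chews through. The real code is in the tarball above; the following standalone sketch (not taken from the tarball — the class and method names are mine) just shows the basic StAX event loop, counting occurrences of an element:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxCount {
    // Count occurrences of a given element name in an XML document.
    public static int countElements(String xml, String name) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                // next() pulls the next event; we only care about element starts
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(name)) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><entry>a</entry><entry>b</entry></log>";
        System.out.println(countElements(xml, "entry"));
    }
}
```

Because the parser pulls one event at a time, memory use stays flat no matter how large the input record is.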
2 comments:
I don't see any groovy source in your example.
Did you mean to have some?
Also, I have implemented another groovy framework for hadoop. I am curious if you would like to work together on helping groovy work with hadoop. My goal is to be able to write simple programs simply.
Here is the read-me:
OVERVIEW
Grool is a simple extension to the Groovy scripting language. It is intended to make
it easier to use Hadoop from a scripted environment and to make it possible to write
simple map-reduce programs in a functional style with much less boilerplate code than
is typically required. Essentially, the goal is to make simple programs simple to write.
For instance, the venerable word-count example boils down to the following using grool:
count = Hadoop.mr({key, text, out, reporter ->
    text.split(" ").each {
        out.collect(it, 1)
    }
}, {word, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(word, sum)
})
When we call the function count, it will determine whether its input is local or already
in the Hadoop file system and will invoke a map-reduce program. The location of the output
is returned packaged in an object that can be passed to other map-reduce functions like count
or read directly using familiar Groovy code constructs. For instance, to count some literal
text and print the results, we can do this:
count(["this is some test", "data for a simple", "test of a program"]).eachLine {
    println(it)
}
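For comparison, stripped of the Hadoop plumbing, the two closures compute nothing more exotic than the following (plain Java, names mine — the map step emits a 1 per word, the reduce step sums per word):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    // Same logic as the two closures: split each line, emit 1 per word ("map"),
    // then sum the emitted 1s per word ("reduce"). Here the phases are fused.
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {   // the map step
                sums.merge(word, 1, Integer::sum);  // the reduce step
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("this is some test", "test of a program")));
    }
}
```

What grool buys you is that the same two pieces of logic run distributed, with the shuffle between the map and reduce phases handled by Hadoop.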
As an example of composition of functions, we can write a simple variant of the word
counting program that counts the prevalence of counts in each decade (1-10, 10-100
and so on). This can be written as follows:
decade = Hadoop.mr({key, text, out, reporter ->
    // base-10 log, so counts of 1-9 land in decade 0, 10-99 in decade 1, and so on
    out.collect(Math.floor(Math.log10(new Integer(text.split("\t")[1]))), 1)
}, {decade, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(decade, sum)
})
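The decades described (1-10, 10-100, and so on) correspond to floor(log10(n)) of the count — note that Math.log, the natural log, would bucket by powers of e instead. A standalone sketch of just the bucketing function (name mine):

```java
public class Decade {
    // Decade bucket for a positive count: 1-9 -> 0, 10-99 -> 1, 100-999 -> 2, ...
    public static int bucket(int n) {
        return (int) Math.floor(Math.log10(n));
    }

    public static void main(String[] args) {
        // prints "0 1 2"
        System.out.println(bucket(5) + " " + bucket(42) + " " + bucket(500));
    }
}
```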
These two programs can be composed and the result printed using familiar functional style:
decade(count(text)).eachLine {
    println it
}
When we are done, we can clean up any temporary files in the Hadoop file system using the
following command:
Hadoop.cleanup()
If we were to write code that needs line-by-line debugging, we can change any invocation of
Hadoop.mr to Local.mr and the code will execute locally rather than using Hadoop. Local
and Hadoop based map-reduce functions can be intermixed arbitrarily and transparently.
KNOWN ISSUES and LIMITATIONS
As it stands, grool is useful for some simple tasks, but it badly needs extending. Some
of the limitations include:
- all input is read using TextInputFormat (easily changed)
- combiners, partition functions and key sorting aren't supported yet (easily changed)
- there is essentially no documentation and very little in the way of test code. The
current code is more of a proof of concept than a serious tool.
- the current API doesn't allow multiple input files beyond what a single glob expression
can specify (the current argument parsing code is fugly and should be replaced)
- there is some speed penalty due to the type conversions performed at the Java/Groovy
interface and because Groovy code can be slower than pure Java. Currently the penalty
appears to be at most about 2-3x for applications like log parsing. In my experience, this
is less than the cost of using, say, Pig. (improving this may be very difficult to do
without sacrificing the simplicity of writing grool code, but you can provide a Java
mapper or reducer if you like, to avoid essentially all of this overhead)
- there are bound to be several horrid infelicities in the API as it stands that make grool
really hard to use in important cases. This is bound to lead to significant and incompatible
changes in the API. Chief among the suspect decisions is the fact that Hadoop.mr returns
a closure instead of an object with interesting methods.
HOW IT WORKS
The fundamental difficulty in writing something like grool is that it is hard to write a
script that executes code remotely without expressing the remote code as a string. If you
do that, then you lose all of the syntax checking that the language provides.
I wanted to use a language like Python or Groovy, but I wasn't willing to give up being
able to call the map function easily for testing before composing it into a map-reduce
function. Unfortunately, languages like Groovy and Python don't provide access to the
source of functions or closures and, at least in Groovy's case, these functions may
refer to variables outside of the scope of the function which would mean that the text
of the function wouldn't make sense on the remote node anyway.
The solution used by grool to get around this difficulty is to execute the ENTIRE script
multiple times on different machines. Depending on where the script is executing, the
semantics of the functions being executed differs. When the script is executed initially
on the local machine, it primarily copies data to and from the Hadoop file system,
decides what the names of temporary files should be, configures Hadoop jobs and invokes
them. When executed by a mapper configure method, the script executes all initialization
code in the script and then saves references to all of the mapper functions defined in the
script so that the map method can invoke the correct function. Similar actions occur
in the configure method of each reducer.
This can be a bit confusing because references to external variables don't really refer
to the same variables. In addition, some code only makes sense to execute in the correct
environment. Mostly, this involves code such as writing results, which is handled by a
small hack where all map-reduce functions return a reference to a null file when executed
by mappers or reducers. That makes any output loops such as in the simple examples above
execute only a single time. Some other code may require more care. To help with those
cases, there is a function Hadoop.local that accepts a closure to be executed only on
the original machine.
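The Hadoop.local idea can be sketched in a few lines; the names below are illustrative, not grool's actual internals. The whole script runs on every machine, but a closure passed to local() only fires where a role flag says this is the original invocation:

```java
import java.util.ArrayList;
import java.util.List;

public class Roles {
    // Which copy of the script is running: the original driver, a mapper, or a reducer.
    enum Role { DRIVER, MAPPER, REDUCER }

    static Role role = Role.DRIVER;  // in practice, set differently on each machine

    static List<String> log = new ArrayList<>();

    // Run the block only on the original machine, like grool's Hadoop.local.
    static void local(Runnable block) {
        if (role == Role.DRIVER) {
            block.run();
        }
    }

    public static void main(String[] args) {
        role = Role.MAPPER;
        local(() -> log.add("skipped on mappers"));
        role = Role.DRIVER;
        local(() -> log.add("runs on the driver"));
        System.out.println(log);
    }
}
```

The same trick — one flag, role-dependent behavior — is what lets the null-file hack above quietly turn output loops into no-ops on the mappers and reducers.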
Less serious than the issue of how to move closures to far-away nodes is the issue of storage
transparency. It is important for simplicity that the user not be too concerned with where
their data is. If it is local and needs to be distributed, that should be made so as if
by magic. Likewise, if function results are distributed and they are desired locally, the
data should be moved behind the scenes. The way this is done in grool is that whenever
distributed storage is needed, a location in /tmp is picked, essentially at random. These
results can be renamed to whatever permanent location is appropriate by the programmer. This
does lead to the accumulation of goo in the /tmp directory, but scripts can and should delete
any temporary files on exit. Since the grool framework remembers all of the things it created,
this is pretty easily done.
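The bookkeeping behind Hadoop.cleanup() amounts to remembering every temporary path handed out. A minimal sketch of that idea — class and method names are hypothetical, and plain Java collections stand in for the Hadoop FileSystem API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class TempTracker {
    private final List<String> created = new ArrayList<>();

    // Hand out a fresh, essentially random path under /tmp and remember it.
    public String newTempPath() {
        String path = "/tmp/grool-" + UUID.randomUUID();
        created.add(path);
        return path;
    }

    // Forget everything we handed out, returning how many paths there were;
    // grool would additionally delete each path from the Hadoop file system.
    public int cleanup() {
        int n = created.size();
        created.clear();
        return n;
    }
}
```

Because every temporary location flows through one allocator, cleanup never has to guess what goo in /tmp belongs to this script.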
HOW TO RUN
I recommend you use IDEA from JetBrains (often called IntelliJ). I have provided an ant build
file as well, but have not tested it. The result of compiling this code should be a jar file
called grool.jar that can be executed using hadoop -jar grool.jar. This program expects a
single additional argument, which is the name of your script. If you want to use the debugger
in IDEA, you can step into your groovy code without using Hadoop.
There are probably bunches of problems you will run into in trying to do this. Please contact
me for help at tdunning@veoh.com.
LICENSING
All of this code is copyright Veoh Networks and is licensed under an Apache style license.
Please contribute back changes and improvements under the same terms. Any source files that
do not contain a copyright notice should be considered to have the following notice attached
to them:
Copyright 2008 Veoh Networks
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The source is actually in S3; you can download it from these locations:
http://s3.amazonaws.com/groovybucket/map.groovy
http://s3.amazonaws.com/groovybucket/reduce.groovy