
Welcome to the gbin wiki!

Hadoop

Every second of every day, millions upon millions of facts and data points are generated and stored somewhere. As our capabilities for tracking and storing this data become more powerful, there is an equal need to analyze it and produce actionable information from it. From this, the term "Big Data" was coined. Hadoop is a software framework with a funny-sounding name that was built to help analyze these enormous data sets and produce some useful nuggets of information.

MapReduce

Although Hadoop encompasses many things, the most central concept is an algorithm called MapReduce. MapReduce processes data by doing all of its work in two stages. The Map stage simply takes a key/value pair and outputs zero or more other key/value pairs. The Reduce stage collects one or more of those values for a single key and produces another key/value pair. By way of example, let's say you're figuring out the lunch order for your office. In the Map stage, you would ask each person what they want to eat: the person is the key, and the sandwich they want is the output of the Map function:

Person Sandwich Type
Keith Turkey
Bill Ham
Jeff Turkey
Jana Turkey
Bryce Ham

Now, we need to know how many sandwiches of each type to order. That's where our Reduce function comes into play. A process called the shuffler or partitioner looks at the output of the Map function (the sandwiches), groups them together, and then provides those grouped values to a Reducer. In this example, our Reducer simply counts the number of sandwiches it receives. So our Reducer would be called as follows:

Reduce(Turkey, Turkey, Turkey) -> Returns 3

Reduce(Ham, Ham) -> Returns 2

Finally, the outputs of the Reduce function are collected, and we've generated the following lunch order for our office:

Sandwich Count
Turkey 3
Ham 2
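
For a preview of where we're headed, here's a minimal sketch of the same lunch tally written in Pig Latin, the tool we'll use in the practical example later in this article. It assumes a made-up tab-delimited file called lunch_orders.txt with one (person, sandwich) pair per line; the GROUP statement plays the part of the shuffler, and COUNT is our Reducer:

-- load the hypothetical order file as (person, sandwich) tuples
orders = LOAD 'lunch_orders.txt' USING PigStorage('\t') AS (person:chararray, sandwich:chararray);
-- group the orders by sandwich type (the shuffle/partition step)
by_sandwich = GROUP orders BY sandwich;
-- count the orders in each group (the Reduce step)
counts = FOREACH by_sandwich GENERATE group AS sandwich, COUNT(orders) AS total;
DUMP counts;

Don't worry about the syntax yet; we'll walk through Pig step by step below.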

I admit, it's not a terribly impressive example. "Wow, a fancy lunch ordering algorithm," you're thinking. "Surely, we live in an age of technological wonders." But here's the really neat thing about MapReduce: it lends itself very well to problems that are embarrassingly parallel.

Parallel Processing

Embarrassingly parallel is a term coined for problems where one big job can be broken down into many separate smaller ones. Those smaller problems can be worked on in parallel, and then all of the results can be combined into the final answer. These types of problems usually require very little communication between the "workers" doing the work. Back to our lunch example: you could have one person take all of the orders on the first floor of the office, and another person collect the orders on the second floor. Finally, you combine the order counts of each floor to get your final order. Because these types of problems can be worked on in parallel, the work can be split up among different cores of a single machine, or separate machines altogether. And that's where we tie it back to Big Data. When you are attempting to process billions or trillions of data points, it suddenly becomes much more manageable when you can split the processing up among dozens or hundreds of machines. Combined with virtualization and cloud computing, there isn't much that can't be done.

A Practical? Example

Most of the online Hadoop or MapReduce tutorials use the textbook word count example. We're going to do something similar, but try to have a little fun with it. Over 200,000 questions from Jeopardy! are available online in several different formats. I thought it would be interesting to take the available Jeopardy! data and mine it for the list of topics that come up most frequently. Let's say you wanted to try out for the show and needed to brush up on your geography. What you would want to know is which countries come up most often in the show, so you know where to best devote your study time. You could imagine similar analysis for most-mentioned historical figures, notable lakes and rivers, movie quotes, etc. Granted, we're only talking 200,000 records, but the idea isn't necessarily to work on a Big Data problem right away; it's to show something that can scale up to Big Data.

The first thing we'll need to do is set up our Hadoop platform. Now, you can certainly download and set this up yourself, but that would be a separate article in itself. Instead, I'd recommend going through one of the many Hadoop providers out there. Cloudera, Google Cloud Platform, and Amazon (http://aws.amazon.com/elasticmapreduce/) all have solutions, but for this I picked Hortonworks. Hortonworks provides a ready-made VM called Hortonworks Sandbox that can be imported into VirtualBox or VMware. Essentially, you can have a Hadoop instance running locally in a VM within 10 minutes or so. It even includes a bunch of web-accessible tools to abstract away some of the HDFS and YARN internals of Hadoop. So, to get started, download the sandbox and follow the installation instructions. I'll wait.

All done! Great. You should be able to open up a browser and point it to http://127.0.0.1:8000/. If all is well, you'll see the following:

Sandbox Start

Secondly, we'll need some data. I've taken the Jeopardy! data and converted it to a tab-delimited file that can be found here. The file contains the dollar value of the question, the question itself, and the answer. In addition, I've generated a file based on data found at Geonames that contains a list of major cities and the countries they belong to.

Using the Sandbox browser tool, upload the two files. Your console should look like this:

File Browser

Next, we'll convert these files into tables that can be read into Hadoop (and eventually MapReduced!)

To do so, click on the HCat icon, and choose the Create a new table from a file option. Name the table jeopardy, using the jeopardy_tabbed.txt file you uploaded earlier. Repeat for the city_country_xref.csv file, naming the table city_country_xref. The console will detect the column headings and delimiters for you. You should now see this:

Tables

Now we get to the fun stuff. We're going to use something called Pig, and a scripting language called Pig Latin (this is why you don't let developers come up with names), to declare our MapReduce jobs. Pig provides functions to do most of the typical MapReduce-y operations without having to write the actual Map and Reduce functions yourself, which is usually done in Java. Of course, the option to write your own low-level functions is always there, something we may explore in a follow-up article.

So click on the Pig icon, and click on "new script", giving it an appropriate title. Now let's start writing our job:

j_data = LOAD 'default.jeopardy' USING org.apache.hcatalog.pig.HCatLoader();

value question answer
1000 This city never sleeps Chicago

This loads our table into an alias called j_data. Each item in j_data will be a tuple consisting of (value, question, answer).

x_ref = LOAD 'default.city_country_xref' USING org.apache.hcatalog.pig.HCatLoader();

city country population
Chicago United States 2695598

This loads our city/country xref table into x_ref. Each item in x_ref will be a (city, country, population) tuple.
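
A quick tip while building scripts like this: Pig can show you the schema of any alias, and a LIMIT plus DUMP is a cheap way to eyeball a few rows without printing everything. For example (just a sanity check, not part of the final script):

-- print the schema Pig inferred for the loaded table
DESCRIBE x_ref;
-- peek at a handful of rows
x_ref_sample = LIMIT x_ref 5;
DUMP x_ref_sample;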

qa = FOREACH j_data GENERATE CONCAT(question, CONCAT(' ', answer)) AS question_answer;

question_answer
This city never sleeps Chicago

Because we want to analyze both the question and the answer for keywords, we concatenate them into a single string. The GENERATE clause is used to make our new tuple (question_answer). This is an example of a Map step: we've mapped a (question, answer) tuple into (question_answer).

words = FOREACH qa GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;

This is another Map step. This time, we are tokenizing our string into individual words: TOKENIZE splits the string into a bag of words, and FLATTEN turns that bag into a separate tuple for each word. So our Map function is actually producing multiple tuples for a single input.

word
This
city
never
sleeps
Chicago
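
One caveat worth flagging before the join: TOKENIZE splits on whitespace, so punctuation stays attached to words, and a token like "Chicago?" won't match "Chicago" in our xref table. The script below ignores this to keep things simple, but if matches seem to be going missing, a cleanup pass could be inserted here, something like (illustrative only, not used in the final script):

-- strip non-letter characters so trailing punctuation doesn't spoil the join (illustrative only)
clean_words = FOREACH words GENERATE REPLACE(word, '[^a-zA-Z]', '') AS word;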

joined = JOIN words BY word, x_ref BY city;

This is where things get interesting. We are using the xref tuples we loaded earlier to find the keywords we care about. Words that don't have a match in city_country_xref are effectively thrown out. Here is our new tuple (word, city, country, population), created by the join:

word city country population
Chicago Chicago United States 2695598

countries = FOREACH joined GENERATE country;

Throw away the parts we don't need, leaving only a (country) tuple.

country
United States

country_groups = GROUP countries BY country;
result = FOREACH country_groups GENERATE COUNT(countries), group;
sorted = ORDER result BY $0 DESC;
top_n = LIMIT sorted 20;

These last steps do the "reduce" side of the work: GROUP collects the (country) tuples by country, COUNT tallies how many tuples landed in each group, ORDER sorts those counts in descending order, and LIMIT keeps just the top 20.

Putting it all together, here is our final script:

j_data = LOAD 'default.jeopardy' USING org.apache.hcatalog.pig.HCatLoader();

x_ref = LOAD 'default.city_country_xref' USING org.apache.hcatalog.pig.HCatLoader();
qa = FOREACH j_data GENERATE CONCAT(question, CONCAT(' ', answer)) AS question_answer;

words = FOREACH qa GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
joined = JOIN words BY word, x_ref BY city;

countries = FOREACH joined GENERATE country;
country_groups = GROUP countries BY country;
result = FOREACH country_groups GENERATE COUNT(countries), group;
sorted = ORDER result BY $0 DESC;
top_n = LIMIT sorted 20;
DUMP top_n;
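
DUMP prints the results to the console, which is handy for a quick look. If you'd rather keep the output around, Pig's STORE statement writes a relation out to HDFS instead; the path below is just an illustration:

-- write the top 20 countries to HDFS as a tab-delimited file (illustrative path)
STORE top_n INTO 'jeopardy_top_countries' USING PigStorage('\t');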
