DNSC 6920 - Big Data - Summer 2017
Prof. Marck Vaisman
Assignment # 1
Due Sunday June 18, 2017 at 11:59pm
25 points
In this first assignment, you will get used to working with AWS, the command line, Hadoop and MapReduce. You will start an EMR cluster, list files, clone repositories, run a MapReduce job, put data into your cluster, and more. The tools you will be using and the skills you will be developing in this assignment are:
- Git
- Command Line (pipes, file operations)
- SSH (logging into remote systems)
- EMR
- S3 (Fetching data, creating a bucket, storing data)
- Hadoop, HDFS, MapReduce, YARN
- Writing mapper and reducer programs in R or Python to use with Hadoop Streaming
You must use the Amazon Cloud resources to run all the code in the assignment. You may develop your mapper/reducer code locally on your laptop but the code still needs to be run on the cloud.
You will be using Github Classroom for assignments. By using git and Github you will learn how to use this tool which is used for version control and should be a part of your analytics workflow. Also, I created a private repository for this course and that way your information remains private. The way it works is that you will need to accepting an invitation with a link to an assignment repository that will then create a private repository within your Github account that is based on the repository for the problem set. Once this private repository is created in your Github account, you will be able to clone it to either your local machine as well as your cluster, and make changes to files, and submit your assignments via this way.
Note: It is possible that my source repository that will be cloned for you may change if there are any issues with the assignment. If they do, I need to let you know so you can sync up your repository with mine.
Before you begin using git and Github, you will need to create a Github account if you don't already have one. Once you create your account, you will need to upload the public ssh key that you created for accessing your AWS resources. By adding your public ssh key to your Github account, you will be able to push changes back to the repository without having to login to GitHub.
- Login to your GitHub account and go to "Settings"
- In the "Settings" area, select "SSH and GPG keys"
- Click on "New SSH key"
- Create a title for your key (the name is whatever you want - just as in AWS)
- Copy the contents of your
id_rsa.pubfile into the "Key" box - Once you do this, you can test that your ssh key works with GitHub by opening a Terminal and typing
ssh -T git@github.com. If you are successful, you will see a message like this:
➜ ~ ssh -T git@github.com
Hi wahalulu! You've successfully authenticated, but GitHub does not provide shell access.I found a very useful Git tutorial which is somewhat humorous. I think it is worth it for you to read through this tutorial: git - the simple guide.
One thing to keep in mind is that whenever you want to use your Github repository on your cloud resources, you will need to connect to your remote machines and pass your ssh private key. This is very easy to do:
- Before you connect to your remote machine, type
ssh-addin your terminal. You will get a confirmation that looks something like this:
➜ ~ ssh-add
Identity added: /Users/marck/.ssh/id_rsa (/Users/marck/.ssh/id_rsa)
- You need to use the
-Aparameter in your ssh command:ssh -A loginname@remote.machine.ip - To test that the ssh-agent forwarded your key to the remote machine you can test the connection to GitHub as we did before:
ssh -T git@github.com.
To clone your assignment repository on your local or remote machine, you need to use the following command:
git clone git@github.com:my-repository-path/name.git. Note: each student's repository name will be different and unique. Make sure you clone your repository from your home directory (on the remote machines) to keep things simple and will create a sub-directory within your home directory that contains the contents of your repository.
Once you clone your repository, change directories and work within the repository directory: cd my-repository-path.
- Start an EMR cluster using the AWS Console with 1 master and 2 core nodes, just like we did in class.
- Once the cluster is up and running, type
ssh-addand then ssh into the master node, remember to use the-Aparameterssh -A hadoop@ip-of-master-node. - Once logged on, install git on your cluster's master node:
sudo yum install -y git - Make sure that your ssh agent was forwarded. Test with
ssh -T git@github.com - Clone your repository:
git@github.com:my-repository-path/name.git - List the files in your cluster's HDFS (will be empty). There are two ways to do it, both are equivalent:
- Type
hdfs dfs -ls - Type
hadoop fs -ls
- Type
- List the files in the course S3 bucket
s3://gwu-bigdata/. There are two ways to do this:-
Using the Hadoop Filesystem Shell Commands which also work with S3. You can list the contents of of a bucket by typing
hdfs -ls s3://gwu-bigdata/[hadoop@ip-172-31-76-170 ~]$ hdfs dfs -ls s3://gwu-bigdata/ Found 3 items drwxrwxrwx - hadoop hadoop 0 1970-01-01 00:00 s3://gwu-bigdata/data drwxrwxrwx - hadoop hadoop 0 1970-01-01 00:00 s3://gwu-bigdata/data-gz drwxrwxrwx - hadoop hadoop 0 1970-01-01 00:00 s3://gwu-bigdata/data-lzo -
Using the AWS Command Line Interface which is installed by default on all AWS resources. This is a Python based utility.
[hadoop@ip-172-31-76-170 ~]$ aws s3 ls s3://gwu-bigdata/ PRE data-gz/ PRE data-lzo/ PRE data/
-
In this problem, you will run a simulated MapReduce job on a small text file. I say simulated because you will not be using Hadoop Framework to do this but rather a combination of command line functions that resemble what happens when you run a Hadoop job on a cluster on a large file.
On page 50 of the book, there is an example of how to test your mapper and reducer. You will be using the same approach here.
There is a file in the repository called Meyers.txt which contains German text. There are two Python files: a mapper called wordcount_mapper.py and a reducer called wordcount_reducer.py. Open the files and look at the code so you get familiar with what is going on.
Your job is to pipe the mapper into sort into the reducer and write the output to a file called wordcount_results.txt.
For problems 2 and 3 you can start a new cluster with 5 nodes (1 master and 4 core) or resize the cluster you started for the Practice Lab and/or Problem 1.
For this problem, you will be working with various text files stored on S3 that are of in the 1-50 GB range. The files contain hypothetic measurements of a scientific instrument called a quazyilx that has been specially created for this class. Every few seconds the quazyilx makes four measurements: fnard, fnok, cark and gnuck. The output looks like this:
YYYY-MM-DDTHH:MM:SSZ fnard:10 fnok:4 cark:2 gnuck:9
(This time format is called ISO-8601 and it has the advantage that it is both unambiguous and that it sorts properly. The Z stands for Greenwich Mean Time or GMT, and is sometimes called Zulu Time because the NATO Phonetic Alphabet word for Z is Zulu.)
When one of the measurements is not present, the result is displayed as negative 1 (e.g. -1).
The quazyilx has been malfunctioning, and occasionally generates output with a -1 for all four measurements, like this:
2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1
There are four different versions of the quazyilx file, each of a different size. As you can see in the output below the file sizes are 50MB (1,000,000 rows), 4.8GB (100,000,000 rows), 18GB (369,865,098 rows) and 36.7GB (752,981,134 rows). The only difference is the length of the number of records, the file structure is the same.
[hadoop@ip-172-31-76-170]$ hdfs dfs -ls s3://gwu-bigdata/data/quaz*.txt
-rw-rw-rw- 1 hadoop hadoop 52443735 2017-05-19 13:35 s3://gwu-bigdata/data/quazyilx0.txt
-rw-rw-rw- 1 hadoop hadoop 5244417004 2017-05-19 13:35 s3://gwu-bigdata/data/quazyilx1.txt
-rw-rw-rw- 1 hadoop hadoop 19397230888 2017-05-19 13:35 s3://gwu-bigdata/data/quazyilx2.txt
-rw-rw-rw- 1 hadoop hadoop 39489364082 2017-05-19 13:35 s3://gwu-bigdata/data/quazyilx3.txt
Your job is to find all of the times where the four instruments malfunctioned together using three different methods. The easiest way to do this is with the grep command. Unfortunately, as you can see, the file is big. There are three ways that you can filter to find the bad records:
For this part, you will use the 4.8GB file s3://gwu-bigdata/data/quazyilx1.txt.
In this case, you will be copying the entire file from S3 to the master node of your cluster and then you will run a grep command to find the lines that match the pattern when all four instruments fail.
- SSH into the master node of your EMR cluster
- Copy the file from S3 to the local filesystem of the master node
- Use the
timecommand along withgrepto measure how long it takes to find the lines from the file that match the pattern, and use the Linux Pipe to write the filtered lines to a file calledp1a_results.txt. - By running the following commands, you will create a text file called
p1a_time.txtwith the results fromgrepand another filep1a_time.txtwith the output of the time command:
(time grep "fnard:-1 fnok:-1 cark:-1 gnuck:-1" \
quazyilx1.txt > p2_1_results.txt ) 2> p2_1_time.txt
The files to be committed to the repository for this task are p2_1_results.txt and p2_1_time.txt
In this case, you will stream the file from S3 to grep, the file does not get saved to disk.
(time aws s3 cp s3://gwu-bigdata/data/quazyilx1.txt - | \
grep "fnard:-1 fnok:-1 cark:-1 gnuck:-1" > p2_2_results.txt ) 2> p2_2_time.txt
The files to be committed to the repository for this task are p2_2_results.txt and p2_2_time.txt
In this section you will parallelize the grep with Hadoop Streaming. (To see how Hadoop Streaming has been modified for Amazon Map Reducer, please review the Amazon EMR documentation) Because of the minimal amount of computation done, these tasks are entirely I/O-bound.
You will run three Hadoop Streaming jobs, one for the 4.8GB file (task 3), one for the 18GB file (task 4) and one for the 36.7GB file (task 5).
In this case, our mapper program will be the grep tool with its parameters. The way to invoke the Hadoop Streaming Job is:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-D mapreduce.job.reduces=0 \
-D stream.non.zero.exit.is.failure=false \
-input [[input-file]] \
-output [[output-location]] \
-mapper "/bin/grep \"fnard:-1 fnok:-1 cark:-1 gnuck:-1\""
What does all this mean?
- The first line
hadoop jar /usr/lib/hadoop/hadoop-streaming.jaris launching Hadoop with the Hadoop Streaming jar. A jar is a Java Archive file, and Hadoop Streaming is a special kind of jar that allows you to run non Java programs. - The second line
-D mapreduce.job.reduces=0tells the job that you want zero reduce tasks. This is a map-only job since all we are doing is filtering and not aggregating. - The third line
-D stream.non.zero.exit.is.failure=falseis another parameter for the streaming job which tells Hadoop to not fail on task error. This is necessary in this case because Hadoop is expecting output from every map task, but since we are filtering the data the majority of the tasks will not have an output. Without this parameter, the job will fail. - The fourth line
-input [[input-file]]tells the job where your source file(s) are. These files need to be either in HDFS or S3. If you specify a directory, all files in the directory will be used as inputs - The fifth line
-output [[output-location]]tells the job where to store the output of the job, either in HDFS or S3. This parameter is just a name of a location, and it must not exist before running the job otherwise the job will fail. - The sixth line
-mapper "/bin/grep \"fnard:-1 fnok:-1 cark:-1 gnuck:-1\""is the actual mapper process.
When you finish running the Hadoop Streaming jobs, you will need to extract the results from HDFS using hdfs dfs -cat [[location_in_hdfs]] | sort > [[output_file]] and create and commit three files: p2_3_results.txt, p2_4_results.txt, p2_5_results.txt.
The file s3://gwu-bigdata/data/forensicswiki.2012.txt is a year's worth of Apache logs for the forensicswiki website. Each line of the log file correspondents to a single HTTP GET command sent to the web server. The log file is in the Combined Log Format.
If you look at the first few lines of the log file, you should be able to figure out the format. You can view the first 10 lines of the file with the command:
aws s3 cp s3://gwu-bigdata/data/forensicswiki.2012.txt - | head -10
At this point, you should understand why this command works.
Your goal in this part is to write mapper and reducer programs using Python3.4 or R with Hadoop Streaming that will report the number of hits for each month. For example, if there were 10,000 hits in January 2010 and 20,000 hits in February 2010, your output should look like this:
2010-01 10000
2010-02 20000
...
Where 10000 and 20000 are replaced by the actual number of hits in each month.
There are four starter files in the repository: logfile_mapper.py, logfile_reducer.py, logfile_mapper.R, logfile_reducer.R which contain mapper and reducer shells in both R and Python. You need to modify these files to make this work.
Here are some hints to solve the problem:
- Your mapper should read each line of the input file and output a key/value pair in the form
YYYY-MM\t1whereYYYY-MMis the year and the month of the log file,\tis the tab character, and1is the number one. - Your reducer should tally up the number of hits for each key and output the results.
- You will need to run the Hadoop Streaming job with the appropriate parameters (see Problem 2 for reference.)
- You will need to "ship" the mapper and reducer to each node in the cluster along with the job. This is done by using the
-filesparameter, so your job submission will look something like this: - You should not need to use any of the
-Dparameters you used in Problem 2
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-files file1, file2 \
...
When you finish running the Hadoop Streaming jobs, you will need to extract the results from HDFS using hdfs dfs -cat [[location_in_hdfs]] | sort > [[output_file]] and create and commit logfile_results.txt.
The files to be committed to the repository for this problem are your mapper, reducer, and logfile_results.txt.
Since we are using Github classroom, you will submit your assignment by "pushing" to your repo on GitHub. You will have to commit all the files that are requested (look at the Git tutorial to learn how to commit files.) After you commit, you will push your changes to your GitHub repository. I will be able to see
- I will look at the results files first. If the results files are what is expected, in the proper format, sorted, etc., then you will get full credit for the problem.
- If the expected results file is not what is expected I will look at the code and provide partial credit where applicable.
- Points will be deducted for each the following reasons:
- Instructions are not followed
- Output is not in expected format
- Output is not sorted
- There are more files in your repository than need to be (only the files that exist now should be there. They should be changed.)
- Additional lines in the results files (wether empty or not)