- Start an Amazon Elastic MapReduce (EMR) cluster using Quickstart with the following setup:
  - Give the cluster a name that is meaningful to you
  - Use Release `emr-5.12.0`
  - Select the first option under Applications (Core Hadoop: Hadoop 2.8.3 with Ganglia 3.7.2, Hive 2.3.2, Hue 4.1.0, Mahout 0.13.0, Pig 0.17.0, and Tez 0.8.4)
  - Select 1 master and 2 core nodes, using `m4.large` instance types
  - Select your correct EC2 key pair or you will not be able to connect to the cluster
  - Click Create Cluster
- Once the cluster is up and running and in "waiting" state, ssh into the master node:

  ```
  ssh hadoop@[[master-node-dns-name]]
  ```

- Install git on the master node:

  ```
  sudo yum install -y git
  ```

- Clone this repository to the master node. Note: since this is a public repository, you can use the HTTPS GitHub URL:

  ```
  git clone https://github.com/bigdatateaching/pig-hive.git
  ```

- Change directory into the lab:

  ```
  cd pig-hive
  ```
- Look at the contents of the file `pigdemo.txt`:

  ```
  [hadoop@ip-172-31-2-208 pig-hive]$ cat pigdemo.txt
  SD Rich
  NV Barry
  CO George
  CA Ulf
  IL Danielle
  OH Tom
  CA manish
  CA Brian
  CO Mark
  ```
- Start the Grunt shell:

  ```
  pig
  ```

- You can run HDFS commands from the Grunt shell:

  ```
  grunt> ls
  ```

- Make a directory within the cluster HDFS called `pig-hive-lab`:

  ```
  grunt> mkdir pig-hive-lab
  ```

- Copy the `pigdemo.txt` file from the local filesystem to HDFS:

  ```
  grunt> copyFromLocal pigdemo.txt pig-hive-lab/
  ```

- Check to make sure the file was copied:

  ```
  grunt> ls pig-hive-lab/
  hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/pig-hive-lab/pigdemo.txt<r 2>  80
  ```
- Define the `employees` relation and load data from the `pigdemo.txt` file (from HDFS; remember, you just copied the file from the local filesystem to HDFS), using a schema with field names `state` and `name`:

  ```
  grunt> employees = LOAD 'pig-hive-lab/pigdemo.txt' AS (state, name);
  ```
- Use the `describe` command to see what the `employees` relation looks like:

  ```
  grunt> describe employees;
  employees: {state: bytearray,name: bytearray}
  ```
- Use `DUMP` to see the contents of the `employees` relation:

  ```
  grunt> DUMP employees
  ... lots of text from the spawned MapReduce job ...
  (SD,Rich)
  (NV,Barry)
  (CO,George)
  (CA,Ulf)
  (IL,Danielle)
  (OH,Tom)
  (CA,manish)
  (CA,Brian)
  (CO,Mark)
  ```
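Conceptually, `LOAD ... AS (state, name)` splits each tab-delimited line (tab is Pig's default field delimiter) into a two-field tuple. A minimal Python sketch of that behavior, with the sample records hard-coded here rather than read from HDFS:

```python
# Simulate Pig's LOAD with a (state, name) schema: split each
# tab-delimited line into a tuple of fields.
raw_lines = [
    "SD\tRich", "NV\tBarry", "CO\tGeorge", "CA\tUlf", "IL\tDanielle",
    "OH\tTom", "CA\tmanish", "CA\tBrian", "CO\tMark",
]
employees = [tuple(line.split("\t")) for line in raw_lines]
for record in employees:
    print(record)  # first record is ('SD', 'Rich')
```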
- Create a new relation called `ca_only` and filter the `employees` relation to get only the records where the state is `CA` (California):

  ```
  grunt> ca_only = FILTER employees BY (state=='CA');
  681844 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  18/02/26 17:23:01 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  grunt>
  ```
- See the contents of `ca_only`. The output is still tuples, but only the records that match the filter:

  ```
  grunt> DUMP ca_only
  ... lots of text from the spawned MapReduce job ...
  (CA,Ulf)
  (CA,manish)
  (CA,Brian)
  ```
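Pig's `FILTER` keeps only the tuples whose `state` field equals `'CA'`. A rough Python equivalent over the same sample data:

```python
# Simulate Pig's: ca_only = FILTER employees BY (state == 'CA');
employees = [
    ("SD", "Rich"), ("NV", "Barry"), ("CO", "George"), ("CA", "Ulf"),
    ("IL", "Danielle"), ("OH", "Tom"), ("CA", "manish"), ("CA", "Brian"),
    ("CO", "Mark"),
]
ca_only = [(state, name) for state, name in employees if state == "CA"]
print(ca_only)  # [('CA', 'Ulf'), ('CA', 'manish'), ('CA', 'Brian')]
```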
- Create a new relation called `emp_group` where records are grouped by state:

  ```
  grunt> emp_group = GROUP employees BY state;
  1186375 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  18/02/26 17:31:25 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  ```
- Describe `emp_group`. Bags represent groups in Pig. A bag is an unordered collection of tuples:

  ```
  grunt> describe emp_group
  emp_group: {group: bytearray,employees: {(state: bytearray,name: bytearray)}}
  ```
- See the contents of `emp_group`:

  ```
  grunt> DUMP emp_group
  ... lots of text from the spawned MapReduce job ...
  (CA,{(CA,Ulf),(CA,manish),(CA,Brian)})
  (CO,{(CO,George),(CO,Mark)})
  (IL,{(IL,Danielle)})
  (NV,{(NV,Barry)})
  (OH,{(OH,Tom)})
  (SD,{(SD,Rich)})
  ```
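`GROUP ... BY state` produces one record per distinct state, pairing the group key with a bag containing every matching tuple in full. A Python sketch of the same grouping, using a dict of lists to stand in for Pig's bags:

```python
from collections import defaultdict

# Simulate Pig's: emp_group = GROUP employees BY state;
employees = [
    ("SD", "Rich"), ("NV", "Barry"), ("CO", "George"), ("CA", "Ulf"),
    ("IL", "Danielle"), ("OH", "Tom"), ("CA", "manish"), ("CA", "Brian"),
    ("CO", "Mark"),
]
emp_group = defaultdict(list)
for state, name in employees:
    emp_group[state].append((state, name))  # the whole tuple goes into the bag

print(emp_group["CA"])  # [('CA', 'Ulf'), ('CA', 'manish'), ('CA', 'Brian')]
print(len(emp_group))   # 6 distinct states, matching the 6 records STORE reports
```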
- Use the `STORE` command to write the `emp_group` relation back to HDFS (or S3). Notice that the name given is that of a directory, not of a file:

  ```
  grunt> STORE emp_group INTO 'emp_group';
  ... lots of text from the spawned MapReduce job ...
  Input(s):
  Successfully read 9 records (80 bytes) from: "hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/pig-hive-lab/pigdemo.txt"
  Output(s):
  Successfully stored 6 records (128 bytes) in: "hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/emp_group"
  ```
- To see all the relations that you have created:

  ```
  grunt> aliases
  ```
- Exit the Grunt shell if you haven't already:

  ```
  grunt> quit;
  ```
- Make sure you are within the lab directory. You can check by using the Linux command `pwd` (print working directory). You should be in `/home/hadoop/pig-hive`; if not, change to it.
- Unzip the White House visits dataset: `unzip whitehouse_visits.zip`. This creates a file called `whitehouse_visits.txt`.
- Create a directory within HDFS:

  ```
  hadoop fs -mkdir whitehouse
  ```
- Copy the `whitehouse_visits.txt` file from the local filesystem into HDFS (note the file has been renamed within HDFS):

  ```
  hadoop fs -put whitehouse_visits.txt whitehouse/visits.txt
  ```
- Explore the contents of the Pig file `wh_visits.pig`. This is a pre-written Pig script that loads `visits.txt`, extracts certain fields, filters, and writes the output to a location in HDFS:

  ```
  cat wh_visits.pig
  ```
- Run the Pig script:

  ```
  pig wh_visits.pig
  ```
- You should have a set of new files in the `hive_wh_visits` directory inside HDFS:

  ```
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -ls
  Found 3 items
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 20:29 hive_wh_visits
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 18:14 pig-hive-lab
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 20:28 whitehouse
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -ls hive_wh_visits
  Found 3 items
  -rw-r--r--   1 hadoop hadoop          0 2018-02-26 20:29 hive_wh_visits/_SUCCESS
  -rw-r--r--   1 hadoop hadoop     971339 2018-02-26 20:29 hive_wh_visits/part-v000-o000-r-00000
  -rw-r--r--   1 hadoop hadoop     142850 2018-02-26 20:28 hive_wh_visits/part-v000-o000-r-00001
  ```
- Explore the contents of the results file/files:

  ```
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -cat hive_wh_visits/part-v000-o000-r-00000 | head
  BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
  CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
  PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
  LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/
  cat: Unable to write to output stream.
  ```
- You will use this file as an input to Hive in the next exercise.
- Start the Hive console:

  ```
  [hadoop@ip-172-31-18-99 pig-hive]$ hive
  Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
  hive>
  ```
- Create an external table called `wh_visits` from the files created by the Pig output:

  ```
  create external table wh_visits (
      lname string,
      fname string,
      time_of_arrival string,
      appt_scheduled_time string,
      meeting_location string,
      info_comment string
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/user/hadoop/hive_wh_visits/';
  ```
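An external table copies no data; Hive simply interprets each tab-delimited line under `/user/hadoop/hive_wh_visits/` using the declared column order. A Python sketch of that row-to-column mapping, using a sample line shaped like the Pig output above (the exact line is illustrative, not taken from the real file):

```python
# How Hive maps one tab-delimited line onto the wh_visits columns.
columns = ["lname", "fname", "time_of_arrival", "appt_scheduled_time",
           "meeting_location", "info_comment"]

# Sample tab-separated line in the shape of the Pig output (hypothetical values).
line = "MOHAN\tEDWARD\t10/13/2010 12:37\t10/13/2010 13:00\tWH\tBILL SIGNING/"
row = dict(zip(columns, line.split("\t")))
print(row["lname"], row["meeting_location"])  # MOHAN WH
```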
- Write a SQL query to count the number of records in the `wh_visits` table:

  ```
  select count(*) from wh_visits;
  ```
- Show the first 20 records of the `wh_visits` table:

  ```
  select * from wh_visits limit 20;
  ```
- Write a SQL query that counts the records in `wh_visits` where the comment field is not empty.
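In plain terms, the query should count only rows whose `info_comment` field contains something. A small Python illustration of that condition on made-up rows (the HiveQL itself is left as the exercise):

```python
# Count rows whose comment field is non-empty (illustrating the condition,
# not the HiveQL syntax). The rows here are hypothetical sample data.
rows = [
    {"lname": "BUCKLEY", "info_comment": ""},
    {"lname": "LANIER", "info_comment": "BILL SIGNING/"},
    {"lname": "MAYNARD", "info_comment": "BILL SIGNING/"},
]
non_empty = sum(1 for r in rows if r["info_comment"] != "")
print(non_empty)  # 2
```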
- Once you figure out the correct SQL statement, create a CSV file in HDFS with the results of the query from the previous step. The syntax to use is the following:

  ```
  INSERT OVERWRITE DIRECTORY '[[hdfs/s3 output directory]]'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  select ...;
  ```

  Where:
  - `[[hdfs/s3 output directory]]` is the name of a directory to be created, where the results files from the MapReduce process will be placed.
  - `select ...` is the query.