- Start an Amazon Elastic MapReduce (EMR) cluster using Quickstart with the following setup:
  - Give the cluster a name that is meaningful to you
  - Use Release `emr-5.12.0`
  - Select the first option under Applications (Core Hadoop: Hadoop 2.8.3 with Ganglia 3.7.2, Hive 2.3.2, Hue 4.1.0, Mahout 0.13.0, Pig 0.17.0, and Tez 0.8.4)
  - Select 1 master and 2 core nodes, using `m4.large` instance types
  - Select your correct EC2 key pair or you will not be able to connect to the cluster
  - Click Create Cluster
- Once the cluster is up and running and in "waiting" state, ssh into the master node:

  ```
  ssh hadoop@[[master-node-dns-name]]
  ```

- Install git on the master node:

  ```
  sudo yum install -y git
  ```

- Clone this repository to the master node. Note: since this is a public repository, you can use the HTTPS GitHub URL:

  ```
  git clone https://github.com/bigdatateaching/pig-hive.git
  ```

- Change directory into the lab:

  ```
  cd pig-hive
  ```
- Look at the contents of the file `pigdemo.txt`:

  ```
  [hadoop@ip-172-31-2-208 pig-hive]$ cat pigdemo.txt
  SD Rich
  NV Barry
  CO George
  CA Ulf
  IL Danielle
  OH Tom
  CA manish
  CA Brian
  CO Mark
  ```
- Start the Grunt shell:

  ```
  pig
  ```

- You can run HDFS commands from the Grunt shell:

  ```
  grunt> ls
  ```

- Make a directory within the cluster HDFS called `pig-hive-lab`:

  ```
  grunt> mkdir pig-hive-lab
  ```

- Copy the `pigdemo.txt` file from the local filesystem to HDFS:

  ```
  grunt> copyFromLocal pigdemo.txt pig-hive-lab/
  ```

- Check to make sure the file was copied:

  ```
  grunt> ls pig-hive-lab/
  hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/pig-hive-lab/pigdemo.txt<r 2>  80
  ```
- Define the `employees` relation and load data from the `pigdemo.txt` file (from HDFS; remember, you just copied the file from the local filesystem to HDFS), using a schema with field names `state` and `name`:

  ```
  grunt> employees = LOAD 'pig-hive-lab/pigdemo.txt' AS (state, name);
  ```
- Use the `describe` command to see what the `employees` relation looks like:

  ```
  grunt> describe employees;
  employees: {state: bytearray,name: bytearray}
  ```
- Use `DUMP` to see the contents of the `employees` relation:

  ```
  grunt> DUMP employees
  ... lots of text from the spawned MapReduce job ...
  (SD,Rich)
  (NV,Barry)
  (CO,George)
  (CA,Ulf)
  (IL,Danielle)
  (OH,Tom)
  (CA,manish)
  (CA,Brian)
  (CO,Mark)
  ```
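Conceptually, `LOAD ... AS (state, name)` splits each tab-delimited line (tab is Pig's default field delimiter) into a two-field tuple. A minimal Python sketch of that behavior, with the sample records hard-coded here rather than read from HDFS:

```python
# Simulate Pig's LOAD with a (state, name) schema: split each
# tab-delimited line into a tuple of fields.
raw_lines = [
    "SD\tRich", "NV\tBarry", "CO\tGeorge", "CA\tUlf", "IL\tDanielle",
    "OH\tTom", "CA\tmanish", "CA\tBrian", "CO\tMark",
]
employees = [tuple(line.split("\t")) for line in raw_lines]
for record in employees:
    print(record)  # first record is ('SD', 'Rich')
```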
- Create a new relation called `ca_only` and filter the `employees` relation to get only the records where the state is `CA` (California):

  ```
  grunt> ca_only = FILTER employees BY (state=='CA');
  681844 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  18/02/26 17:23:01 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  grunt>
  ```
- See the contents of `ca_only`. The output is still tuples, but only the records that match the filter:

  ```
  grunt> DUMP ca_only
  ... lots of text from the spawned MapReduce job ...
  (CA,Ulf)
  (CA,manish)
  (CA,Brian)
  ```
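Pig's `FILTER` keeps only the tuples whose `state` field equals `'CA'`. A rough Python equivalent over the same sample data:

```python
# Simulate Pig's: ca_only = FILTER employees BY (state == 'CA');
employees = [
    ("SD", "Rich"), ("NV", "Barry"), ("CO", "George"), ("CA", "Ulf"),
    ("IL", "Danielle"), ("OH", "Tom"), ("CA", "manish"), ("CA", "Brian"),
    ("CO", "Mark"),
]
ca_only = [(state, name) for state, name in employees if state == "CA"]
print(ca_only)  # [('CA', 'Ulf'), ('CA', 'manish'), ('CA', 'Brian')]
```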
- Create a new relation called `emp_group` where records are grouped by state:

  ```
  grunt> emp_group = GROUP employees BY state;
  1186375 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  18/02/26 17:31:25 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
  ```
- Describe `emp_group`. Bags represent groups in Pig. A bag is an unordered collection of tuples:

  ```
  grunt> describe emp_group
  emp_group: {group: bytearray,employees: {(state: bytearray,name: bytearray)}}
  ```
- See the contents of `emp_group`:

  ```
  grunt> DUMP emp_group
  ... lots of text from the spawned MapReduce job ...
  (CA,{(CA,Ulf),(CA,manish),(CA,Brian)})
  (CO,{(CO,George),(CO,Mark)})
  (IL,{(IL,Danielle)})
  (NV,{(NV,Barry)})
  (OH,{(OH,Tom)})
  (SD,{(SD,Rich)})
  ```
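`GROUP ... BY state` produces one record per distinct state, pairing the group key with a bag containing every matching tuple in full. A Python sketch of the same grouping, using a dict of lists to stand in for Pig's bags:

```python
from collections import defaultdict

# Simulate Pig's: emp_group = GROUP employees BY state;
employees = [
    ("SD", "Rich"), ("NV", "Barry"), ("CO", "George"), ("CA", "Ulf"),
    ("IL", "Danielle"), ("OH", "Tom"), ("CA", "manish"), ("CA", "Brian"),
    ("CO", "Mark"),
]
emp_group = defaultdict(list)
for state, name in employees:
    emp_group[state].append((state, name))  # the whole tuple goes into the bag

print(emp_group["CA"])  # [('CA', 'Ulf'), ('CA', 'manish'), ('CA', 'Brian')]
print(len(emp_group))   # 6 distinct states, matching the 6 records STORE reports
```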
- Use the `STORE` command to write the `emp_group` relation back to HDFS (or S3). Notice that the name given is that of a directory, not of a file:

  ```
  grunt> STORE emp_group INTO 'emp_group';
  ... lots of text from the spawned MapReduce job ...
  Input(s):
  Successfully read 9 records (80 bytes) from: "hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/pig-hive-lab/pigdemo.txt"
  Output(s):
  Successfully stored 6 records (128 bytes) in: "hdfs://ip-172-31-2-208.ec2.internal:8020/user/hadoop/emp_group"
  ```
- To see all the relations that you have created:

  ```
  grunt> aliases
  ```
- Exit the Grunt shell if you haven't already:

  ```
  grunt> quit;
  ```
- Make sure you are within the lab directory. You can check by using the Linux command `pwd` (print working directory). You should be in `/home/hadoop/pig-hive`; if not, change to it.
- Unzip the White House visits dataset: `unzip whitehouse_visits.zip`. This creates a file called `whitehouse_visits.txt`.
- Create a directory within HDFS:

  ```
  hadoop fs -mkdir whitehouse
  ```
- Copy the `whitehouse_visits.txt` file from the local filesystem into HDFS (note the file has been renamed within HDFS):

  ```
  hadoop fs -put whitehouse_visits.txt whitehouse/visits.txt
  ```
- Explore the contents of the Pig file `wh_visits.pig`. This is a pre-written Pig script that loads `visits.txt`, extracts certain fields, filters, and writes the output to a location in HDFS:

  ```
  cat wh_visits.pig
  ```
- Run the Pig script:

  ```
  pig wh_visits.pig
  ```
- You should have a set of new files in the `hive_wh_visits` directory inside HDFS:

  ```
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -ls
  Found 3 items
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 20:29 hive_wh_visits
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 18:14 pig-hive-lab
  drwxr-xr-x   - hadoop hadoop          0 2018-02-26 20:28 whitehouse
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -ls hive_wh_visits
  Found 3 items
  -rw-r--r--   1 hadoop hadoop          0 2018-02-26 20:29 hive_wh_visits/_SUCCESS
  -rw-r--r--   1 hadoop hadoop     971339 2018-02-26 20:29 hive_wh_visits/part-v000-o000-r-00000
  -rw-r--r--   1 hadoop hadoop     142850 2018-02-26 20:28 hive_wh_visits/part-v000-o000-r-00001
  ```
- Explore the contents of the results file/files:

  ```
  [hadoop@ip-172-31-18-99 lab06-spring-2018]$ hadoop fs -cat hive_wh_visits/part-v000-o000-r-00000 | head
  BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
  CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
  PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
  LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
  MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
  MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/
  cat: Unable to write to output stream.
  ```
- You will use this file as an input to Hive in the next exercise.
- Start the Hive console:

  ```
  [hadoop@ip-172-31-18-99 pig-hive]$ hive
  Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false
  hive>
  ```
- Create an external table called `wh_visits` from the files created by the Pig output:

  ```
  create external table wh_visits (
      lname string,
      fname string,
      time_of_arrival string,
      appt_scheduled_time string,
      meeting_location string,
      info_comment string
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/user/hadoop/hive_wh_visits/';
  ```
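An external table copies no data; Hive simply interprets each tab-delimited line under `/user/hadoop/hive_wh_visits/` using the declared column order. A Python sketch of that row-to-column mapping, using a sample line shaped like the Pig output above (the exact line is illustrative, not taken from the real file):

```python
# How Hive maps one tab-delimited line onto the wh_visits columns.
columns = ["lname", "fname", "time_of_arrival", "appt_scheduled_time",
           "meeting_location", "info_comment"]

# Sample tab-separated line in the shape of the Pig output (hypothetical values).
line = "MOHAN\tEDWARD\t10/13/2010 12:37\t10/13/2010 13:00\tWH\tBILL SIGNING/"
row = dict(zip(columns, line.split("\t")))
print(row["lname"], row["meeting_location"])  # MOHAN WH
```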
- Write a SQL query to count the number of records in the `wh_visits` table:

  ```
  select count(*) from wh_visits;
  ```
- Show the first 20 records of the `wh_visits` table:

  ```
  select * from wh_visits limit 20;
  ```
- Write a SQL query that counts the records in `wh_visits` where the comment field is not empty.
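In plain terms, the query should count only rows whose `info_comment` field contains something. A small Python illustration of that condition on made-up rows (the HiveQL itself is left as the exercise):

```python
# Count rows whose comment field is non-empty (illustrating the condition,
# not the HiveQL syntax). The rows here are hypothetical sample data.
rows = [
    {"lname": "BUCKLEY", "info_comment": ""},
    {"lname": "LANIER", "info_comment": "BILL SIGNING/"},
    {"lname": "MAYNARD", "info_comment": "BILL SIGNING/"},
]
non_empty = sum(1 for r in rows if r["info_comment"] != "")
print(non_empty)  # 2
```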
- Once you figure out the correct SQL statement, create a CSV file in HDFS with the results of the query from the previous step. The syntax to use is the following:

  ```
  INSERT OVERWRITE DIRECTORY '[[hdfs/s3 output directory]]'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  select ...;
  ```

  Where:
  - `[[hdfs/s3 output directory]]` is the name of a directory to be created, where the results files from the MapReduce process will be placed.
  - `select ...` is the query.