A simple Word Count Example using pyspark on AWS EMR

Clone the repository

Clone this repo to your local machine.

Create cluster

We'll start off by creating an AWS EMR cluster, just as in the first assignment. Head over to AWS EMR and get started.
Click Create cluster and fill in the cluster configuration.

The cluster remains in the 'Starting' state for about 10 to 15 minutes. Once it is ready for use, the status changes to 'Waiting' and you can start using it.
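If you'd rather not keep refreshing the console, the wait can be scripted. The following is a minimal sketch, assuming boto3 is installed and configured with your AWS credentials; the helper name `wait_until_ready` and the cluster ID are placeholders for illustration and are not part of this repo. It polls the EMR `describe_cluster` call until the state becomes `WAITING`.

```python
import time


def wait_until_ready(emr_client, cluster_id, poll_seconds=30, max_polls=60,
                     sleep=time.sleep):
    """Poll the cluster state until it is WAITING (or the cluster has gone away).

    emr_client is expected to be boto3.client("emr"); cluster_id is the
    "j-..." ID shown on the cluster page. Returns the last state seen.
    """
    stopped = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}
    state = None
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING" or state in stopped:
            return state
        sleep(poll_seconds)
    return state


# Example usage (hypothetical cluster ID):
#   import boto3
#   print(wait_until_ready(boto3.client("emr"), "j-XXXXXXXXXXXXX"))
```

The client is passed in as an argument so the function can be tried out with a stand-in object before pointing it at a real cluster.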

Create private key for ssh access

Click the "Learn how to create an EC2 key pair" link and follow it to create your EC2 key pair.

Allow inbound SSH traffic on the master node

In the top left corner, go to Services->EC2
In the left-hand panel, go to Security Groups under Network & Security
Select the group named "ElasticMapReduce-master" and click Edit in the Inbound tab below
Click Add rule, select SSH as the type and My IP as the source, then save
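The same inbound rule can be added from a script. This is only a sketch, assuming boto3 is installed and configured; the helper names `ssh_ingress_rule` and `allow_ssh`, the security group ID, and the IP address are placeholders for illustration. It builds a rule that opens TCP port 22 to a single IP and passes it to the EC2 `authorize_security_group_ingress` call.

```python
def ssh_ingress_rule(my_ip):
    """Build an inbound rule that allows SSH (TCP port 22) from one IPv4 address."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 restricts the rule to exactly this address, like "My IP" in the console
        "IpRanges": [{"CidrIp": "{}/32".format(my_ip),
                      "Description": "SSH from my IP"}],
    }


def allow_ssh(ec2_client, group_id, my_ip):
    """Add the SSH rule to the given security group.

    ec2_client is expected to be boto3.client("ec2"); group_id is the ID of
    the "ElasticMapReduce-master" group.
    """
    rule = ssh_ingress_rule(my_ip)
    ec2_client.authorize_security_group_ingress(GroupId=group_id,
                                                IpPermissions=[rule])
    return rule


# Example usage (hypothetical group ID and IP):
#   import boto3
#   allow_ssh(boto3.client("ec2"), "sg-0123456789abcdef0", "203.0.113.7")
```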

Upload input file on S3

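Now head over to Services->S3 and create a bucket named csds (bucket names are globally unique, so pick a different name if this one is already taken)
In the bucket, create a folder named csds-spark-emr
Upload the input.txt file from this repo
In the permissions step, tick the box that grants read access to everyone. Nothing needs to change in the properties step
Continue through the remaining steps and submit the upload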
Click on the uploaded file and click the Make public button just to make sure
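The upload can also be done from Python. The following is a minimal sketch, assuming boto3 is installed and configured; `s3_uri` and `upload_input` are hypothetical helper names, not part of this repo. It uploads the file with the S3 client's `upload_file` call and returns the `s3://` URL you will later put into wordcount.py. It does not change the object's permissions.

```python
import os


def s3_uri(bucket, folder, filename):
    """Return the s3:// URL for a file stored under a folder in a bucket."""
    return "s3://{}/{}/{}".format(bucket, folder.strip("/"), filename)


def upload_input(s3_client, local_path, bucket, folder):
    """Upload a local file into bucket/folder and return its s3:// URL.

    s3_client is expected to be boto3.client("s3").
    """
    filename = os.path.basename(local_path)
    key = "{}/{}".format(folder.strip("/"), filename)
    s3_client.upload_file(local_path, bucket, key)
    return s3_uri(bucket, folder, filename)


# Example usage, with the bucket and folder names from the steps above:
#   import boto3
#   print(upload_input(boto3.client("s3"), "input.txt", "csds", "csds-spark-emr"))
```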

Creating wordcount.py on the Master node

Open the page for the cluster we created (Cluster list->our cluster)
Next to the "Master public DNS:" field, click the SSH button
Follow the instructions and SSH into the master node
In /home/hadoop, create wordcount.py (vi wordcount.py)
Copy over the contents of wordcount.py from this repo
In wordcount.py, change the input file S3 URL to point to input.txt in the bucket you created above
Save the file
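For orientation, a PySpark word count generally has the shape sketched below. This is not the wordcount.py from this repo, which may differ; the function names and the example S3 URL are assumptions for illustration. It reads the text file, splits each line on whitespace, pairs every word with a 1, and sums the pairs per word.

```python
def split_words(line):
    """Split a line of text into words on whitespace."""
    return line.split()


def count_words(sc, input_url):
    """Return a list of (word, count) pairs for the text file at input_url."""
    lines = sc.textFile(input_url)
    counts = (lines.flatMap(split_words)            # one record per word
                   .map(lambda word: (word, 1))     # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))  # sum the 1s per word
    return counts.collect()


def main(input_url):
    # Imported here so the helpers above can be read and tried without Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    try:
        for word, count in count_words(sc, input_url):
            print("{}: {}".format(word, count))
    finally:
        sc.stop()


# Example usage on the master node (hypothetical URL; use your own bucket):
#   main("s3://csds/csds-spark-emr/input.txt")
```

Because `collect()` brings every pair back to the driver, this shape only suits small inputs such as input.txt.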

Executing wordcount.py

Go through the code in wordcount.py and check out what it does
Execute the script using "spark-submit wordcount.py | tee output.txt"
This also writes a copy of the printed output to output.txt
You can copy the output file to your S3 bucket with the command "aws s3 cp output.txt s3://my_bucket/my_folder/"
You should see the result of your code among other logs; it should look like this:

And: 2
on: 1
then: 1
Aberbrothok: 2
bell: 1
that: 1
of: 2
knew: 1
Had: 1
placed: 1
Abbot: 2
they: 1
worthy: 1
blest: 1
Rock: 2
Inchcape: 1
the: 3
The: 1
perilous: 1
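Since the counts can be mixed in with other log lines, it may help to pull them out of output.txt afterwards. Here is a small sketch, assuming each result line has the exact "word: count" form shown above; the helper name `extract_counts` is made up for this example.

```python
import re

# A result line is a single token, a colon, a space, and an integer.
COUNT_LINE = re.compile(r"^(\S+): (\d+)$")


def extract_counts(lines):
    """Return a dict of word -> count built from the lines that look like results."""
    counts = {}
    for line in lines:
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Example usage:
#   with open("output.txt") as handle:
#       print(extract_counts(handle))
```

Lines that contain spaces before the colon, such as timestamped log messages, do not match the pattern and are skipped.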

You're encouraged to play around with the code, check out the documentation and try things out

Terminate the cluster

Don't forget to terminate your cluster when you're done
You'll need to follow the same steps each time you create a new cluster, except for creating the private key for SSH; you can reuse the same private key for all clusters
Also make sure to allow inbound SSH traffic on the master node again whenever your machine's IP changes, which can happen when you switch between WiFi networks
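Termination can be scripted as well, which makes it harder to forget. This is a minimal sketch, assuming boto3 is installed and configured; `terminate_cluster` is a hypothetical helper name and the cluster ID is a placeholder. It uses the EMR `terminate_job_flows` call, which takes a list of cluster IDs.

```python
def terminate_cluster(emr_client, cluster_id):
    """Ask EMR to shut down the cluster with the given "j-..." ID.

    emr_client is expected to be boto3.client("emr"). The call returns right
    away; the cluster moves to a terminating state and shuts down shortly after.
    """
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    return cluster_id


# Example usage (hypothetical cluster ID):
#   import boto3
#   terminate_cluster(boto3.client("emr"), "j-XXXXXXXXXXXXX")
```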
