Aliga8or/csds-spark-emr

A simple Word Count Example using pyspark on AWS EMR

Clone the repository

  • Clone this repo to your local machine.

Create cluster

  • Start by creating an AWS EMR cluster, just as in the first assignment. Head over to AWS EMR to get started.
  • Click Create cluster and configure it as shown below:

(Screenshot: cluster configuration settings)

  • The cluster stays in the 'Starting' state for about 10 to 15 minutes. Once it is ready for use, the status changes to 'Waiting' and you can start using it.
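
If you would rather wait for the cluster from a script than refresh the console, the check above can be sketched as a polling loop. This is a minimal sketch, not part of this repo: the function names, the 30-second interval and the 30-minute timeout are assumptions, and `emr_state_getter` needs boto3 installed and AWS credentials configured. The API reports states in upper case (`STARTING`, `WAITING`).

```python
import time

# RUNNING means a step is executing; the master node is reachable then too.
READY_STATES = {"WAITING", "RUNNING"}
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(get_state, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_state() until the cluster is usable, has failed, or the timeout expires."""
    waited = 0
    while True:
        state = get_state()
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


def emr_state_getter(cluster_id, region_name=None):
    """Return a zero-argument function that reads the cluster state through boto3."""
    import boto3  # imported lazily so the polling logic has no AWS dependency

    emr = boto3.client("emr", region_name=region_name)
    return lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
```

With a real cluster ID from the console, `wait_until_ready(emr_state_getter(cluster_id))` returns once the cluster can be used.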

Create private key for SSH access

  • Click "Learn how to create an EC2 key pair" and follow the linked instructions to create your EC2 key pair and download the private key.

Allow inbound SSH traffic on the master node

  • In the top-left corner, go to Services->EC2.
  • In the left-hand panel, go to Security Groups under Network & Security.
  • Select the group named "ElasticMapReduce-master" and click Edit in the Inbound tab below.
  • Click Add rule, select SSH as the type and My IP as the source, then save.
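
For reference, the rule that the console steps above create is TCP port 22 opened to a single address. A rough boto3 equivalent is sketched below; the helper names are hypothetical, the sketch assumes an IPv4 address, and `allow_ssh` needs boto3, AWS credentials and the ID of the "ElasticMapReduce-master" group, none of which this repo provides.

```python
import ipaddress


def ssh_ingress_rule(my_ip):
    """Build the rule that 'SSH / My IP' creates: TCP port 22 from one address."""
    address = ipaddress.ip_address(my_ip)  # raises ValueError on a malformed address
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 restricts the rule to exactly one address
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": "SSH from my IP"}],
    }


def allow_ssh(group_id, my_ip, region_name=None):
    """Add the SSH rule to the given security group (requires AWS credentials)."""
    import boto3  # imported lazily so the rule builder works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_rule(my_ip)],
    )
```

Because the rule is tied to one address, it has to be added again whenever your public IP changes.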

Upload input file on S3

  • Head over to Services->S3 and create a bucket named csds.
  • In the bucket, create a folder named csds-spark-emr.
  • Upload the input.txt file from this repo into that folder.
  • In the permissions step, grant public read access. Leave the properties step unchanged.
  • Continue through the remaining steps and upload the file.
  • Open the uploaded file and click the Make public button to make sure it is publicly readable.
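
The upload can also be done from your local machine with boto3. The sketch below is an assumption-laden alternative to the console steps: the helper names are made up for illustration, it needs boto3 and AWS credentials, and it does not change any permissions. It also shows how the bucket, folder and file name combine into the S3 URL used later in wordcount.py.

```python
import os
import posixpath


def s3_key(folder, filename):
    """Join a folder and a file name into an S3 object key."""
    return posixpath.join(folder.strip("/"), filename)


def s3_url(bucket, key):
    """Return the s3:// URL for an object."""
    return f"s3://{bucket}/{key}"


def upload_input(local_path, bucket, folder, region_name=None):
    """Upload a local file into bucket/folder and return its s3:// URL."""
    import boto3  # imported lazily so the helpers above work without boto3

    key = s3_key(folder, os.path.basename(local_path))
    boto3.client("s3", region_name=region_name).upload_file(local_path, bucket, key)
    return s3_url(bucket, key)
```

With the names used above, `upload_input("input.txt", "csds", "csds-spark-emr")` would return `s3://csds/csds-spark-emr/input.txt`.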

Creating wordcount.py on the Master node

  • Open the page for the cluster you created (Cluster list->your cluster).
  • Next to the "Master public DNS:" field, click the SSH button.
  • Follow the instructions shown to SSH into the master node.
  • In /home/hadoop, create wordcount.py (vi wordcount.py).
  • Copy in the contents of wordcount.py from this repo.
  • In wordcount.py, change the input file S3 URL to point to the input.txt in the bucket you created above.
  • Save the file.
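
To show where the S3 URL fits, here is a minimal sketch of what a PySpark word count of this kind typically looks like. It is not the wordcount.py from this repo, which remains the file to copy: the names `INPUT_URL`, `split_words`, `format_count` and `main` are assumptions, as is splitting on whitespace without changing case.

```python
from operator import add

# Assumption: bucket and folder names from the S3 step; replace with your own.
INPUT_URL = "s3://csds/csds-spark-emr/input.txt"


def split_words(line):
    """Split a line on whitespace; case and punctuation are left untouched."""
    return line.split()


def format_count(pair):
    """Render a (word, count) pair as 'word: count'."""
    word, count = pair
    return f"{word}: {count}"


def main(input_url=INPUT_URL):
    from pyspark.sql import SparkSession  # provided by Spark on the master node

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    lines = spark.sparkContext.textFile(input_url)
    counts = (
        lines.flatMap(split_words)      # one record per word
        .map(lambda word: (word, 1))    # pair each word with a count of 1
        .reduceByKey(add)               # sum the counts per word
    )
    for pair in counts.collect():
        print(format_count(pair))
    spark.stop()
```

The sketch only defines the job; adding a call to `main()` at the bottom of the file makes it run under spark-submit.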

Executing wordcount.py

  • Read through the code in wordcount.py to see what it does.
  • Execute the script using "spark-submit wordcount.py | tee output.txt".
  • This also saves a copy of the console output to output.txt.
  • You can copy the output file to your S3 bucket with the command "aws s3 cp output.txt s3://my_bucket/my_folder/".
  • The result of your code appears among the other log lines and should look like this:

And: 2
on: 1
then: 1
Aberbrothok: 2
bell: 1
that: 1
of: 2
knew: 1
Had: 1
placed: 1
Abbot: 2
they: 1
worthy: 1
blest: 1
Rock: 2
Inchcape: 1
the: 3
The: 1
perilous: 1

  • You're encouraged to play around with the code, check out the documentation and try things out
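
One small thing to try: since the counts are mixed in with other output, a few lines of Python can pull them back out of output.txt. This is a sketch that assumes every result line has the exact form `word: count` shown above; the names `parse_counts` and `top_words` are made up for illustration.

```python
import re

# Matches lines such as "the: 3": one word without spaces, a colon, a space, digits.
COUNT_LINE = re.compile(r"^(\S+): (\d+)$")


def parse_counts(lines):
    """Collect 'word: count' lines into a dict, skipping everything else."""
    counts = {}
    for line in lines:
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def top_words(counts, n=5):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

For example, `top_words(parse_counts(open("output.txt")))` lists the five most frequent words.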

Terminate the cluster

  • Don't forget to terminate your cluster when you're done.
  • Follow the same steps each time you create a new cluster, except for creating the private key for SSH: the same private key works for all your clusters.
  • Allow inbound SSH traffic on the master node again whenever your machine's IP address changes, which can happen when you switch between WiFi networks.
