SparkSQL on Amazon Elastic MapReduce (EMR)

Hey there, thanks for putting this together. 

As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.

I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results like so:
```
> python scripts/emr_submit.py 
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788
```

Overall, the results are pretty comparable to the dask results in your blog post - running on 32 m6i.xlarge instances this job completed in 32 minutes (incl. provisioning) for a total cost of ~$2.27 on spot instances:
```
32 machines * 32 minutes * ($0.085 spot rate + 0.048 EMR premium)/machine-hr * 1hr/60min = $2.2698
```
With larger/more machines this would probably be proportionally faster as this job is _almost_ entirely parallelizable.

I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.

With some more time, I think it would be interesting to try:
 - pre-loading the dataset to cluster-local HDFS (to reduce the time spent downloading data)
 - re-running with an increased value of `spark.sql.files.maxPartitionBytes` in order to reduce scheduling / task overhead.
 
 Anyways, let me know what you think, or if you've got other suggestions for improving this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SparkSQL on Amazon Elastic MapReduce (EMR) #5

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SparkSQL on Amazon Elastic MapReduce (EMR) #5

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions