Hey there, thanks for putting this together.
As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. The repo includes all of the steps for building and running the job locally as well as submitting it to EMR.
Using the included EMR submission script (with EMR running Spark 3.4.1), I was able to verify the results like so:
> python scripts/emr_submit.py
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788
Overall, the results are pretty comparable to the Dask results in your blog post - running on 32 m6i.xlarge instances, this job completed in 32 minutes (incl. provisioning) for a total cost of ~$2.27 on spot instances:
32 machines * 32 minutes * ($0.085 spot rate + $0.048 EMR premium)/machine-hr * 1hr/60min ≈ $2.27
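For anyone who wants to plug in their own cluster size or rates, here is the same arithmetic as a small Python sketch. The function and parameter names are just for illustration (they aren't from the repo), and it assumes every machine is billed for the full wall-clock duration at a flat hourly rate:

```python
def cluster_cost(machines, minutes, spot_rate_per_hr, emr_premium_per_hr):
    """Total cost in dollars, assuming each machine is billed for the whole
    run at a flat (spot rate + EMR premium) hourly price."""
    hours = minutes / 60
    return machines * hours * (spot_rate_per_hr + emr_premium_per_hr)


# Numbers from the run above: 32 m6i.xlarge instances for 32 minutes.
cost = cluster_cost(
    machines=32,
    minutes=32,
    spot_rate_per_hr=0.085,
    emr_premium_per_hr=0.048,
)
print(f"${cost:.2f}")  # prints $2.27
```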
With larger/more machines this would probably be proportionally faster as this job is almost entirely parallelizable.
I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.
With some more time, I think it would be interesting to try:
- pre-loading the dataset to cluster-local HDFS (to reduce the time spent downloading data)
- re-running with an increased value of spark.sql.files.maxPartitionBytes in order to reduce scheduling / task overhead.
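To give a rough feel for the second idea, here's a back-of-the-envelope sketch of how the number of scan tasks shrinks as spark.sql.files.maxPartitionBytes grows. It's only an approximation: it assumes the task count is roughly the input size divided by the partition size, and ignores file boundaries, the per-file open cost, and anything else Spark factors in. The 2 TiB input size is a made-up placeholder, not the measured size of this dataset; the 128 MiB default is Spark's documented default for this setting.

```python
import math

# Spark's documented default for spark.sql.files.maxPartitionBytes (128 MiB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024**2

# Hypothetical input size, for illustration only (NOT the real dataset size).
HYPOTHETICAL_INPUT_BYTES = 2 * 1024**4  # 2 TiB


def estimated_scan_tasks(total_input_bytes,
                         max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Rough estimate of file-scan tasks: input size / partition size, rounded up.

    Ignores file boundaries and per-file open cost, so treat it as a ballpark.
    """
    return math.ceil(total_input_bytes / max_partition_bytes)


for mib in (128, 256, 512, 1024):
    tasks = estimated_scan_tasks(HYPOTHETICAL_INPUT_BYTES, mib * 1024**2)
    print(f"maxPartitionBytes={mib} MiB -> ~{tasks} scan tasks")
```

The setting itself can be passed at submit time as a byte count, e.g. `--conf spark.sql.files.maxPartitionBytes=536870912` for 512 MiB. Whether fewer, larger tasks actually help here would need to be measured.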
Anyways, let me know what you think, or if you've got other suggestions for improving this.