luketchang/go-map-reduce

Go-MapReduce-Framework

Summary

  • Overview: Distributed MapReduce framework that runs the server and workers on Google Cloud VMs
  • Server:
    • Reads arguments and the supplied configuration file for initialization parameters (e.g. input/output directories, custom mapper/reducer scripts, number of mappers/reducers)
    • Starts listening on the designated port
    • Spawns mappers, responding to mapper requests with new input files (from shared_files/input) and acknowledging mapper messages
    • Spawns reducers, responding to reducer requests with new intermediate files and acknowledging reducer messages
  • Mappers:
    • Repeatedly requests input from server
    • Runs user-supplied custom mapper script on received input file
    • Buckets the key-value pairs of each mapped file into separate files (in shared_files/intermediate) based on the hash of each key
    • Notifies server of progress and final job status for each input
  • Reducers:
    • Repeatedly requests intermediate files from server
    • Runs sort and 'group-by-key' script on received intermediate files
    • Runs user-supplied custom reducer script on sorted/grouped file (reduced files output to shared_files/output)
    • Notifies server of progress and final job status for each input
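
The request/acknowledge cycle described above can be pictured as a small job tracker on the server. The following is only a sketch under assumptions: `JobQueue`, `Next`, `Report`, and `Finished` are hypothetical names that do not come from the repository, and the network layer (listening on a port, talking to workers) is left out entirely. It shows one way to hand out files on request and put a file back in the queue when a worker reports failure.

```go
package main

import "sync"

// JobQueue is a hypothetical in-memory tracker for files awaiting processing.
// It is not taken from the repository; it only illustrates the idea.
type JobQueue struct {
	mu       sync.Mutex
	pending  []string        // files not yet handed to a worker
	inFlight map[string]bool // files handed out but not yet reported
	done     map[string]bool // files reported as successfully processed
}

// NewJobQueue copies the file list so the caller's slice is never modified.
func NewJobQueue(files []string) *JobQueue {
	return &JobQueue{
		pending:  append([]string(nil), files...),
		inFlight: make(map[string]bool),
		done:     make(map[string]bool),
	}
}

// Next answers a worker request with the next pending file.
// ok is false when nothing is currently pending.
func (q *JobQueue) Next() (file string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return "", false
	}
	file = q.pending[0]
	q.pending = q.pending[1:]
	q.inFlight[file] = true
	return file, true
}

// Report records a worker's final status for a file.
// A failed file goes back to the pending list so it can be rescheduled.
func (q *JobQueue) Report(file string, succeeded bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if !q.inFlight[file] {
		return // ignore reports for files that were never handed out
	}
	delete(q.inFlight, file)
	if succeeded {
		q.done[file] = true
	} else {
		q.pending = append(q.pending, file)
	}
}

// Finished reports whether every file has been processed successfully.
func (q *JobQueue) Finished() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending) == 0 && len(q.inFlight) == 0
}
```

A server loop built on this would call `Next` for each incoming worker request and `Report` for each final status message, moving from the map phase to the reduce phase once `Finished` returns true.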

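The mapper-side bucketing and reducer-side sort and group-by-key steps can be illustrated in memory. This is a sketch, not the repository's code: the framework works on files and runs scripts, whereas the example below uses slices, and the names `KeyValue`, `Group`, `bucketFor`, `partition`, and `groupByKey` are hypothetical. The choice of FNV-1a as the hash is also an assumption, since the README does not say which hash is used.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// KeyValue is one mapped pair. Hypothetical type for illustration.
type KeyValue struct{ Key, Value string }

// Group holds every value seen for a single key.
type Group struct {
	Key    string
	Values []string
}

// bucketFor picks a bucket from the key's hash, so a given key always
// lands in the same bucket. FNV-1a is an assumed hash choice.
// numBuckets must be greater than zero.
func bucketFor(key string, numBuckets int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numBuckets))
}

// partition splits mapped pairs into numBuckets slices by key hash.
func partition(pairs []KeyValue, numBuckets int) [][]KeyValue {
	buckets := make([][]KeyValue, numBuckets)
	for _, kv := range pairs {
		b := bucketFor(kv.Key, numBuckets)
		buckets[b] = append(buckets[b], kv)
	}
	return buckets
}

// groupByKey sorts pairs by key and collects the values of each key.
// The sort is stable, so values keep their original relative order.
func groupByKey(pairs []KeyValue) []Group {
	sorted := make([]KeyValue, len(pairs))
	copy(sorted, pairs)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Key < sorted[j].Key
	})
	var groups []Group
	for _, kv := range sorted {
		n := len(groups)
		if n == 0 || groups[n-1].Key != kv.Key {
			groups = append(groups, Group{Key: kv.Key})
			n++
		}
		groups[n-1].Values = append(groups[n-1].Values, kv.Value)
	}
	return groups
}
```

Because the bucket depends only on the key, all pairs for one key end up in the same bucket, and grouping that bucket gives a reducer the complete list of values for the key.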
Design Questions/Decisions

  • Mapping and hash-bucketing are combined in the mapper. Splitting the work across 3 separate entities (mapper, bucketer, and reducer) would better adhere to the Single Responsibility Principle
  • If bucketers were separate entities running concurrently with mappers, each bucketer could start its job as soon as a new mapped file became available, which should give higher performance
  • Google Filestore is used for shared storage across VMs, but basic Filestore I/O ended up being extremely slow. Likely fixes are switching to a different shared storage option or having workers send locally processed files back to the server instead of using shared storage

Server Logs

Todo

  • Add automated tests for verifying consistency of outputs of remote commands
  • Add thorough error checking in argument and config parsers
  • Do a better job of surfacing error codes to the server in functions called by remote executables
  • Add builtin remote failures to test job rescheduling functionality
  • Improve readability of util Contains function
  • Fix network access and permissions among Google Cloud VMs
  • Switch from Filestore to an alternative with faster shared I/O
  • Add user CLI functionality for uploading custom input and scripts to VMs

About

Distributed MapReduce implementation written in Go
