SkewDebug is a tool that detects source code locations in a Spark pipeline, making it easier to fix data, memory, and computation skews.
The tool is implemented as a class in src/main/scala/SkewDetection/
Steps to use the tool:
- Import the class
- Create the SkewDebug object
- Pass your SparkContext as a constructor argument to the SkewDebug class when creating the object
- After your pipeline implementation, call the printlog function on the SkewDebug object
- Run your pipeline
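
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's own example: the import path `SkewDetection.SkewDebug` is assumed from the directory name, `printlog` is assumed to take no arguments, and the object name `ExamplePipeline`, the local master setting, and the small word-count-style pipeline are hypothetical placeholders for your own code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Assumption: the package name matches the directory src/main/scala/SkewDetection/
import SkewDetection.SkewDebug

// Hypothetical driver object; replace with your own pipeline.
object ExamplePipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ExamplePipeline")
      .setMaster("local[*]") // assumption: local run; drop this when submitting to a cluster
    val sc = new SparkContext(conf)

    // Create the SkewDebug object, passing the SparkContext to its constructor.
    val skewDebug = new SkewDebug(sc)

    // Placeholder pipeline: count rows per value of the first CSV column.
    val counts = sc.textFile("data/ticket_flights.csv")
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)
    counts.count() // an action, so the pipeline actually executes

    // After the pipeline, call printlog on the SkewDebug object.
    // Assumption: printlog takes no arguments.
    skewDebug.printlog()

    sc.stop()
  }
}
```

Check the class in src/main/scala/SkewDetection/ for the actual package name and the signature of printlog before copying this sketch.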
An example pipeline that demonstrates how the tool works is in: src/main/scala/hc/PipeLine
The example reads the ticket_flights.csv file from the data folder, so change the dataset location in the code to match where the file is on your machine.