songhunhwa/MachineLearning_Pyspark

Machine Learning Tutorial with the PySpark ML Library

Info.

  • This document shows how to run machine learning workflows with the PySpark ML library.
  • It is based on PySpark 2.1.0 (Python 2.7).
  • Before Spark 2, pyspark.mllib was the main ML module, but it has since entered maintenance mode.
  • From Spark 2 onward, pyspark.ml is the main ML module.
  • This document is therefore based on the pyspark.ml module.

Dataset

  • Description: a dataset containing the target variable (Default) and the features
  • Rows: 10,000
  • Columns (type): Default (bool) / Student (bool) / Balance (double) / Income (double)
  • Task: binary classification
  • Target variable: Default (skewed)
  • Features: Student, Balance, Income

Pipeline

Pipeline components: Transformer, Estimator, Parameter

  • Transformer: transforms the data, e.g. scaling, linear transformation, vectorization, prediction
  • Estimator: learns from the data via a modeling algorithm
  • Parameter: settings such as regularization and the number of iterations
  • Pipeline stage examples
    • StringIndexer: converts strings to indices
    • OneHotEncoder: converts a categorical variable to dummy variables
    • VectorAssembler: combines columns into a single feature vector
    • StandardScaler: transforms the original values to z-scores
    • LinearRegression: a well-known model for predicting real-valued targets

Train a model on the dataset

Evaluation / Parameter Tuning

  • Model selection (a.k.a. hyperparameter tuning)
  • Cross-Validation
  • Train-Validation Split

Save & Load the Pipeline and Model
