Machine Learning Tutorial in PySpark ML Library
This document shows how to run machine learning with the PySpark ML library.
It is based on PySpark version 2.1.0 (Python 2.7).
Before Spark 2, pyspark.mllib was the main module for ML, but it has since entered maintenance mode.
As of Spark 2, pyspark.ml is the main module.
This document is therefore based on the pyspark.ml module.
Description: a dataset containing the target variable (Default) and three features
Rows: 10000
Columns (type): Default (bool) / Student (bool) / Balance (double) / Income (double)
Problem: binary classification
Target variable: Default (skewed, i.e. the two classes are imbalanced)
Features: Student, Balance, Income
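Pipeline components: Transformer, Estimator, Parameter
Transformer: maps one DataFrame to another (scaling, linear transformation, vectorization, prediction)
Estimator: learns from the data by fitting an algorithm; the fitted model it returns is a Transformer
Parameter: a setting of a Transformer or Estimator, such as the regularization strength or the number of iterations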
Pipeline Stage Examples
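StringIndexer: convert string values to numeric category indices
OneHotEncoder: convert a category index to a dummy (one-hot) vector
VectorAssembler: combine several feature columns into a single vector column
StandardScaler: scale features to unit standard deviation (and zero mean when withMean=True), i.e. z-scores
LinearRegression: a linear model for predicting real numbers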
Train a model on the dataset
Evaluation / Parameter Tuning
Model selection (a.k.a. hyperparameter tuning)
Cross-Validation
Train-Validation Split
Save & Load the Pipeline and Model