Machine Learning Tutorial in Pyspark ML Library

Info.

This document shows how to run machine learning with the PySpark ML library.
It is based on PySpark version 2.1.0 (Python 2.7).
Before Spark 2, pyspark.mllib was the main module for ML, but it has since entered maintenance mode.
As of Spark 2, pyspark.ml is the main module.
Therefore, this document is based on the pyspark.ml module.

Dataset

Description: a dataset containing the target variable (Default) and the features
Rows: 10000
Columns (type): Default (bool) / Student (bool) / Balance (double) / Income (double)
Problem type: Binary classification
Target variable: Default (skewed, i.e. the two classes are imbalanced)
Features: Student, Balance, Income

Pipeline

API docs: http://takwatanabe.me/pyspark/index.html
Overview: https://spark.apache.org/docs/latest/ml-guide.html
The guide covers featurization, pipelines, persistence, and utilities.
The DataFrame-based API (spark.ml) is the primary API.
The RDD-based spark.mllib package is in maintenance mode; at the time of writing, Spark planned to deprecate it around Spark 3.0.
Because DataFrames (with Pipeline) are more user-friendly, this data type is expected to be used more frequently.

Pipeline components: Transformer, Estimator, Parameter

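- Transformer: converts one DataFrame into another, e.g. scaling, linear transformation, vectorizing, prediction
- Estimator: learns from the data via a modeling algorithm and produces a fitted model
- Parameter: a setting of a Transformer or Estimator, e.g. regularization, the number of iterations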
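
The relationship between the three components can be sketched in plain Python. This is only an illustration of the fit/transform contract: `ZScoreEstimator`, `ZScoreModel`, and `with_mean` are hypothetical names, not PySpark classes, and the toy uses the population standard deviation for simplicity, so it is not claimed to match any PySpark stage numerically.

```python
from statistics import mean, pstdev


class ZScoreModel:
    """Transformer: holds learned statistics and maps values to z-scores."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def transform(self, values):
        # Produces new data from the input; the input is left unchanged.
        return [(v - self.mu) / self.sigma for v in values]


class ZScoreEstimator:
    """Estimator: learns statistics from data and returns a fitted transformer."""

    def __init__(self, with_mean=True):
        # Parameter: a setting chosen before fitting, not learned from the data.
        self.with_mean = with_mean

    def fit(self, values):
        mu = mean(values) if self.with_mean else 0.0
        sigma = pstdev(values)
        return ZScoreModel(mu, sigma)


# Made-up balances, for illustration only.
balances = [200.0, 400.0, 600.0, 800.0]
model = ZScoreEstimator(with_mean=True).fit(balances)  # Estimator -> Transformer
scaled = model.transform(balances)                     # Transformer -> new data
print(scaled)
```

In pyspark.ml the same pattern applies at DataFrame level: calling `fit` on an Estimator returns a model that is itself a Transformer.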

Pipeline stage examples
- StringIndexer: converts string values to numeric indices
- OneHotEncoder: converts a categorical variable to dummy variables
- VectorAssembler: combines several columns into a single vector column
- StandardScaler: standardizes the original values to z-scores
- LinearRegression: a standard model for predicting continuous values

Learn the dataset with an algorithm

Reference: https://spark.apache.org/docs/latest/ml-classification-regression.html
Classification
- Logistic regression

Evaluation / Parameter Tuning

Model selection (a.k.a. hyperparameter tuning)
- Cross-Validation
- Train-Validation Split
