
Project: Disaster Response Pipeline

Table of Contents

  1. Project Overview
  2. Project Software Stack
  3. Run the Project
  4. File Structure
  5. Software Requirements
  6. Conclusion
  7. Links

1. Project Overview

The motivation behind this project is to classify disaster messages into categories. Through a web app, a user can input a new disaster message and receive classification results across several categories, so that relief efforts can be organized efficiently.

In the “Disaster Response Pipeline” project, I apply data engineering and machine learning to disaster data provided by Figure Eight and Udacity to build a model that classifies disaster messages from social media and news.

The 'data' directory contains real messages that were sent during disaster events. I will create a machine learning pipeline to categorize these messages so that they can be routed to the appropriate disaster relief agencies.

The project data includes 26,248 messages, each with a unique id. The ML model classifies each message into one or more of 36 categories.

This project will include a web app where an emergency worker can input a new message and get classification results in several categories. The web app will also display visualizations of the data.

2. Project Software Stack

The software stack of this project contains three main parts:

2.1. ETL Pipeline

File /data/process_data.py contains the data cleaning pipeline:

  • Loads the 'disaster_messages' and 'disaster_categories' datasets
  • Merges the two datasets into one
  • Cleans the data in the combined data frame
  • Stores the data in a SQLite database, “DisasterResponse.db”

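The cleaning steps above can be sketched with pandas. This is a minimal sketch, not the repository's exact code: the column layout, the semicolon-separated category format, and the table name `messages` are assumptions based on the description, and tiny inline data frames stand in for the two CSV files.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature versions of disaster_messages.csv and
# disaster_categories.csv, matching the layout described above.
messages = pd.DataFrame({
    "id": [1, 2],
    "message": ["We need water", "Roads are blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["related-1;request-1;water-1",
                   "related-1;request-0;water-0"],
})

# Merge the two datasets on their shared id.
df = messages.merge(categories, on="id")

# Expand the category string into one binary column per category
# (e.g. "water-1" -> column "water" with value 1).
cats = df["categories"].str.split(";", expand=True)
cats.columns = [c.split("-")[0] for c in cats.iloc[0]]
for col in cats.columns:
    cats[col] = cats[col].str.split("-").str[1].astype(int)

# Clean the combined frame and store it in the SQLite database.
df = pd.concat([df.drop(columns="categories"), cats], axis=1)
df = df.drop_duplicates()
with sqlite3.connect("DisasterResponse.db") as conn:
    df.to_sql("messages", conn, if_exists="replace", index=False)
```

The real script performs additional cleaning (e.g. handling malformed category values), but the merge/expand/store shape is the same.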
2.2. ML Pipeline

File /models/train_classifier.py contains the machine learning pipeline:

  • Loads data from the SQLite database “DisasterResponse.db”
  • Splits the data into train and test data sets
  • Builds a text processing and machine learning pipeline
  • Trains and tunes a model using GridSearchCV
  • Reports evaluation results on the test set
  • Exports the final model as a pickle file
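The pipeline steps above can be sketched as follows. This is an illustrative sketch, not the repository's exact code: the estimators (TfidfVectorizer, LogisticRegression) and the one-parameter grid are stand-ins, and a tiny inline dataset replaces the data loaded from the SQLite database.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Tiny stand-in data; the real script loads messages and 36 binary
# category columns from DisasterResponse.db and uses train_test_split.
X_train = ["we need water", "send food please", "water supply is low",
           "food shortage here", "no clean water left", "people lack food"]
y_train = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]  # [water, food]
X_test = ["water is scarce", "food supplies gone"]

# Text processing and multi-output classification in one pipeline.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=200))),
])

# Tune one hyperparameter with grid search (the real grid is larger).
params = {"tfidf__ngram_range": [(1, 1), (1, 2)]}
model = GridSearchCV(pipeline, params, cv=2)
model.fit(X_train, y_train)

# Export the tuned model as a pickle file.
with open("classifier.pkl", "wb") as f:
    pickle.dump(model.best_estimator_, f)
```

Wrapping a single-label estimator in MultiOutputClassifier is what lets one model emit a prediction for each of the 36 categories at once.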

2.3. Flask Web App

Running the start command (see section 3) from the app directory will start the web app. Users can enter a query, i.e., a request message sent during a natural disaster, e.g. "We need more water and food in New York!".

Screenshot 1


The app classifies the entered message into categories so that the appropriate relief agencies can be contacted.
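The request/classify/respond flow can be sketched with a minimal Flask route. This is a simplified stand-in: the real run.py loads models/classifier.pkl and renders the go.html template, whereas here a hypothetical keyword model and a JSON response keep the sketch self-contained.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

CATEGORIES = ["water", "food"]  # the real model predicts 36 categories

class KeywordModel:
    """Hypothetical stand-in: flags a category if its name appears."""
    def predict(self, texts):
        return [[c in t.lower() for c in CATEGORIES] for t in texts]

model = KeywordModel()  # run.py instead unpickles the trained classifier

@app.route("/go")
def go():
    # Read the user's message from the query string, classify it,
    # and return one label per category.
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    return jsonify({"query": query,
                    "classification": dict(zip(CATEGORIES, labels))})
```

A request such as `/go?query=We need more water` would come back with the `water` category flagged.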

3. Run the Project

Three steps, starting with the ETL process, are necessary to get the web app in place and use the tool.

Screenshot 2


3.1. Data Cleaning

Run the following commands in the project's root directory to set up your database and model.

To run the ETL pipeline, which cleans the data and stores it in the database:

python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db

The first two arguments are the input data files; the third is the SQLite database in which the cleaned data is saved. The ETL pipeline is implemented in process_data.py.

DisasterResponse.db already exists in the project's root directory, but the above command will still run and replace the file with the same information.
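The script's handling of these three arguments might look like the following; this is a hypothetical reconstruction based on the usage above, not the repository's exact code.

```python
def parse_args(argv):
    # process_data.py expects exactly three arguments: the messages CSV,
    # the categories CSV, and the output SQLite database path.
    if len(argv) != 4:
        raise SystemExit(
            "Usage: python data/process_data.py "
            "MESSAGES_CSV CATEGORIES_CSV DATABASE_PATH")
    return argv[1], argv[2], argv[3]

# In the script itself this would be called with sys.argv.
paths = parse_args(["process_data.py",
                    "data/disaster_messages.csv",
                    "data/disaster_categories.csv",
                    "data/DisasterResponse.db"])
```

Exiting with a usage message on a wrong argument count is what produces the command-line help when the script is invoked incorrectly.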

3.2. Training ML Classifier

After the data cleaning process, run the following command from the project's root directory to run the ML pipeline, which trains and saves the classifier:

python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl

This uses the cleaned data to train the ML model, improves the model with grid search, and saves it to a pickle file (classifier.pkl).

classifier.pkl already exists, but the above command will still run and replace the file with new information.

3.3. Run the Web App

After data cleaning and model training, the model is used to predict categories for new messages directly in the web app interface. Go to the app directory and run the following command:

python run.py

This will start the web app.

Go to http://0.0.0.0:3001/

Here you can enter messages and get classification results for them.

Screenshot 3 "Frontend"


Screenshot 4 "Backend"


4. File Structure

.
├── app
│   ├── run.py------------------------# FLASK FILE THAT RUNS APP
│   └── templates
│       ├── go.html-------------------# CLASSIFICATION RESULT PAGE
│       └── master.html---------------# MAIN PAGE OF WEB APP
├── data
│   ├── disaster_categories.csv-------# DATA TO PROCESS
│   ├── disaster_messages.csv---------# DATA TO PROCESS
│   └── process_data.py---------------# PERFORMS ETL PROCESS
├── images ---------------------------# PLOTS and SCREENSHOTS
├── models
│   ├── classifier.pkl----------------# SAVED ML MODEL
│   └── train_classifier.py-----------# TRAINS THE ML MODEL
├── DisasterResponse.db---------------# DATABASE TO SAVE CLEANED DATA

5. Software Requirements

The project uses Python 3.7. Besides standard-library modules (sys, time, collections, json, re, warnings, operator, pickle, pprint), it requires the following libraries:

  • pandas
  • numpy
  • flask
  • nltk
  • plotly
  • scikit-learn
  • SQLAlchemy

6. Conclusion

Classification metrics for the ML model:

  • Accuracy: 0.95 (the fraction of samples predicted correctly)

  • Recall: 0.64 (also known as sensitivity; the fraction of positive events that were predicted correctly)

  • F1-score: 0.70 (the harmonic mean of precision and recall; a higher score means a better model)
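These figures can be reproduced per category column with scikit-learn's metric functions. The predictions below are illustrative stand-ins, not the project's actual output; they simply show how each metric is computed for one binary category.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical true labels and predictions for one category column;
# the real script averages these metrics over all 36 columns.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)  # fraction of samples predicted correctly
rec = recall_score(y_true, y_pred)    # fraction of actual positives found
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
```

With these stand-in labels all three metrics happen to be 0.75: 6 of 8 predictions match, 3 of 4 true positives are found, and precision is likewise 3 of 4.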

You will see the exact values directly in the command line after the model is trained by grid search.

Although accuracy is high, recall is comparatively poor. The model is therefore not yet ready for production, and more training data may be needed; for this showcase, however, the data and the model serve their purpose well.

Screenshot 5


7. Links

Classification Metrics:

Link to understanding Data Science Classification Metrics in Scikit-Learn in Python

Link regarding fine-tuning a classifier in scikit-learn

Help:

Stopwords

Normalize Text

GridSearchCV

Pickle

Ideas, Help and Templates:

GitHub from Sanjeev Yadav

GitHub from Genevieve Hayes

This project was completed as part of the Udacity Data Scientist Nanodegree. Code templates and data were provided by Udacity. The data was originally sourced by Udacity from Figure Eight.

Udacity Data Scientist Nanodegree
