- Project Overview
- Project Software Stack
- Run the Project
- File Structure
- Software Requirements
- Conclusion
- Links
The motivation behind this project is to classify disaster messages into categories. Through a web app, the user can input a new disaster message and get classification results in several categories. With this classification, help can be organized in an efficient way.
In the “Disaster Response Pipeline” project, I will apply data engineering and machine learning to analyze disaster data provided by Figure Eight and Udacity and build an ML classifier model that classifies disaster messages from social media and news.
The 'data' directory contains real messages that were sent during disaster events. I will create a machine learning pipeline to categorize these events so that the messages can be routed to the appropriate disaster relief agencies.
The project data includes 26,248 messages, each with a unique id. The ML model categorizes each message into one or more of 36 categories.
This project will include a web app where an emergency worker can input a new message and get classification results in several categories. The web app will also display visualizations of the data.
The software stack of this project contains three main parts:
The file /data/process_data.py contains the data cleaning pipeline (a sketch of these steps follows the list):
- Loads the 'disaster_messages' and 'disaster_categories' datasets
- Merges the two datasets into one
- Cleans the data in the combined data frame
- Stores the data in the SQLite database “DisasterResponse.db”
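For illustration, here is a minimal sketch of these four steps. It assumes the project's CSV layout (a shared `id` column and a single semicolon-separated `categories` string) and a SQLite table name of 'DisasterResponse'; the authoritative implementation lives in /data/process_data.py.

```python
import pandas as pd
from sqlalchemy import create_engine

# Load both datasets and merge them on the shared 'id' column
messages = pd.read_csv('data/disaster_messages.csv')
categories = pd.read_csv('data/disaster_categories.csv')
df = messages.merge(categories, on='id')

# Split the single 'categories' string ("related-1;request-0;...")
# into 36 separate 0/1 columns
cats = df['categories'].str.split(';', expand=True)
cats.columns = [value.split('-')[0] for value in cats.iloc[0]]
for column in cats.columns:
    cats[column] = cats[column].str[-1].astype(int)

# Replace the raw column, drop duplicates, and store in SQLite
df = pd.concat([df.drop(columns='categories'), cats], axis=1)
df = df.drop_duplicates()
engine = create_engine('sqlite:///data/DisasterResponse.db')
df.to_sql('DisasterResponse', engine, index=False, if_exists='replace')
```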
The file /models/train_classifier.py contains the machine learning pipeline (a sketch follows the list):
- Loads data from the SQLite database “DisasterResponse.db”
- Splits the data into train and test data sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs classification metrics on the test set
- Exports the final model as a pickle file
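For illustration, a minimal sketch of such a text-processing and ML pipeline. The table name, the assumption that the category columns start at the fifth column, the random-forest estimator, and the small parameter grid are all placeholders; the authoritative version is in /models/train_classifier.py.

```python
import pickle
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the cleaned data from the SQLite database
engine = create_engine('sqlite:///data/DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)
X = df['message']
Y = df.iloc[:, 4:]  # assumed: the 36 category columns

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Text processing + multi-output classifier in one pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# Tune a small parameter grid with GridSearchCV
parameters = {'clf__estimator__n_estimators': [50, 100]}
model = GridSearchCV(pipeline, param_grid=parameters, cv=3)
model.fit(X_train, Y_train)

# Report metrics on the test set and export the final model
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred, target_names=list(Y.columns)))
with open('models/classifier.pkl', 'wb') as f:
    pickle.dump(model, f)
```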
Running the start command (see section 3, "Run the Project") from the app directory will start the web app. Users can enter a query, i.e., a request message sent during a natural disaster, e.g., "We need more water and food in New York!".
Screenshot 1
The app classifies the newly entered text message into categories so that the appropriate relief agency can be contacted for help.
Starting with the ETL process, three steps are necessary to get the web app in place and use the tool.
Screenshot 2
Run the following commands in the project's root directory to set up your database and model.
To run the ETL pipeline that cleans the data and stores it in the database:

```
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
```

The first two arguments are the input data files, and the third argument is the SQLite database in which the cleaned data is saved. The ETL pipeline is implemented in process_data.py.
DisasterResponse.db already exists in the project's root directory, but the above command will still run and replace the file with the same information.
After the data cleaning process, run the following command from the project's root directory to train and save the ML classifier:

```
python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
```

This uses the cleaned data to train the ML model, improves the model with grid search, and saves it to a pickle file (classifier.pkl).
classifier.pkl already exists, but the above command will still run and replace the file with new information.
After data cleaning and creation of the ML model, the model can be used to classify new messages directly in the web app interface. Go to the app directory and run the following command to start the web app:

```
python run.py
```
Go to http://0.0.0.0:3001/
Here you can enter messages and get classification results for them.
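For orientation, a simplified sketch of how run.py might wire the pieces together. The route name, template fields, table name, and relative paths below are assumptions, not the app's confirmed layout:

```python
import pickle
import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the cleaned data and the trained model once at startup
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)
with open('../models/classifier.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/go')
def go():
    # Read the user's message from the query string and classify it
    query = request.args.get('query', '')
    labels = model.predict([query])[0]
    # Map the 36 predicted 0/1 labels back to their category names
    classification_results = dict(zip(df.columns[4:], labels))
    return render_template('go.html', query=query,
                           classification_result=classification_results)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3001, debug=True)
```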
Screenshot 3 "Frontend"
Screenshot 4 "Backend"
```
.
├── app
│   ├── run.py ------------------------# FLASK FILE THAT RUNS APP
│   └── templates
│       ├── go.html -------------------# CLASSIFICATION RESULT PAGE
│       └── master.html ---------------# MAIN PAGE OF WEB APP
├── data
│   ├── disaster_categories.csv -------# DATA TO PROCESS
│   ├── disaster_messages.csv ---------# DATA TO PROCESS
│   └── process_data.py ---------------# PERFORMS ETL PROCESS
├── images ----------------------------# PLOTS AND SCREENSHOTS
├── models
│   ├── classifier.pkl ----------------# ML MODEL
│   └── train_classifier.py -----------# PERFORMS CLASSIFICATION TASK
└── DisasterResponse.db ---------------# DATABASE TO SAVE CLEANED DATA
```
The project uses Python 3.7 and the following additional libraries:
- pandas
- numpy
- sys
- time
- collections
- json
- re
- warnings
- operator
- pickle
- pprint
- flask
- nltk
- plotly
- scikit-learn
- SQLAlchemy
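A minimal way to install the third-party dependencies with pip (versions are not pinned here; the standard-library modules in the list above ship with Python):

```
pip install pandas numpy flask nltk plotly scikit-learn SQLAlchemy
```

Depending on the tokenizer used in train_classifier.py, NLTK corpora such as punkt or wordnet may also need to be downloaded via nltk.download().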
Classification metrics for the ML model:
- Accuracy: 0.95 (accuracy is the fraction of samples predicted correctly)
- Recall: 0.64 (also known as sensitivity; the fraction of positive events that were predicted correctly)
- F1-score: 0.70 (the harmonic mean of recall and precision; a higher score means a better model)
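As a quick reference, these metrics can be reproduced for any prediction with scikit-learn; the label arrays below are made-up toy values, not the project's results:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Toy binary labels for a single category, purely illustrative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

print('Accuracy:', accuracy_score(y_true, y_pred))  # fraction predicted correctly
print('Recall:  ', recall_score(y_true, y_pred))    # fraction of positives found
print('F1-score:', f1_score(y_true, y_pred))        # harmonic mean of precision and recall
```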
You will see the exact values directly in the command line after the model is trained by grid search.
Although the accuracy metric is high, the recall value is poor. This ML model is not yet ready for production, or the input data is not sufficient. For this showcase, however, the data and the ML model worked well!
Screenshot 5
Classification Metrics:
Link to understanding Data Science Classification Metrics in Scikit-Learn in Python
Link regarding fine-tuning a classifier in scikit-learn
Help:
Ideas, Help and Templates:
This project was completed as part of the Udacity Data Scientist Nanodegree. Code templates and data were provided by Udacity. The data was originally sourced by Udacity from Figure Eight.