Data Cleansing

Objective

Learn how to clean up messy data.

Reference

Essential Reading

Data Cleansing

Handling Missing Data

Extra Reading

Checklist

After completing the exercises below, you should be comfortable with:

  • Exploring a dataset to figure out whether the data needs cleansing
  • Identifying missing data
  • Identifying invalid data
  • Cleansing data
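
As a rough illustration of the "identify" items in the checklist, here is a minimal sketch using pandas. The sample values are made up for this example, and the rule "rainfall cannot be negative" is an assumption about what counts as invalid.

```python
import pandas as pd

# Hypothetical sample: "?" marks a missing reading, -5 is an impossible value.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "rainfall": ["10", "?", "-5", "20"],
})

# Convert text to numbers; anything unparseable (such as "?") becomes NaN.
df["rainfall"] = pd.to_numeric(df["rainfall"], errors="coerce")

# Missing data: count the NaN entries.
missing_count = int(df["rainfall"].isna().sum())

# Invalid data (assumed rule): rainfall below zero.
invalid_count = int((df["rainfall"] < 0).sum())

print(f"missing: {missing_count}, invalid: {invalid_count}")
```

Which values count as invalid depends on the dataset, so the rule should be adapted for each column.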

Exercises

Difficulty Level

★☆☆ - Easy
★★☆ - Medium
★★★ - Challenging
★★★★ - Bonus

A - Handling Missing Data

A1 - Handling Missing Data

We have the following data. What would be a good way to handle missing data? Please discuss the following choices:

  • Drop the rows with missing data
  • Substitute zero for missing values
  • Substitute another value for missing values (if so, what is that value?)

year     month   rainfall
2019     Jan     10
2019     Feb     12
2019     Mar     ?
2019     Apr     20
2019     May     ?
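
To support the discussion, the choices above can be tried out in pandas. This is only a sketch: the mean and linear interpolation are two possible "other values", not the one intended answer, and the variable names are made up for this example.

```python
import pandas as pd

# The rainfall table from the exercise; None stands for the "?" entries.
rain = pd.DataFrame({
    "year": [2019] * 5,
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "rainfall": [10, 12, None, 20, None],
})

# Choice 1: drop the rows with missing rainfall.
dropped = rain.dropna(subset=["rainfall"])

# Choice 2: substitute zero (this claims it did not rain at all).
zero_filled = rain["rainfall"].fillna(0)

# Choice 3a: substitute the mean of the known values.
mean_filled = rain["rainfall"].fillna(rain["rainfall"].mean())

# Choice 3b: interpolate between neighbouring months (the months are ordered).
# A trailing gap has no right-hand neighbour, so pandas carries the last
# known value forward.
interpolated = rain["rainfall"].interpolate()

print(pd.DataFrame({
    "month": rain["month"],
    "zero": zero_filled,
    "mean": mean_filled,
    "interpolated": interpolated,
}))
```

Comparing the columns side by side shows how much each choice changes the data, which is a useful starting point for the discussion.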

A2 - Handling Missing Data

How will you handle missing data in this dataset?

Person   Height_cm
A         180
B         ?
C         172
D         155
E         160
F         ?
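
Unlike monthly rainfall, these rows have no natural order, so interpolating between neighbours does not make sense here. One option to discuss is filling with a central value such as the median. The sketch below assumes pandas; the `was_missing` and `Height_filled` column names are made up for this example.

```python
import pandas as pd

# The height table from the exercise; None stands for the "?" entries.
heights = pd.DataFrame({
    "Person": list("ABCDEF"),
    "Height_cm": [180, None, 172, 155, 160, None],
})

# Remember which rows were originally missing before filling them.
heights["was_missing"] = heights["Height_cm"].isna()

# Fill the gaps with the median of the known heights.
median_height = heights["Height_cm"].median()
heights["Height_filled"] = heights["Height_cm"].fillna(median_height)

print(heights)
```

Keeping a flag for the filled rows makes it possible to exclude them later if the substituted values distort an analysis. Dropping the two rows is also a reasonable choice when only a few rows are affected.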

Exercises

EX-1 : Clean up data (college admission) (★☆☆)

  • Start with this notebook
  • Complete the TODO items

EX-2 : Cleaning up House Sales Data (★★☆)

  • Read the house-sales-simplified.csv file
  • Identify columns that need data cleanup
    Hint: Zipcode
  • Convert SaleDate to an actual date type
  • Draw a bar plot of houses sold per year
  • What percentage of the data is clean?
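
The steps above could be approached along these lines. This is a sketch, not the reference solution: the inline rows are made up so the example runs without the real file, `SalePrice` is a hypothetical column, and the rule "a valid zipcode is a 5-digit number" is an assumption. Only the `SaleDate` and `Zipcode` names come from the exercise.

```python
import io

import pandas as pd

# Hypothetical stand-in for house-sales-simplified.csv (made-up rows).
sample_csv = io.StringIO(
    "SaleDate,Zipcode,SalePrice\n"
    "2014-01-15,98001,300000\n"
    "2014-06-30,-1,450000\n"
    "2015-03-02,98052,\n"
    "2015-11-20,98103,520000\n"
)
sales = pd.read_csv(sample_csv)

# Convert SaleDate from text to a real datetime column.
sales["SaleDate"] = pd.to_datetime(sales["SaleDate"])

# Assumed validity rule: a zipcode is a 5-digit number.
valid_zip = sales["Zipcode"].between(10000, 99999)

# Houses sold per year, ordered by year (the input for a bar plot).
sold_per_year = sales["SaleDate"].dt.year.value_counts().sort_index()

# A row counts as clean if its zipcode is valid and no value is missing.
clean_rows = valid_zip & sales.notna().all(axis=1)
percent_clean = 100 * clean_rows.mean()

print(sold_per_year)
print(f"{percent_clean:.1f}% of rows are clean")
```

With the real data, replace `sample_csv` with the path to the CSV file. If matplotlib is installed, `sold_per_year.plot.bar()` draws the bar plot.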

More Exercises