Data Mining, ETL Project using Consumer Complaint Dataset

Consumer Complaints

Table of Contents

  1. Problem
  2. Summary
  3. Input Dataset
  4. Expected output
  5. Repo directory structure
  6. Testing the code

Problem

The federal government provides a way for consumers to file complaints against companies regarding different financial products, such as payment problems with a credit card or debt collection tactics. This project identifies the number of complaints filed and how they are spread across different companies.

Using only built-in Python libraries, the code reports, for each financial product and year: the total number of complaints, the number of companies that received at least one complaint, the company with the most complaints, and the highest percentage of complaints directed at a single company.

Summary

Jupyter Notebook version

The run.sh script invokes python3.7 with three arguments: the location and name of the Python script, the location and name of the input csv file, and the location and name of the desired output csv file.
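Based on that description, run.sh likely amounts to a single invocation along these lines (the exact paths are assumed from the repo directory structure):

```shell
#!/bin/bash
python3.7 ./src/consumer_complaints.py ./input/complaints.csv ./output/report.csv
```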

consumer_complaints.py has two parts: first it processes the csv file, then it aggregates the processed data and writes a new csv file.

Part 1

process_csv(file_loc): Takes in an input csv and returns a dictionary with processed data.
Takes in 1 argument:

  • file_loc: The file location to extract the csv from
  1. Check for missing columns (Product, Company, Date Received)
  2. Sort the data by product (alphabetically) and year (ascending)
  3. Creates and returns a dictionary with (product, year) as the key
    • The value is another dictionary {company_1: number of complaints} for that (product, year)
    • Lower case both product type and company name
    • Extract year from "Date received"
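The steps above can be sketched as follows. This is a minimal illustration, not the repository's exact implementation; it assumes ISO-formatted (YYYY-MM-DD) dates in the "Date received" column.

```python
import csv
from collections import defaultdict

REQUIRED_COLUMNS = {"Date received", "Product", "Company"}

def process_csv(file_loc):
    """Read the complaints csv and return {(product, year): {company: count}}."""
    data = defaultdict(lambda: defaultdict(int))
    with open(file_loc, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # 1. Check for missing columns
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")
        for row in reader:
            # Lowercase both product type and company name
            product = row["Product"].lower()
            company = row["Company"].lower()
            # Extract year from "Date received" (assumes YYYY-MM-DD)
            year = int(row["Date received"][:4])
            data[(product, year)][company] += 1
    # 2. Sort by product (alphabetically), then year (ascending)
    return {key: dict(data[key]) for key in sorted(data)}
```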

Part 2

output_csv(dict_data, save_loc): Takes in the processed data and creates an output csv file.
Takes in 2 arguments:

  • dict_data: The dictionary with the processed data to convert into csv
  • save_loc: The location and name to save the csv file to
  1. Set fieldnames for the csv file ('product', 'year', 'num_complaint', 'num_company', 'most_complaints', 'highest_percent')
  2. Create an output csv file
    • Read the dict_data and insert a row for each distinct (product, year)
    • Refer to Expected output for more detail
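A sketch of how output_csv might compute the aggregates, assuming the {(product, year): {company: count}} structure returned by Part 1. The half-up rounding here is an assumption based on the rounding convention described under Expected output; company or product names containing commas are quoted automatically by csv.writer.

```python
import csv

FIELDNAMES = ["product", "year", "num_complaint", "num_company",
              "most_complaints", "highest_percent"]

def output_csv(dict_data, save_loc):
    """Write one csv row per (product, year) with the aggregated statistics."""
    with open(save_loc, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # QUOTE_MINIMAL quotes names containing commas
        writer.writerow(FIELDNAMES)
        for (product, year), companies in dict_data.items():
            total = sum(companies.values())
            top_company = max(companies, key=companies.get)
            # Round half up (0.5 -> 1), per the stated rounding convention
            highest_percent = int(companies[top_company] / total * 100 + 0.5)
            writer.writerow([product, year, total, len(companies),
                             top_company, highest_percent])
```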

Input Dataset

The data source used in this project is from Data.gov.

The code will read an input file, complaints.csv, from the top-most input directory of the repository, process it, and write the results to an output file, report.csv, in the top-most output directory of the repository.

Each line of the input file, except for the first-line header, represents one complaint. Consult the Consumer Finance Protection Bureau's technical documentation for a description of each field.

  • Note that complaints are not listed in chronological order

For the purposes of this project, all names, including company and product, should be treated as case insensitive. For example, "Acme", "ACME", and "acme" would represent the same company.

Expected output

After reading and processing the input file, the code will create an output file, report.csv, with as many lines as unique pairs of product and year (of Date received) in the input file.

Each line in the output file should list the following fields in the following order:

  • product - type of product the consumer identified in the complaint (written in all lowercase)
  • year - year the CFPB received the complaint
  • num_complaint - total number of complaints received for that product and year
  • num_company - total number of companies receiving at least one complaint for that product and year
  • most_complaints - company with the most complaints for that product and year
  • highest_percent - highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year, using standard rounding conventions: any percentage from 0.5% up to 1%, inclusive, rounds to 1%, and anything below 0.5% rounds to 0%

The lines in the output file will be sorted by product (alphabetically) and year (ascending).

  • When a product has a comma (,) in the name, the name should be enclosed by double quotation marks (")
  • Percentages are listed as numbers and do not have % in them.
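Note that the half-up convention above differs from Python's built-in round(), which uses banker's rounding. A small helper along these lines (round_half_up is a hypothetical name, not from the source) would implement it for positive percentages:

```python
def round_half_up(pct):
    """Round a positive percentage to the nearest whole number, .5 rounds up."""
    return int(pct + 0.5)

# Built-in round(0.5) returns 0 (banker's rounding), which would
# violate the convention that 0.5% rounds to 1%.
```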

Repo directory structure

├── README.md
├── run.sh
├── src
│   └── consumer_complaints.py
├── input
│   └── complaints.csv
├── output
│   └── report.csv
└── testsuite
    └── tests
        ├── test_1
        │   ├── input
        │   │   └── complaints.csv
        │   └── output
        │       └── report.csv
        └── my-own-tests
            ├── input
            │   ├── complaints.csv
            │   ├── test1_complaints.csv
            │   ├── test2_complaints.csv
            │   ├── test3_complaints.csv
            │   └── test4_complaints.csv
            ├── output
            │   ├── report.csv
            │   └── report_test.csv
            ├── consumer_complaints_test.py
            └── consumer_complaints.py

Testing the code

The testsuite directory showcases input tests for the code. Under that directory, test_1 contains the sample input and output files, and my-own-tests contains a unittest file, consumer_complaints_test.py, that tests various csv input files.

The unit tests in consumer_complaints_test.py check:

  1. If the output csv matches the sample output
  2. If the input csv is missing a column
  3. If the input csv has a non-integer year value
  4. If the input csv raises a value error for year
  5. If the input csv raises a value error for product and company
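A sketch of how one of these checks (the missing-column case) could be written with unittest; check_columns is a hypothetical helper for illustration, not the repository's actual function.

```python
import csv
import io
import unittest

REQUIRED_COLUMNS = {"Date received", "Product", "Company"}

def check_columns(csv_text):
    """Raise ValueError if the csv header is missing a required column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

class TestInputValidation(unittest.TestCase):
    def test_missing_column(self):
        # Header lacks the Company column, so validation should fail
        with self.assertRaises(ValueError):
            check_columns("Date received,Product\n2019-09-24,Debt collection\n")

    def test_valid_header(self):
        # All required columns present: no exception expected
        check_columns("Date received,Product,Company\n")
```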
