[ Project Description ] [ Project Goal ] [ Initial Thoughts ] [ The Plan ] [ Acquire & Prep ] [ Explore ] [ Data Dictionary ] [ Modeling ] [ Steps to Reproduce ] [ Conclusion ] [ Meet the Team ]
The Consumer Financial Protection Bureau (CFPB) maintains a consumer complaint database: a collection of complaints about consumer financial products and services that the CFPB sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the consumer complaint database.
This database is not a statistical sample of consumers’ experiences in the marketplace. Complaints are not necessarily representative of all consumers’ experiences and complaints do not constitute “information” for purposes of the Information Quality Act.
Complaint volume should be considered in the context of company size and/or market share. For example, companies with more customers may have more complaints than companies with fewer customers. CFPB encourages users to pair complaint data with public and private datasets for additional context.
The Bureau removes PII and publishes the consumer’s narrative of their experience if the consumer opts to share it publicly. CFPB doesn’t verify all the allegations in complaint narratives. Unproven allegations in consumer narratives should be regarded as opinions, not facts. CFPB does not adopt the views expressed and makes no representation that the consumers’ allegations are accurate, clear, complete, or unbiased in substance or presentation. Users should consider what conclusions may be fairly drawn from complaints alone.
This project aims to predict a company's response to a complaint made by a consumer to the CFPB to see if the wording of a complaint can affect the response from a company.
- There are going to be keywords that match a company's response.
- Sentiment analysis will not be useful because most complaints will likely be negative.
- Here is a slide presentation summarizing the findings from exploration and the results of modeling.
- Data acquired from Google BigQuery
- 3,458,906 rows × 18 columns before cleaning
- Clean the data
- Drop Columns
- Rename columns
- Remove nulls
- Fix data types
- Create engineered columns from existing data
- Bin products
- Process narrative into clean and lemon
- Bin company responses
- Encode categorical columns
- Split data (60/20/20)
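The 60/20/20 split above can be sketched with two calls to scikit-learn's `train_test_split`; the function name, seed, and target column are hypothetical stand-ins for the project's wrangle code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df: pd.DataFrame, target: str = "response", seed: int = 42):
    """Split a dataframe 60/20/20 into train, validate, and test sets,
    stratifying on the target so class proportions are preserved."""
    # First cut: 60% train, 40% held out
    train, rest = train_test_split(
        df, test_size=0.4, random_state=seed, stratify=df[target])
    # Second cut: split the 40% evenly into validate and test (20%/20%)
    validate, test = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest[target])
    return train, validate, test
```

Stratifying on the target keeps the Relief / No Relief proportions roughly the same across all three sets.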
- Dropped Columns
- product
- 0% null
- ENGINEERED FEATURE
- bin related products/services together then drop
- bins = credit_report, credit_card, debt_collection, mortgage, bank, loans, and money_service
- subproduct
- 7% null
- issue
- 0% nulls
- 165 unique values
- Planned use in future iteration
- subissue
- 20% null
- 221 unique values
- consumer_complaint_narrative
- 64% null
- renamed to narrative
- drop all null values
- drop after NLTK cleaning
- company_public_response
- 56% null
- related to target
- Bin into:
- Relief: Monetary Relief Response and Non Monetary Relief Response
- No Relief: Closed with Explanation
- Dropped: Untimely Response and Closed
- zip_code
- 1% null
- mixed data types
- Planned use in future iteration
- consumer_consent_provided
- 25% null
- does not relate to the target
- submitted_via
- 0% null
- does not relate to the target
- date_sent_to_company
- 0% null
- Planned use in future iteration
- timely_response
- 0% null
- boolean
- Planned use in future iteration
- consumer_disputed
- 77% null
- complaint_id
- 0% null
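The product binning described above might look like the following pandas sketch. The raw CFPB product strings in the mapping are illustrative guesses; only the seven bin labels come from the notes.

```python
import pandas as pd

# Assumed raw-product-to-bin mapping; the real mapping in the project's
# wrangle code covers all raw CFPB product names.
PRODUCT_BINS = {
    "Credit reporting, credit repair services, or other personal consumer reports": "credit_report",
    "Credit card or prepaid card": "credit_card",
    "Debt collection": "debt_collection",
    "Mortgage": "mortgage",
    "Checking or savings account": "bank",
    "Student loan": "loans",
    "Vehicle loan or lease": "loans",
    "Money transfer, virtual currency, or money service": "money_service",
}

def bin_products(df: pd.DataFrame) -> pd.DataFrame:
    """Map raw product names into the seven bins, then drop the original."""
    df = df.copy()
    df["product_bins"] = df["product"].map(PRODUCT_BINS)
    return df.drop(columns=["product"])
```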
- Cleaned Columns
- date_received
- 0% nulls
- changed date to DateTime
- keep for purposes of exploration
- company_name
- 0% nulls
- 6,694 Companies
- state
- 1% null
- keep for purposes of exploration
- impute 1% null into UNKNOWN label
- tags
- 89% null
- impute nulls with "Average Person" label
- company_response_to_consumer
- Target
- 4 nulls = 0%
- Drop these 4 rows because this is the target column
- 8 initial unique values
- future: apply the model to in_progress complaints and see what it predicts based on the language
- Drop 'in progress' response because there is no conclusion
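As a rough sketch, the cleaning steps for the kept columns (the DateTime conversion plus the UNKNOWN and "Average Person" imputations) could be written as follows; the function name is hypothetical, while the column names follow the data dictionary.

```python
import pandas as pd

def clean_kept_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Parse dates and impute the sparse categorical columns."""
    df = df.copy()
    # date_received: change date string to DateTime
    df["date_received"] = pd.to_datetime(df["date_received"])
    # state: ~1% null, imputed with an UNKNOWN label
    df["state"] = df["state"].fillna("UNKNOWN")
    # tags: 89% null, imputed with an "Average Person" label
    df["tags"] = df["tags"].fillna("Average Person")
    return df
```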
1,238,536 rows × 7 columns after cleaning
Used NLTK to clean each document resulting in:
- 2 new columns: clean (tokenized, with numbers, special characters, and XX privacy redactions removed) and lemon (stopwords removed, non-dictionary words filtered out, and the remaining words lemmatized)
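A simplified, dependency-free stand-in for that cleaning pipeline is sketched below. The real project used NLTK's tokenizers, stopword corpus, and WordNet lemmatizer; here the stopword set is a tiny illustrative subset and no lemmatization is performed.

```python
import re

# Tiny illustrative stopword set; the project used NLTK's full English list.
STOPWORDS = {"i", "was", "on", "the", "a", "an", "and", "or", "to", "of", "my"}

def make_clean(narrative: str) -> str:
    """Lowercase, drop XX/XXXX privacy redactions, numbers, and specials."""
    text = narrative.lower()
    text = re.sub(r"x{2,}", " ", text)     # remove XX-style redactions
    text = re.sub(r"[^a-z\s]", " ", text)  # remove numbers and specials
    return " ".join(text.split())

def make_lemon(clean_text: str) -> str:
    """Drop stopwords from the cleaned text (lemmatization omitted here)."""
    return " ".join(w for w in clean_text.split() if w not in STOPWORDS)
```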
Selected columns to explore after cleaning:
- date_received, product_bins, company_name, state, tags, response, lemon
1. Are there words that get particular responses and is there a relationship?
- What are the payout words that got relief from the company?
2. Do all responses have a negative sentiment?
- Do narratives with a neutral or positive sentiment analysis relating to bank account products lead to relief from the company?
3. Are there unique words associated with no relief from the company?
4. Which product is more likely to have relief?
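Question 4 above can be explored with a simple pandas crosstab; the column names follow the data dictionary, and the function name is a hypothetical convenience.

```python
import pandas as pd

def relief_rate_by_product(df: pd.DataFrame) -> pd.Series:
    """Share of complaints in each product bin that received Relief."""
    # normalize="index" turns counts into per-product proportions
    rates = pd.crosstab(df["product_bins"], df["response"], normalize="index")
    return rates["Relief"].sort_values(ascending=False)
```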
Here is a link to the CFPB's official data dictionary
| Feature | Definition |
|---|---|
| date_received | Date the complaint was received by the CFPB |
| product | The type of product the consumer identified in the complaint |
| subproduct | The type of sub-product the consumer identified in the complaint |
| issue | The issue the consumer identified in the complaint |
| subissue | The sub-issue the consumer identified in the complaint |
| consumer_complaint_narrative | A description of the complaint provided by the consumer |
| company_public_response | The company's optional public-facing response to a consumer's complaint |
| company_name | Name of the company identified in the complaint by the consumer |
| state | Two-letter postal abbreviation of the state of the mailing address provided by the consumer |
| zip_code | The mailing ZIP code provided by the consumer |
| tags | Older American: consumer is aged 62 or older; Servicemember: consumer or spouse is an Active Duty, Guard, or Reserve member |
| consumer_consent_provided | Identifies whether the consumer opted in to publish their complaint narrative |
| submitted_via | How the complaint was submitted to the CFPB |
| date_sent_to_company | The date the CFPB sent the complaint to the company |
| company_response_to_consumer (target) | The response from the company about this complaint |
| timely_response | Indicates whether the company gave a timely response or not |
| consumer_disputed | Whether the consumer disputed the company's response, discontinued as of April 24, 2017 |
| complaint_id | Unique ID for complaints registered with the CFPB |
| product_bins | Engineered Feature: bin related products together |
| clean | Engineered Feature: tokenized, removed numbers/specials and XXs from privacy sanitization |
| lemon | Engineered Feature: removed stopwords, kept real words, and lemmatized the clean column |
| response | Engineered Feature: binned company_response_to_consumer |
Companies can categorize their response to a complaint in a number of ways.
- Closed with monetary relief: The steps taken included objective, measurable, and verifiable monetary relief to the consumer as a direct result of the steps taken or that will be taken in response to the complaint.
- Closed with non-monetary relief: The steps taken by the company in response to the complaint did not result in monetary relief, but may have addressed some or all of the consumer’s complaint involving non-monetary requests.
- Closed with explanation: The steps taken by the company in response to the complaint included an explanation that was tailored to the individual consumer’s complaint. For example, this category would be used if the explanation substantively meets the consumer’s desired resolution or explains why no further action will be taken.
- Closed: The company closed the complaint without relief – monetary or non-monetary – or explanation.
- In progress: The company’s indication that the complaint could not be closed within 15 calendar days and that its final responsive explanation to the consumer will be provided through the portal at a later date.
- Untimely Response: The company is taking longer than 15 days to provide a response.
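A sketch of the Relief / No Relief binning over the categories above; the bin labels and dropped categories come from the preparation notes, and the function name is a placeholder.

```python
import pandas as pd

RESPONSE_BINS = {
    "Closed with monetary relief": "Relief",
    "Closed with non-monetary relief": "Relief",
    "Closed with explanation": "No Relief",
}

def bin_responses(df: pd.DataFrame) -> pd.DataFrame:
    """Map company responses to Relief / No Relief and drop the rest
    (Closed, Untimely response, In progress have no usable conclusion)."""
    df = df.copy()
    df["response"] = df["company_response_to_consumer"].map(RESPONSE_BINS)
    return df.dropna(subset=["response"])
```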
Calculated the sample size for each class category using a 20% sampling rate.
- Because every class is sampled at the same rate, the class distribution stays intact, so we are not worried about the smaller dataset misrepresenting the data.
Created smaller datasets by sampling the specified number of samples from each class category.
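The per-class sampling step might be sketched with pandas' grouped sampling; using `response` as the class column is an assumption based on the data dictionary.

```python
import pandas as pd

def sample_per_class(df: pd.DataFrame, frac: float = 0.2, seed: int = 42) -> pd.DataFrame:
    """Take the same fraction of rows from each response class,
    preserving the original class proportions in the smaller dataset."""
    return (df.groupby("response", group_keys=False)
              .sample(frac=frac, random_state=seed))
```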
- TF-IDF with unigrams, bigrams, and trigrams
- Decision Tree
- Multi-Layer Perceptron
- Linear Support Vector Classification
- Recall
- Accuracy
- Baseline: 79.31%
- Top 2,900 words in 'lemon' column
- Encoded features
- tags
- product_bins
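The best-performing setup reported in the conclusion (TF-IDF over unigrams through trigrams capped at 2,900 features, feeding Linear Support Vector Classification with C=0.1 and dual=False) can be sketched as a scikit-learn pipeline. This omits the encoded tags and product_bins features and other preprocessing details, so treat it as an approximation of the project's models code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_model():
    """TF-IDF (1-3 grams, top 2,900 terms) feeding LinearSVC."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3), max_features=2900),
        LinearSVC(C=0.1, dual=False),
    )
```

Fit with `build_model().fit(train["lemon"], train["response"])` and score against validate before touching test.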
1) Clone this repo
   - You may need to update your Python libraries; ours were updated in June 2023
2) For a relatively quick run *(possibly 10+ min, depends on system resources)*
   - Verify `df = wrangle_complaints()` is in the cell after the imports of the final-report notebook
   - Run the final-report notebook
     - This uses a pre-built, cleaned dataset that the longer steps below would otherwise produce
     - Even after cleaning, the data amounts to just under 1 GB
     - Runtime may take some time due to sentiment analysis and modeling
3) For the longer run *(possibly 30+ min, depends on system resources)*
   - ⚠️ WARNING ⚠️: These are basically the same steps we took to originally acquire the data. They take a lot of time (and disk space) and may not even be the best way. We highly recommend the quick run in step 2 unless you want to know how we got the data and experience the long wait.
   - Verify `df = wrangle_complaints_the_long_way()` is in the cell after the imports of the final-report notebook
   - Install the pandas-gbq package through the terminal/command line: `pip install pandas-gbq`
   - Go to Google BigQuery and create a project
   - Copy the `SQL_query` variable found in `wrangle.py` and run it in Google BigQuery
   - Save the result as a BigQuery table in your project
     - You can look in `wrangle.py` for what we named the project, database, and table (we kept the database and table names the same as the original)
   - Edit and save the `SQL_query` variable in `wrangle.py` to use your BigQuery project's table names in the format `FROM database.table`, and edit the `project_ID` variable to your project's ID
   - Create a Service Account and key for your project
     - Save the key as `service_key.json` in the local repo
   - Run the final-report notebook and be patient
     - It may ask for authentication when it tries to query Google BigQuery
     - Try running again if it stopped for authentication
     - This runs the longer pathway: acquiring the data from the source and performing the cleaning and natural language preparation
     - It will probably take a while (3.5 million rows, 2+ GB), hence we do not recommend it
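For orientation, the acquisition call might look like the sketch below. The query shown is a placeholder against BigQuery's public CFPB complaints dataset; the project's actual `SQL_query` lives in `wrangle.py` and may select different columns.

```python
# Placeholder query; the real SQL_query is defined in wrangle.py.
SQL_QUERY = """
SELECT *
FROM `bigquery-public-data.cfpb_complaints.complaint_database`
"""

def acquire_complaints(project_id: str):
    """Pull the complaint table from BigQuery into a dataframe.
    Requires `pip install pandas-gbq` and valid Google credentials."""
    import pandas_gbq  # imported lazily so the module loads without it
    return pandas_gbq.read_gbq(SQL_QUERY, project_id=project_id)
```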
The analysis explored relationships between complaint words and responses, as well as the sentiment's influence on response types. However, no significant correlations were found between specific words and responses, and sentiment did not consistently impact the response type.
Unique words associated with each product category were identified, offering insights for response prediction. Certain product categories had higher chances of receiving relief responses, while others had lower.
- We believe the Linear Support Vector Classification model strikes a good balance. Its validate accuracy score was 79.46% and its recall score was 99.23%.
- Hyperparameters: C set to 0.1 and dual set to False
- With more time, we would experiment with different features, n-gram combinations, and metrics to improve our model's prediction performance in different areas.
- We decided to run the SVC model on the test data, and it gave us an accuracy score of 79.43% and recall score of 99.22%
- Enhance Response Analysis: The project highlights the need to analyze company responses to consumer complaints. Consider investing in natural language processing (NLP) techniques to extract meaningful insights from response data. By understanding the patterns and sentiments in responses, it might be possible to identify areas for improvement and optimize customer interactions.
- Monitor Sentiment and Product Categories: Pay attention to sentiment analysis of consumer complaints across different product categories. Identify trends in sentiment and response types to understand customer expectations and tailor the response strategies accordingly. This can help to improve the overall customer experience and target specific pain points in different product categories.
- Address Discrimination and Bias: Conduct further analysis on zip codes, states, and company responses to identify potential discrimination or bias in the complaint resolution process. Ensure fairness and equality by addressing any disparities and taking appropriate actions to eliminate discriminatory practices.
- Identify Industry Trends: Look for industry-specific trends by analyzing complaints related to specific companies, such as Silicon Valley Bank and Bank of America. This analysis can help identify emerging issues, detect patterns of non-compliance, and proactively address potential risks.
- Continuous Improvement: Treat the project as a starting point and continuously refine the complaint resolution processes. Regularly review customer feedback, complaints, and company responses to identify areas for improvement. Implement a feedback loop to integrate customer insights into operations and drive continuous improvement initiatives.
GitHubs: [ Alexia Lewis ] [ Rosendo Lugo ] [ Chellyann Moreno ] [ Tyler Kephart ]