
The objective of this project is to analyze a Marketing Campaign dataset provided by the instructor to understand customer behaviors affecting subscription rates for a magazine company.


Aytaj9/-Understanding-Magazine-Subscription-Behavior


-Understanding-Magazine-Subscription-Behavior

Objective: The objective of this project is to analyze a Marketing Campaign dataset provided by the instructor to understand customer behaviors affecting subscription rates for a magazine company. Using logistic regression and Support Vector Machine (SVM) models, the project aims to predict subscription behavior accurately, identify significant attributes impacting the marketing campaign's effectiveness, and provide actionable recommendations for the magazine company based on the analysis.

Summary: This project delves into a Marketing Campaign dataset to address the decline in magazine subscriptions observed over the past year. The exploration begins with data cleaning techniques, including handling missing values and feature engineering to transform variables for modeling purposes. Categorical variables such as Marital Status and Education are categorized, and new features like 'Age', 'Spent', and 'Living with' are created to enrich the dataset. Outliers are identified and removed, leading to a refined dataset of 2212 observations with 24 variables.

Following data preparation, the analysis proceeds with logistic regression and SVMs modeling. The logistic regression model identifies significant predictors impacting subscription behavior, highlighting variables like 'AcceptedCmp3' and 'Is Parent'. The model achieves a training accuracy of 89.66% and a testing accuracy of 87.65%, with precision and recall rates indicating satisfactory performance, albeit with a recall rate slightly below the desired threshold.

The SVM model yields a slightly higher testing accuracy of 88.86%, with precision at 76% and recall at 39%. Although its recall is lower than that of logistic regression (42%), the SVM model has higher precision and overall accuracy.

Based on the comparison of model metrics, the SVM model is recommended for implementation due to its higher accuracy and precision, providing actionable insights for the magazine company to optimize marketing campaigns and reverse the decline in subscriptions.

Introduction

This project works with a Marketing Campaign data set provided by the instructor. The data relates to a magazine company’s attempt to understand customer behavior and to determine which factors influence the effectiveness of its marketing campaign, following last year’s decrease in magazine subscriptions. We apply appropriate data cleaning techniques and then build a logistic regression model and an SVM model to predict subscription behavior and to identify which attributes are significant and how they affect the business. We then compare the overall accuracy, precision, and recall of both models and make a recommendation to the magazine company based on these metrics.

Exploratory Analysis

The data set contains 2240 observations with 29 variables. ID, Year Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2, Complain, Z CostContact, Z Revenue, and Response are numerical; Education and Marital Status are categorical; and Dt Customer is a date. The variables fall into four groups: Customer’s Information, Types of products, Marketing Campaign, and Place. The target (dependent) variable is Response, which takes the values 0 and 1.

When checking for missing values, we found that Income has 2216 non-missing values, that is, 24 missing values, which is only about 1% of the data set; we therefore removed those rows. The data set now contains 2216 observations with 29 variables.

Next, Dt Customer has to be converted from a date to a numerical type before it can be used for modeling. We created a feature, ‘Customer For’, that gives the number of days each customer has been registered in the company’s database, measured relative to the most recent customer on record. To compute it, we checked the newest and the oldest enrolment dates in the database: the newest is 2014-06-29 and the oldest is 2012-07-30. ‘Customer For’ is then the number of days between each customer’s enrolment date and the last recorded date.

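
The missing-value step and the ‘Customer For’ feature can be sketched as follows. This is a minimal illustration on a hypothetical four-row frame, not the project's actual code; the column names `Income`, `Dt_Customer`, and `Customer_For` and the ISO date format are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the campaign data; values are illustrative only.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, None, 71613.0, 26646.0],
    "Dt_Customer": ["2012-07-30", "2013-03-08", "2013-08-21", "2014-06-29"],
})

# Drop rows with a missing Income value.
df = df.dropna(subset=["Income"]).copy()

# Parse the enrolment date and measure tenure in days relative to the
# most recent enrolment date found in the data.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])
newest = df["Dt_Customer"].max()
df["Customer_For"] = (newest - df["Dt_Customer"]).dt.days

print(df[["ID", "Customer_For"]])
```

With the oldest and newest dates quoted above, the longest tenure in this toy frame is the gap between 2012-07-30 and 2014-06-29.
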
As mentioned above, the data has two categorical variables, Marital Status and Education, and we need to know their categories in order to encode them for logistic regression. Marital Status contains Married, Together, Single, Divorced, Widow, Alone, Absurd, and YOLO, while Education consists of Graduation, PhD, Master, 2n Cycle, and Basic. More categories produce more attributes once the values are encoded numerically, so we grouped the categories of these variables.

We also engineered features from other variables. We extracted ‘Age’ from ‘Year Birth’, the customer’s birth year; created ‘Spent’, the customer’s total spending across product categories over a span of two years (the sum of MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, and MntGoldProds); created ‘Living with’, which reduces Marital Status to two categories, ‘Partner’ (Married and Together) and ‘Alone’ (Absurd, Widow, YOLO, Divorced, and Single); created ‘Children’, the sum of Kidhome and Teenhome; built ‘Family size’, the sum of ‘Living with’ and ‘Children’; created ‘Is Parent’, which flags whether the number of children is greater than 0; and regrouped ‘Education’ into three categories: ‘Basic’, ‘Graduation’, and ‘Postgraduate’.

After this step, we removed the variables that had already been folded into new features ("Marital_Status", "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds", "Kidhome", "Teenhome"), along with unnecessary variables that have no effect on our target variable ("Dt_Customer", "Year_Birth", "ID"). This leaves 24 variables and 2216 observations.

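
The feature engineering described above might look like the sketch below. Several details are assumptions rather than facts from the project: the reference year used for ‘Age’ (2021 here), the numeric coding of ‘Living with’ inside ‘Family size’ (Alone = 1, Partner = 2), and the placement of ‘2n Cycle’ under ‘Basic’, which the text does not specify.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are illustrative.
df = pd.DataFrame({
    "Year_Birth": [1957, 1984, 1970],
    "Marital_Status": ["Single", "Together", "YOLO"],
    "Education": ["Graduation", "PhD", "Basic"],
    "Kidhome": [0, 1, 0],
    "Teenhome": [0, 0, 2],
    "MntWines": [635, 11, 426],
    "MntFruits": [88, 1, 49],
    "MntMeatProducts": [546, 6, 127],
    "MntFishProducts": [172, 2, 111],
    "MntSweetProducts": [88, 1, 21],
    "MntGoldProds": [88, 6, 42],
})

REFERENCE_YEAR = 2021  # assumption: the year used to compute Age is not stated
spend_cols = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

df["Age"] = REFERENCE_YEAR - df["Year_Birth"]
df["Spent"] = df[spend_cols].sum(axis=1)

# Collapse marital status into two groups.
partner = {"Married", "Together"}
df["Living_With"] = df["Marital_Status"].map(
    lambda s: "Partner" if s in partner else "Alone")

df["Children"] = df["Kidhome"] + df["Teenhome"]
# Assumption: household adults counted as 1 (Alone) or 2 (Partner).
df["Family_Size"] = df["Living_With"].map({"Alone": 1, "Partner": 2}) + df["Children"]
df["Is_Parent"] = (df["Children"] > 0).astype(int)

# Assumption: '2n Cycle' grouped with 'Basic'; the source does not say.
df["Education"] = df["Education"].replace({
    "2n Cycle": "Basic", "Master": "Postgraduate", "PhD": "Postgraduate"})

# Drop the columns that were folded into the new features.
df = df.drop(columns=["Marital_Status", "Year_Birth", "Kidhome", "Teenhome"] + spend_cols)
print(df)
```
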
We then moved the ‘Response’ variable from the middle of the data frame to the end, since the target variable is easier to inspect as the last column. From the summary statistics of the numerical variables, the average number of deal purchases and the average number of website purchases are about 2 and 4, respectively, although their maximum values are 15 and 27. The average number of website visits per month is 5, which against an average of 4 website purchases corresponds to roughly four purchases for every five visits, a good sign.

Age and Income show implausible values: the average customer age is 52 but the maximum reaches 128, and the average Income is 52247 but the maximum reaches 666666. We therefore plotted Income, Recency, Customer For, Age, and Spent, using Is Parent as the grouping category. The plots confirmed outliers, with some Age values above 90 and some Income values above 600000. We removed those outliers to improve data quality, which leaves 2212 observations.

Before building the logistic regression model, it is useful to examine the relationships between pairs of variables, so we created correlation plots for all variables and a correlation table with the exact coefficients. Unfortunately, the large number of variables made it hard to read every correlation, and some variables showed ‘NaN’.

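
The outlier rule stated above (Age above 90, Income above 600000) reduces to a simple boolean filter. The sketch below applies it to a hypothetical four-row frame; the column names are assumptions.

```python
import pandas as pd

# Hypothetical rows including the two kinds of outlier mentioned in the text.
df = pd.DataFrame({
    "Age": [52, 128, 45, 38],
    "Income": [52247.0, 60000.0, 666666.0, 41000.0],
})

# Keep only rows within the plausible ranges.
clean = df[(df["Age"] < 90) & (df["Income"] < 600000)]

print(f"rows before: {len(df)}, rows after: {len(clean)}")
```
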
We found that Income and Spent have a strong positive correlation (0.79), as do Spent and NumCatalogPurchases (0.78). Spent has a moderate negative correlation with Is Parent, Children, and Family Size. Income has a high positive correlation with NumWebPurchases, NumCatalogPurchases, NumStorePurchases, and Spent, a high negative correlation with NumWebVisitsMonth, and a moderate negative correlation with Children and Is Parent. We did not observe a strong correlation between any individual variable and Response. We next build a logistic regression model and an SVM model to see how the variables affect Response.
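
One common cause of ‘NaN’ entries in a correlation table is a column with zero variance, since the Pearson coefficient divides by the standard deviation. Whether that is what happened in this project is an assumption; the sketch below only demonstrates the effect on hypothetical data with one constant column.

```python
import pandas as pd

# Hypothetical data: Z_Revenue is constant, so its variance is zero.
df = pd.DataFrame({
    "Income": [20000, 40000, 60000, 80000],
    "Spent": [50, 400, 900, 1500],
    "Z_Revenue": [11, 11, 11, 11],
})

corr = df.corr()  # Pearson correlation by default
print(corr)

# Constant columns can be listed up front and excluded from the table.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
print("constant columns:", constant_cols)
```
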

Analysis

Before creating the logistic regression model, we converted the categorical variables to numerical data and assigned the predictors and the target variable for the model. We then split the data into training and testing sets at a 70:30 ratio. Before building the model, we screened variables using the Variance Inflation Factor (VIF): a VIF above roughly 5 to 10 indicates multicollinearity, and such variables should be removed. Fortunately, we found no multicollinearity among the variables, so we kept all 23 independent variables as predictors. After standardizing the independent variables and handling the class imbalance in ‘Response’, we built the logistic regression model with a statistical modeling method.

NumCatalogPurchase, AcceptedCmp4, Complain, Living with, Children, and Family size are not significant: their p-values are above the 0.05 significance level. The p-values for ‘Z CostContact’ and ‘Z Revenue’ came out as ‘nan’, so we could not determine their significance. ‘AcceptedCmp2’ and ‘Age’ are borderline: they could become significant if the accepted significance level were raised.

Among the most significant variables, ‘AcceptedCmp3’ has the largest positive effect on Response: as this predictor increases, the log-odds of Response increase by the size of its coefficient. ‘Is Parent’ has the largest negative effect: as this predictor increases, the log-odds of Response decrease accordingly. AcceptedCmp5, AcceptedCmp1, and AcceptedCmp2 have a moderate positive effect on Response, while Education has a moderate negative effect. This suggests that acceptance of earlier campaigns goes along with successful magazine subscription.

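
The VIF screening mentioned earlier in this section can be illustrated with plain NumPy. For predictor j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The helper `vif` and the synthetic data are hypothetical; the project itself may have used a library routine instead.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

values = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], values.round(2))))
```

In this toy case the two collinear columns get a large VIF while the independent one stays close to 1, which is the pattern the screening step looks for.
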
Furthermore, the negative Education effect suggests that more highly educated people are less inclined to respond to marketing and magazine subscriptions. Interestingly, ‘Is Parent’ has a negative effect on Response, which indicates that people with children are less interested in this kind of offer.

The model’s accuracy is 0.8966 on the training set and 0.8765 (87.65%) on the testing set, which is good. The confusion matrix shows 42 True Positives, 540 True Negatives, 23 False Positives (Type I errors), and 59 False Negatives (Type II errors). True Positives exceed False Positives and True Negatives exceed False Negatives, so we can proceed with this model. The classification report gives a precision of 0.65 (65%) and a recall of 0.42 (42%): 65% of positive predictions are actually correct, while only 42% of actual positives are identified. The precision is acceptable, but a recall below 50% is risky despite the high accuracy. We chose not to remove the insignificant variables, because in our view every variable contributes to the result and removing some of them offers no guarantee of a better one.

Instead, we built a Support Vector Machine (SVM), a supervised method used for classification and regression. We created the SVM model with gamma = 0.025, which controls how far the influence of a single training example reaches, and C = 3, the regularization parameter that governs the complexity of the model. After fitting the model to the training set, we evaluated it on the testing set, where its accuracy is 0.8886, or about 89%, which is good.

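
A minimal sketch of an SVM with the stated settings (gamma = 0.025, C = 3) and a 70:30 split, assuming scikit-learn is the library in use. The RBF kernel, the stratified split, the random seed, and the synthetic data from `make_classification` are all assumptions, so the printed accuracy has no relation to the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced binary data standing in for the campaign predictors.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

# 70:30 train/test split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize the predictors, then fit the SVM (RBF kernel is an assumption).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.025, C=3))
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.4f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling parameters fitted on the training set only.
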
From the confusion matrix of the SVM model, we observed 39 True Positives, 551 True Negatives, 12 False Positives, and 62 False Negatives. The classification report gives a precision of 0.76 (76%), which is higher than the logistic regression precision (65%), and a recall of 0.39 (39%), which is lower than the logistic regression recall (42%). The two models have almost the same accuracy at about 88%, with logistic regression slightly below the SVM; the SVM is also higher in precision and has a very similar recall.
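
The reported accuracy, precision, and recall follow directly from the confusion-matrix counts given in this section. The short check below recomputes them; the helper name `metrics` is hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts taken from the confusion matrices reported in the text.
logit = metrics(tp=42, tn=540, fp=23, fn=59)
svm = metrics(tp=39, tn=551, fp=12, fn=62)

for name, m in [("logistic regression", logit), ("SVM", svm)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Both matrices sum to 664 test observations, which matches a 30% hold-out from 2212 rows.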

Conclusion

The purpose of the assignment has been achieved. Using the logistic regression model and the Support Vector Machine (SVM) model, we identified the most significant variables and how they affect the Response variable, which helps the magazine company pin down the main reasons for last year’s decline in subscriptions and design campaigns around customers’ behaviors. Furthermore, after comparing the two models on accuracy, precision, and recall, we recommend implementing the SVM model, whose higher accuracy and precision should lead to a better outcome.
