Predicting Credit Card Customer Attrition (Churn)

Overview #

Customer churn (or customer attrition) is a problem for any business in the service industry: you only make money by keeping customers interested in your product. In the financial services industry this usually takes the form of credit cards, so the more people that use a bank's credit card service, the more money the bank makes.

This project compares the performance of several classification models (Logistic Regression, Random Forest, Decision Tree, and XGBoost), combined with Principal Component Analysis (PCA) for feature reduction and SMOTE up-sampling (Synthetic Minority Oversampling Technique, for balancing the dataset), in order to create the best possible model for predicting which customers are going to drop the bank's credit card and leave for a competitor. Recall is used as the scoring metric to compare how each model performed on the data.


View Jupyter Notebook #

The notebook goes step-by-step through the project; please follow the directions and cell order if you would like to replicate the results.

  • Click the "View Notebook" button to open the rendered notebook in a new tab
  • Click the "GitHub" button to view the project in the GitHub portfolio repo

Methodology #

Each classification model type (Logistic Regression, Random Forest, XGBoost, and Decision Tree) will be trained and have its hyperparameters fine-tuned on the PCA-reduced features. Because this is a classification problem on customer data, a number of categorical variables need to be one-hot encoded before they can be used in the modeling process. SMOTE up-sampling also needs to be done to balance the dataset so that the models do not develop a bias toward the majority class (existing customers).
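
As a rough sketch of the encoding step (the CSV file name is an assumption, and the notebook is the authoritative source for which columns get encoded):

```python
import pandas as pd

# Load the raw data (file name is an assumption) and one-hot encode every
# categorical column except the target, Attrition_Flag.
df = pd.read_csv("BankChurners.csv")

categorical_cols = (
    df.drop(columns=["Attrition_Flag"]).select_dtypes(include="object").columns
)
df_encoded = pd.get_dummies(df, columns=list(categorical_cols))
```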

Exploratory Data Analysis (EDA) #

The following graphs were created to get a better idea of the overall structure and distribution of the data in the dataset. This helps surface anything that might need to be accounted for or changed before starting analysis and modeling.

Variable Histograms #

Histograms of the numerical variables to see their overall distributions in the dataset:
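
A sketch of how such histograms can be produced (assuming `df` is the DataFrame loaded in the encoding sketch above; the notebook may plot them differently):

```python
import matplotlib.pyplot as plt

# Plot a histogram for each numeric column to inspect its distribution.
df.select_dtypes(include="number").hist(figsize=(14, 10), bins=30)
plt.tight_layout()
plt.show()
```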

Data Proportions #

We can see that while the gender split in the dataset is relatively balanced, the majority of the customer data we have is of existing customers. This is where SMOTE comes in: it balances the data by up-sampling the attrited samples to match the existing-customer sample size. This evens out the skewed classes and should also help improve the performance of the models.

The majority of customers make $60k or less and have either completed high school or hold a graduate degree.
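
These proportions can be checked directly from the data; a minimal sketch (assuming `df` from above, with column names taken from the dataset description):

```python
# Relative class balance of the target and the gender split.
print(df["Attrition_Flag"].value_counts(normalize=True))
print(df["Gender"].value_counts(normalize=True))
```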

SMOTE Up-sampling and Feature Reduction (PCA) #

SMOTE up-sampling involves applying the SMOTE class from the imblearn library to the feature matrix and the target column, in this case Attrition_Flag.
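
A minimal sketch of that step, assuming `df_encoded` from the encoding sketch above (the random seed is an arbitrary choice):

```python
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder

# Separate features from the target and encode the target labels as 0/1,
# then up-sample the minority (attrited) class with SMOTE.
X = df_encoded.drop(columns=["Attrition_Flag"])
y = LabelEncoder().fit_transform(df_encoded["Attrition_Flag"])

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```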

Once up-sampling is complete we can look at PCA for reducing the number of features in the dataset, see below:

The graph above shows the explained variance of each PCA component, along with the cumulative sum across components. Looking at those values, using 8 of the 17 PCA components is a reasonable choice: it reduces the total number of encoded features by more than half while still explaining roughly 80% of the variance in the encoded data. With up-sampling and feature reduction completed, we can move on to splitting the dataset into train/test sets and creating and training the models.
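
For reference, a sketch of this PCA step, assuming `X_res` from the SMOTE sketch above (scaling before PCA is an assumption on my part; the notebook may differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the up-sampled features, fit PCA, and inspect the cumulative
# explained variance to choose how many components to keep.
X_scaled = StandardScaler().fit_transform(X_res)

pca_full = PCA().fit(X_scaled)
print(np.cumsum(pca_full.explained_variance_ratio_))

# Keep the first 8 components, per the ~80% cutoff discussed above.
X_pca = PCA(n_components=8).fit_transform(X_scaled)
```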

Tuning Model Hyperparameters #

The Python library sklearn provides a tool that lets you run your models with a multitude of different parameter combinations to find the best ones for your model. That tool is GridSearchCV, and it makes fine-tuning models and finding the best possible hyperparameters so much easier.

GridSearchCV Example #

Below is an excerpt from the linked notebook for this project showing what GridSearchCV looks like in practice, in this case using Logistic Regression.
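
Since the rendered excerpt lives in the notebook, here is a sketch of the same idea; the parameter grid and split settings are illustrative rather than the notebook's exact values, and it assumes `X_pca` and `y_res` from the sketches above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Train/test split on the PCA-reduced, up-sampled data.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y_res, test_size=0.25, random_state=42
)

# Illustrative grid; the notebook's actual grid may differ.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "liblinear"],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # Recall is the project's scoring metric
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```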

Results - XGBoost #

After training all the models, the XGBoost classifier came out on top with a recall score of ~0.98:

Best Parameters #

The best parameters for the XGBoost model were the following:

Best XG Boost Classifier Parameters
===================================
              booster: gbtree
     colsample_bytree: 0.8
          eval_metric: logloss
      importance_type: weight
        learning_rate: 0.2
            max_depth: 6
           reg_lambda: 0.2
            subsample: 0.8
    use_label_encoder: False
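
For reference, a sketch of an XGBClassifier built with these parameters, assuming the train split from the GridSearchCV sketch above (note that use_label_encoder was removed in newer xgboost releases, so that argument may need to be dropped depending on the installed version):

```python
from xgboost import XGBClassifier

# Tuned parameters from the grid search above.
best_xgb = XGBClassifier(
    booster="gbtree",
    colsample_bytree=0.8,
    eval_metric="logloss",
    importance_type="weight",
    learning_rate=0.2,
    max_depth=6,
    reg_lambda=0.2,
    subsample=0.8,
    use_label_encoder=False,  # drop this argument on xgboost >= 2.0
)
best_xgb.fit(X_train, y_train)
```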

Confusion Matrix #

After the best model was selected, it was used to predict on the original dataset (no up-sampling) to see how it would perform, and it ended up performing very well; see the matrix below:
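
A sketch of how such a matrix can be produced with sklearn; here `X_original` and `y_original` are hypothetical placeholders for the original, non-up-sampled data pushed through the same encoding and PCA transforms as the training data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the tuned model on the original (non-up-sampled) data.
ConfusionMatrixDisplay.from_estimator(best_xgb, X_original, y_original)
plt.show()
```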

Precision-Recall Curve #
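
The precision-recall curve can be drawn the same way (using the same hypothetical `X_original` / `y_original` placeholders as above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay

# Precision-recall curve for the tuned model on the original data.
PrecisionRecallDisplay.from_estimator(best_xgb, X_original, y_original)
plt.show()
```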

Conclusion #

From the Confusion Matrix and Precision-Recall Curve above, it is evident that the XGBoost Classifier (with tuned hyperparameters) performed very well with the data and made very good predictions on both the test set and the original dataset (without up-sampling). Given the analysis and the final results, I am quite confident that this XGBoost Classifier model would serve the bank well for predicting customer attrition on its credit card product.