Bank Customer Churn Prediction

  • Developed algorithms in Python for bank managers to predict customer churn probability from labeled data

  • Preprocessed the dataset with data cleaning, feature transformation (encoding), and feature standardization

  • Trained supervised machine learning models including Logistic Regression, Random Forest, K-Nearest Neighbors, Naïve Bayes and Support Vector Machine, and applied regularization with optimal parameters to overcome overfitting

  • Evaluated model performance (accuracy or F1-score) of classification by k-fold cross-validation technique, and analyzed feature importance to identify top factors that influenced results such as age, estimated salary, credit score and balance

Background

Customer churn, defined as a customer cancelling their account, is a common business metric across industries such as telecommunications, banking, and SaaS. Knowing how to analyze this metric is therefore important, because it influences company strategy and the company's future. In this project, we use several supervised machine learning models to predict whether a bank customer will churn, analyze the results further, identify the key factors related to customer churn, and suggest how the company can act on them to retain valuable customers.

Keywords: Customer Churn Prediction and Analysis, Supervised Machine Learning, Classification Problem, scikit-learn

[data source link] https://www.kaggle.com/adammaus/predicting-churn-for-bank-customers

Workflow
  1. Problem Clarification and Framing
  2. Data Collection and Exploration
  3. Feature Preprocessing and Feature Engineering
  4. Model Training and Evaluation
  5. Feature Selection and Model Iteration
  6. Model Deployment and Productionization
Data Exploration

Let us have a look at the raw data first.

Metadata

According to the metadata, this is a very clean dataset with 10,000 entries and no null values.
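
A minimal sketch of this check, assuming the Kaggle CSV has been downloaded locally (the file name below is an assumption):

```python
import pandas as pd

# Load the raw data; the file name is assumed from the Kaggle download.
df = pd.read_csv("Churn_Modelling.csv")

# Metadata check: column types and non-null counts (10,000 entries, no nulls).
df.info()
```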

Unique Values in Each Column

Then we check how many unique values each column has.
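
This is a one-liner with pandas, assuming the df loaded above:

```python
# Number of distinct values per column, from lowest to highest cardinality.
print(df.nunique().sort_values())
```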

Exited is the label for customer churn, 1 for exit and 0 for non-exit.

RowNumber is just a running index of customers; it does not carry any physical meaning.

CustomerId is a unique identifier for each customer. Whether it carries physical meaning depends on the bank: in some banks a smaller ID means an earlier registration date, or the ID encodes geographical information, so we could infer how long a customer has been with the bank or where they are located; in others the ID is randomly generated and carries no information. Normally, we would need more background information from the bank to decide whether the ID is a useful column. In this project, we simply drop it.

Surname is also an ambiguous variable. Typically, we do not use a customer's name as an indicator of user behavior, except in special situations related to names. In practice, most banks mask user names for privacy reasons. In this project, we ignore it as well.

Balance has only 6,382 unique values across 10,000 customers, because many customers have a zero balance.

After this simple filtering, we are left with 6 numerical variables (CreditScore, Age, Tenure, NumOfProducts, Balance, EstimatedSalary) and 4 categorical variables (Geography, Gender, HasCrCard, IsActiveMember).

First of all, we want to check whether there is an obvious difference in the distribution of the numerical variables between customers who exit and those who do not. This gives us some initial impressions and hypotheses (a plotting sketch follows the list of observations).

  1. It looks like CreditScore, NumOfProducts and EstimatedSalary have little effect on whether a customer will exit.
  2. Customers in older age groups are more likely to exit, perhaps because they hold a larger balance and are more willing to try a new product. This is an insight from data visualization alone, and we could adjust our products or strategy based on it.
  3. From the Tenure boxplot, we find that customers with very short or very long tenure are more likely to exit. We can hypothesize that customers with a short tenure are not satisfied with the products, while those with a long tenure have hit a bottleneck and want to try something new.
  4. For the Balance variable, we observe that customers with a low balance, especially one close to zero, are less likely to exit. This also confirms the earlier observation that many customers have a zero balance (far fewer unique values in Balance than in CustomerId). It makes sense: customers with little balance have no real need to exit and switch to a new product.
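
The comparison above can be sketched with a few lines of plotting code, assuming the df loaded earlier and seaborn for the boxplots:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["CreditScore", "Age", "Tenure", "NumOfProducts", "Balance", "EstimatedSalary"]

# One boxplot per numerical feature, split by the Exited label.
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), num_cols):
    sns.boxplot(x="Exited", y=col, data=df, ax=ax)
plt.tight_layout()
plt.show()
```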

We also check the correlation between variables. From the heat map, there is no strong correlation between any two of the numerical features. Whether a correlation counts as high or low ultimately depends on domain knowledge. If the correlation is not high enough, it is not advisable to drop either feature, because each still carries some new information.
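
A sketch of the heat map, under the same assumptions as the previous snippet:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["CreditScore", "Age", "Tenure", "NumOfProducts", "Balance", "EstimatedSalary"]

# Pairwise correlation between the numerical features.
sns.heatmap(df[num_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```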

Then we check the categorical variables. Whether a customer has a credit card does not look like an obvious indicator of churn. German customers are more likely to drop the product than French or Spanish customers, with roughly twice the churn rate. Female customers churn noticeably more than male customers, by around 10 percentage points. Meanwhile, inactive members have around twice the churn rate of active members.
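
These churn rates can be read off a simple group-by (a sketch, assuming the df loaded earlier):

```python
# Mean of the Exited label per category = churn rate of that group.
for col in ["HasCrCard", "Geography", "Gender", "IsActiveMember"]:
    print(df.groupby(col)["Exited"].mean().round(3), "\n")
```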

Feature Engineering

To go further, we need to process the raw dataset. As mentioned before, RowNumber, CustomerId, and Surname are not useful, so we drop them directly. The Exited column is stored separately as the label for model building. We then have an initially processed dataset containing only the attributes we need.
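
A minimal sketch of this step, assuming the df loaded earlier:

```python
# Drop the three identifier columns and separate the label.
X = df.drop(columns=["RowNumber", "CustomerId", "Surname", "Exited"])
y = df["Exited"]
```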

However, we still cannot feed this dataset to a model, because the algorithms cannot process text directly; the text columns must be converted into numbers. We encode the Gender column with ordinal encoding and the Geography column with one-hot encoding. Now the dataset is fully numerical and can be processed by the computer.
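
A sketch of the two encodings (the particular Female/Male mapping below is an arbitrary assumption; only the distinction matters):

```python
import pandas as pd

# Ordinal encoding for the binary Gender column.
X["Gender"] = X["Gender"].map({"Female": 0, "Male": 1})

# One-hot encoding for the three-level Geography column.
X = pd.get_dummies(X, columns=["Geography"])
```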

Next, we split the dataset into an 80% training set and a 20% testing set with stratified sampling, and then scale the features with standardization or normalization, for the reasons below (a code sketch follows the list).

  1. If we do not use stratified sampling on the label, the split may be imbalanced. Suppose we have 8,000 positive cases and 2,000 negative cases; the most extreme (and very unlikely) outcome is that all positive cases end up in the training set and all negative cases in the testing set, which would certainly produce a bad model.
  2. If we do not split the data before standardization or normalization, information from the testing data leaks into training when the min/max or mean/standard deviation is computed on the full dataset. We have to behave as if the testing data were unseen while training the model.
  3. If we do not standardize or normalize the data, we have a scaling problem: attributes with a large magnitude will dominate attributes with a small magnitude. It is unreasonable to judge feature importance purely by value magnitude, so we have to avoid this. Note that, since we should not have access to the testing data during training, we can only use the statistics of the training data to standardize or normalize the testing data.
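
A minimal sketch of the split-then-scale order described above (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stratified 80/20 split so both sets keep the same churn ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training set only, then reuse its statistics on the test set.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```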

Finally, we have a dataset ready for model building.

Model Comparison

Typically, it is hard to know in advance which hyperparameter values will work best for a given algorithm on a given dataset, so it is common to run a grid search over different hyperparameter values.
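
A minimal grid-search helper used in the sketches below; the helper name, the 5-fold CV setting, and the search spaces in later snippets are assumptions, since the original grids are not reproduced in the text:

```python
from sklearn.model_selection import GridSearchCV

def tune(estimator, param_grid, X_train, y_train, cv=5):
    """Cross-validated grid search; returns the fitted search object."""
    search = GridSearchCV(estimator, param_grid, scoring="accuracy", cv=cv, n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_, round(search.best_score_, 3))
    return search
```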

Logistic Regression

The main hyperparameters of Logistic Regression are the penalty, chosen from L1 or L2, and C, the inverse of the regularization strength.
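
A sketch of the search using the tune helper above; the candidate values for C are assumptions (the text only reports that C = 0.2 with the L1 penalty wins):

```python
from sklearn.linear_model import LogisticRegression

# The liblinear solver supports both L1 and L2 penalties.
lr_search = tune(
    LogisticRegression(solver="liblinear", max_iter=1000),
    {"penalty": ["l1", "l2"], "C": [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]},
    X_train, y_train,
)
```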

On this dataset, we find that the model with C = 0.2 and the L1 penalty is the best one, with an accuracy of 0.81.

Searching Space
Best Logistic Regression Model (AUC = 0.778)
K-Nearest Neighbors

In K-Nearest Neighbors, the most important hyperparameter is n_neighbors, which controls how many neighbors vote for a given data point. The number should be odd to avoid ties. It is also interesting to test different distance metrics (Euclidean, Manhattan, or Minkowski) for determining the composition of the neighborhood.
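
A sketch of the KNN search with the tune helper (the candidate values are assumptions):

```python
from sklearn.neighbors import KNeighborsClassifier

knn_search = tune(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9, 11],                      # odd values to avoid ties
     "metric": ["euclidean", "manhattan", "minkowski"]},
    X_train, y_train,
)
```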

In this model, the best number of neighbors is 7 and Euclidean distance is the better choice, giving an accuracy of 0.84.

Searching Space
Best K-Nearest Neighbors Model (AUC = 0.787)
Random Forest

The Random Forest model has hyperparameters including n_estimators, the number of trees in the forest; max_depth, which restricts the maximum depth of each tree to avoid overfitting; criterion, chosen from Gini impurity or entropy as the information-gain measure; and bootstrap, which decides whether each tree is trained on a bootstrap sample or on the whole dataset.
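
A sketch of the Random Forest search (the candidate values and random seed are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

rf_search = tune(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 300, 500],
     "max_depth": [4, 6, 8, None],
     "criterion": ["gini", "entropy"],
     "bootstrap": [True, False]},                          # False = grow each tree on all data
    X_train, y_train,
)
```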

As we can see, the accuracy of the model with the best hyperparameters is 0.86.

Searching Space
Best Random Forest Model (AUC = 0.860)
Naive Bayes

There are no special hyperparameters to tune in Naive Bayes.

The accuracy on this dataset is 0.82 and AUC is 0.785.
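
For completeness, a sketch of fitting and cross-validating Naive Bayes; the Gaussian variant is an assumption, as the text does not name one:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

nb = GaussianNB()
print("accuracy:", cross_val_score(nb, X_train, y_train, cv=5, scoring="accuracy").mean())
print("ROC AUC :", cross_val_score(nb, X_train, y_train, cv=5, scoring="roc_auc").mean())
```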

Naive Bayes Model (AUC = 0.785)
Support Vector Machine

The most important hyperparameters of the Support Vector Machine are the kernel, chosen from linear or RBF, which controls how the input variables are projected, and C, which takes on a range of values and has a dramatic effect on the shape of the resulting decision regions for each class.
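
A sketch of the SVM search (the candidate values are assumptions; probability=True is only needed if predicted probabilities are used for the ROC curve):

```python
from sklearn.svm import SVC

svm_search = tune(
    SVC(probability=True),
    {"kernel": ["linear", "rbf"], "C": [0.5, 1.0, 2.5, 5.0]},
    X_train, y_train,
)
```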

For this model, the RBF kernel with C = 2.5 is clearly the best choice, giving a model with an accuracy of 0.86.

Searching Space
Best Support Vector Machine Model (AUC = 0.819)
Conclusion

We choose accuracy as the metric for evaluating model performance. More complex models, such as Random Forest and Support Vector Machine, do achieve higher accuracy than relatively simpler models such as Naive Bayes and Logistic Regression, with K-Nearest Neighbors in between. However, during hyperparameter tuning we have to admit that the more complex models are also more time-consuming. There is always a trade-off between time efficiency and model performance.

Feature Selection
Feature Importance by Logistic Regression
Feature Importance by Random Forest

From the data visualization, we recall that age, gender, balance, and geography might be important indicators of customer churn. After model training and comparison, we extract feature importance from the algorithms that provide it, such as Logistic Regression with the L1 penalty and Random Forest. According to Logistic Regression, age is the most important indicator; being an active member, location, gender, and balance are also significant factors, which confirms that the conclusions from data visualization make sense. Random Forest likewise shows that age is the crucial factor for customer churn, with salary, credit score, and balance as other key factors. Combining these results, bank managers can make better-informed decisions about the churn problem: they should look at the age distribution first and understand customer behavior across age groups, and then, using the other key factors relevant to their particular business situation, they can address customer behavior with more confidence.
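
A sketch of extracting both rankings, assuming lr_search and rf_search are the fitted searches from above and X is the encoded feature DataFrame:

```python
import numpy as np
import pandas as pd

feature_names = X.columns

# Logistic Regression: magnitude of the coefficients on the standardized features.
lr_importance = pd.Series(np.abs(lr_search.best_estimator_.coef_.ravel()), index=feature_names)

# Random Forest: impurity-based feature importances.
rf_importance = pd.Series(rf_search.best_estimator_.feature_importances_, index=feature_names)

print(lr_importance.sort_values(ascending=False).head(10))
print(rf_importance.sort_values(ascending=False).head(10))
```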