Kaggle - Santander Customer Satisfaction

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don’t stick around. What’s more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late.

  • Goal: Predict whether a customer is satisfied or dissatisfied with their banking experience.

  • Metric: Area under the ROC curve between the predicted probability and the observed target (a short sketch follows this list).

  • Model: XGBoost and LightGBM

  • Data

    • Target: TARGET equals 1 for dissatisfied customers and 0 for satisfied customers.
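
To make the evaluation metric concrete, here is a minimal sketch (toy arrays, not the competition data) of how the area under the ROC curve is computed from predicted probabilities with scikit-learn:

from sklearn.metrics import roc_auc_score

# toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75; 1.0 would mean a perfect ranking
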
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

Dataset

cust_df = pd.read_csv("input/santander-customer-satisfaction/train.csv", encoding='latin-1')
print('dataset shape:', cust_df.shape)
cust_df.head(3)
# TARGET: class label, 1 = dissatisfied, 0 = satisfied
# 371 columns in total (including ID and TARGET)
dataset shape: (76020, 371)
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
0 1 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39205.17 0
1 3 2 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 49278.03 0
2 4 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 67333.77 0

3 rows × 371 columns

cust_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
  • 111 float features and 260 integer features; all features are numerical.
# The distribution of features

cust_df.describe()
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
count 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 ... 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 7.602000e+04 76020.000000
mean 75964.050723 -1523.199277 33.212865 86.208265 72.363067 119.529632 3.559130 6.472698 0.412946 0.567352 ... 7.935824 1.365146 12.215580 8.784074 31.505324 1.858575 76.026165 56.614351 1.172358e+05 0.039569
std 43781.947379 39033.462364 12.956486 1614.757313 339.315831 546.266294 93.155749 153.737066 30.604864 36.513513 ... 455.887218 113.959637 783.207399 538.439211 2013.125393 147.786584 4040.337842 2852.579397 1.826646e+05 0.194945
min 1.000000 -999999.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.163750e+03 0.000000
25% 38104.750000 2.000000 23.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.787061e+04 0.000000
50% 76043.000000 2.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.064092e+05 0.000000
75% 113748.750000 2.000000 40.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.187563e+05 0.000000
max 151838.000000 238.000000 105.000000 210000.000000 12888.030000 21024.810000 8237.820000 11073.570000 6600.000000 6600.000000 ... 50003.880000 20385.720000 138831.630000 91778.730000 438329.220000 24650.010000 681462.900000 397884.300000 2.203474e+07 1.000000

8 rows × 371 columns

Null data check

There are no null values in the dataset.

cust_df.isnull().sum()
ID                         0
var3                       0
var15                      0
imp_ent_var16_ult1         0
imp_op_var39_comer_ult1    0
                          ..
saldo_medio_var44_hace3    0
saldo_medio_var44_ult1     0
saldo_medio_var44_ult3     0
var38                      0
TARGET                     0
Length: 371, dtype: int64

The distribution of the target label

# Target
# happy customers have TARGET==0, unhappy customers have TARGET==1
# About 4% are unhappy => unbalanced dataset

print(cust_df['TARGET'].value_counts())
unsatisfied_cnt = cust_df[cust_df['TARGET']== 1].TARGET.count()
total_cnt = cust_df.TARGET.count()
print('The proportion of unsatisfied is {0:.2f}'.format(unsatisfied_cnt/total_cnt))
0    73012
1     3008
Name: TARGET, dtype: int64
The proportion of unsatisfied is 0.04
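
Because only about 4% of the customers are unhappy, the classes are heavily imbalanced. One common option for tree boosters (not applied in this notebook) is to up-weight the rare positive class via XGBoost's scale_pos_weight parameter; a minimal sketch:

# hypothetical sketch: weight the positive class by the negative/positive ratio
neg_cnt = (cust_df['TARGET'] == 0).sum()
pos_cnt = (cust_df['TARGET'] == 1).sum()
xgb_weighted = XGBClassifier(n_estimators=500, scale_pos_weight=neg_cnt / pos_cnt, random_state=156)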

Exploratory Data Analysis (EDA)

Var 3

var3 is suspected to be the nationality of the customer; its minimum value is -999999.

  • The 116 rows with -999999 presumably mean that the nationality of the customer is unknown (NaN).

  • Replace -999999 with 2 (the most common value).

  • Drop the ID feature.

cust_df.loc[cust_df.var3==-999999].shape
(116, 371)
print(cust_df.var3.value_counts()) 
cust_df['var3'].replace(-999999,2,inplace=True)
cust_df.drop('ID',axis=1, inplace=True)
 2         74165
 8           138
-999999      116
 9           110
 3           108
           ...  
 231           1
 188           1
 168           1
 135           1
 87            1
Name: var3, Length: 208, dtype: int64

Var 38

Var38 is suspected to be the mortgage value with the bank. If the mortgage is with another bank, the national average is used. See Link.

cust_df.var38.describe()
count    7.602000e+04
mean     1.172358e+05
std      1.826646e+05
min      5.163750e+03
25%      6.787061e+04
50%      1.064092e+05
75%      1.187563e+05
max      2.203474e+07
Name: var38, dtype: float64
# How does var38 look when the customer is unhappy?
cust_df.loc[cust_df['TARGET']==1, 'var38'].describe()
count    3.008000e+03
mean     9.967828e+04
std      1.063098e+05
min      1.113663e+04
25%      5.716094e+04
50%      8.621997e+04
75%      1.173110e+05
max      3.988595e+06
Name: var38, dtype: float64
# The histogram of var38 is not normally distributed
cust_df.var38.hist(bins=1000)
<AxesSubplot:>

cust_df.var38.map(np.log).hist(bins=1000);

# What are the most common values for var38?
cust_df.var38.value_counts()  # the value 117310.979016 appears 14868 times in column var38
117310.979016    14868
451931.220000       16
463625.160000       12
288997.440000       11
104563.800000       11
                 ...  
89665.500000         1
45876.570000         1
151505.640000        1
74548.170000         1
84278.160000         1
Name: var38, Length: 57736, dtype: int64
# Look at the distribution excluding the most common value.
cust_df.loc[~np.isclose(cust_df.var38, 117310.979016), 'var38'].map(np.log).hist(bins=100);

# The above plot suggests we split var38 into two variables:
# var38mc == 1 when var38 has the most common value and 0 otherwise
# logvar38 is the log-transformed var38 when var38mc is 0, and zero otherwise
cust_df['var38mc'] = np.isclose(cust_df.var38, 117310.979016)
cust_df['logvar38'] = cust_df.loc[~cust_df['var38mc'], 'var38'].map(np.log)
cust_df.loc[cust_df['var38mc'], 'logvar38'] = 0
#Check for nan's
print('Number of nan in var38mc', cust_df['var38mc'].isnull().sum())
print('Number of nan in logvar38',cust_df['logvar38'].isnull().sum())
Number of nan in var38mc 0
Number of nan in logvar38 0

Var 15

The most important feature for XGBoost is var15. According to a Kaggle forum post, var15 is the age of the customer.

# Let's look at the density of the age of happy/unhappy customers
sns.FacetGrid(cust_df, hue="TARGET", size=6) \
   .map(sns.kdeplot, "var15") \
   .add_legend()
plt.title('Unhappy customers are slightly older');

saldo_var30

cust_df.saldo_var30.hist(bins=100)
plt.xlim(0, cust_df.saldo_var30.max());

Var 36

cust_df['var36'].value_counts()
99    30064
3     22177
1     14664
2      8704
0       411
Name: var36, dtype: int64
sns.FacetGrid(cust_df, hue="TARGET", size=6) \
   .map(sns.kdeplot, "var36") \
   .add_legend()
plt.title('If var36 is 0,1,2 or 3 => less unhappy customers');

In the above plot we see that the density of unhappy customers is lower when var36 is not 99.

Predictive Modeling

X_features = cust_df.drop("TARGET", axis=1)
y_labels = cust_df.loc[:,"TARGET"]
print('Features shape:{0}'.format(X_features.shape))
Features shape:(76020, 371)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.2, random_state=0)
train_cnt = y_train.count()
test_cnt = y_test.count()
print('Proportion of each label in the train set')
print(y_train.value_counts()/train_cnt)
print('Proportion of each label in the test set')
print(y_test.value_counts()/test_cnt)
Proportion of each label in the train set
0    0.960964
1    0.039036
Name: TARGET, dtype: float64
Proportion of each label in the test set
0    0.9583
1    0.0417
Name: TARGET, dtype: float64
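
The split above happens to give similar label proportions in the train and test sets. To guarantee identical proportions, train_test_split accepts a stratify argument (not used in this notebook); a minimal sketch:

# hypothetical sketch: a stratified split preserves the ~96/4 label ratio in both sets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_features, y_labels, test_size=0.2, random_state=0, stratify=y_labels)
print(y_tr_s.value_counts(normalize=True))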

XGBoost

xgb_clf = XGBClassifier(n_estimators=500, random_state=156)


evals = [(X_train, y_train), (X_test,y_test)]
xgb_clf.fit(X_train, y_train, early_stopping_rounds=100,
           eval_metric="auc", eval_set=evals, verbose = False)
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 0.8395

XGBoost Hyperparameter tuning

  • Since the dataset has many columns, the risk of overfitting is high.

  • Only the max_depth, min_child_weight, and colsample_bytree hyperparameters are tuned in the first pass.

  • First, combine 2-3 parameters to find their best values, then, starting from those values, tune another 1-2 parameters.

  • Because the search takes a long time, reduce to n_estimators=100 and early_stopping_rounds=30 for the test runs, then increase them again once hyperparameter tuning is finished.

  • max_depth [default=6]: setting it to 0 removes the limit on tree depth. A higher max_depth increases the risk of overfitting; typical values are 3-10.

  • min_child_weight [default=1]: the minimum sum of instance weights required to make a further split in a tree. Larger values make splits more conservative, which helps control overfitting.

  • colsample_bytree [default=1]: similar to max_features in GBM. Randomly samples the features used to build each tree; useful for controlling overfitting when there are very many features.

from sklearn.model_selection import GridSearchCV

# n_estimators=100 instead of 500
xgb_clf = XGBClassifier(n_estimators=100)

params = {'max_depth':[5, 7] , 'min_child_weight':[1,3] ,'colsample_bytree':[0.5, 0.75] }

gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)

print('Best parameters of GridSearchCV:',gridcv.best_params_) 

xgb_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
Best parameters of GridSearchCV: {'colsample_bytree': 0.75, 'max_depth': 5, 'min_child_weight': 1}
ROC AUC: 0.8420
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth= 5,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 0.8446
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

LightGBM

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=500)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=False)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
ROC AUC: 0.8410
from sklearn.model_selection import GridSearchCV

#n_estimators=200
lgbm_clf = LGBMClassifier(n_estimators=200)

params = {'num_leaves': [32, 64 ],
          'max_depth':[128, 160],
          'min_child_samples':[60, 100],
          'subsample':[0.8, 1]}


# cv = 3
gridcv = GridSearchCV(lgbm_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

print('The Best parameters of GridSearchCV:', gridcv.best_params_)
lgbm_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
The Best parameters of GridSearchCV: {'max_depth': 128, 'min_child_samples': 60, 'num_leaves': 32, 'subsample': 0.8}
ROC AUC: 0.8411
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
lgbm_clf = LGBMClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
ROC AUC: 0.8446
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(lgbm_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

Feature Selection

X = cust_df.drop(["TARGET"], axis=1)
y = cust_df.loc[:,"TARGET"]
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif,chi2
from sklearn.preprocessing import Binarizer, scale

# First select features based on chi2 and f_classif
p = 3

X_bin = Binarizer().fit_transform(scale(X))
selectChi2 = SelectPercentile(chi2, percentile=p).fit(X_bin, y)
selectF_classif = SelectPercentile(f_classif, percentile=p).fit(X, y)

chi2_selected = selectChi2.get_support()
chi2_selected_features = [ f for i,f in enumerate(X.columns) if chi2_selected[i]]
print('Chi2 selected {} features {}.'.format(chi2_selected.sum(),
   chi2_selected_features))
f_classif_selected = selectF_classif.get_support()
f_classif_selected_features = [ f for i,f in enumerate(X.columns) if f_classif_selected[i]]
print('F_classif selected {} features {}.'.format(f_classif_selected.sum(),
   f_classif_selected_features))
selected = chi2_selected & f_classif_selected
print('Chi2 & F_classif selected {} features'.format(selected.sum()))
features = [ f for f,s in zip(X.columns, selected) if s]
print (features)
Chi2 selected 12 features ['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var5', 'num_var8_0', 'num_var30_0', 'num_var30', 'num_var42', 'saldo_var30', 'var36', 'num_meses_var5_ult3'].
F_classif selected 12 features ['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var4', 'num_var5', 'num_var8_0', 'num_var30', 'num_var35', 'num_var42', 'var36', 'num_meses_var5_ult3'].
Chi2 & F_classif selected 10 features
['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var5', 'num_var8_0', 'num_var30', 'num_var42', 'var36', 'num_meses_var5_ult3']
# Make a dataframe with the selected features and the target variable
X_sel = cust_df[features+['TARGET']]
sns.pairplot(X_sel, hue="TARGET", size=2, diag_kind="kde")

features
['var15',
 'ind_var5',
 'ind_var8_0',
 'ind_var30',
 'num_var5',
 'num_var8_0',
 'num_var30',
 'num_var42',
 'var36',
 'num_meses_var5_ult3']

Correlations

cor_mat = X_sel.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(cor_mat,linewidths=.5, ax=ax)
<AxesSubplot:>

# keep only strong correlations, excluding each feature's correlation with itself
threshold = 0.8
important_corrs = (cor_mat[abs(cor_mat) > threshold][cor_mat != 1.0]) \
    .unstack().dropna().to_dict()
unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key]) \
    for key in important_corrs])), columns=['attribute pair', 'correlation'])
# sorted by absolute value
unique_important_corrs = unique_important_corrs.loc[
    abs(unique_important_corrs['correlation']).argsort()[::-1]]
unique_important_corrs
attribute pair correlation
2 (ind_var8_0, num_var8_0) 0.999793
1 (ind_var5, num_var5) 0.993709
8 (ind_var5, num_meses_var5_ult3) 0.908842
12 (num_meses_var5_ult3, num_var5) 0.903272
9 (num_var30, num_var42) 0.898119
0 (ind_var30, num_var42) 0.894182
5 (ind_var30, num_var30) 0.875812
6 (ind_var30, num_meses_var5_ult3) 0.869045
3 (ind_var30, ind_var5) 0.848338
11 (ind_var30, num_var5) 0.843001
7 (num_var42, num_var5) 0.839574
10 (ind_var5, num_var42) 0.832502
4 (num_meses_var5_ult3, num_var42) 0.813847

Prediction with the new dataset

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_sel, y, test_size=0.2, random_state=0)
train_cnt2 = y_train2.count()
test_cnt2 = y_test2.count()
from sklearn.model_selection import GridSearchCV

# n_estimators=100 instead of 500
xgb_clf = XGBClassifier(n_estimators=100)

params = {'max_depth':[5, 7] , 'min_child_weight':[1,3] ,'colsample_bytree':[0.5, 0.75] }

gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train2, y_train2, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train2, y_train2), (X_test2, y_test2)], verbose=False)

print('Best parameters of GridSearchCV:',gridcv.best_params_) 

xgb_roc_score = roc_auc_score(y_test2, gridcv.predict_proba(X_test2)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
Best parameters of GridSearchCV: {'colsample_bytree': 0.5, 'max_depth': 5, 'min_child_weight': 1}
ROC AUC: 1.0000
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
xgb_clf.fit(X_train2, y_train2, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train2, y_train2), (X_test2, y_test2)], verbose = False)

xgb_roc_score = roc_auc_score(y_test2, xgb_clf.predict_proba(X_test2)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 1.0000
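
Note that X_sel was built above as cust_df[features+['TARGET']], so the TARGET column itself ends up among the training features here; the perfect 1.0000 scores reflect that label leakage rather than real predictive power. A minimal leak-free sketch using only the selected feature columns (variable names below are illustrative, not from the original run):

# hypothetical sketch: keep only the selected features, excluding TARGET
X_sel_feats = cust_df[features]
X_tr, X_te, y_tr, y_te = train_test_split(X_sel_feats, y, test_size=0.2, random_state=0)
xgb_sel = XGBClassifier(n_estimators=100, random_state=156)
xgb_sel.fit(X_tr, y_tr)
print('ROC AUC: {0:.4f}'.format(roc_auc_score(y_te, xgb_sel.predict_proba(X_te)[:, 1])))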
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
