Kaggle - Santander Customer Satisfaction

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don’t stick around. What’s more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late.

  • Goal: Predict whether a customer is satisfied or dissatisfied with their banking experience.

  • Metric: Area under the ROC curve between the predicted probability and the observed target (a short sketch follows this list).

  • Model: XGBoost and LightGBM

  • Data

    • Target: TARGET equals 1 for dissatisfied customers and 0 for satisfied customers.
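
To make the evaluation metric concrete, here is a minimal sketch (toy arrays, not the competition data) of how the area under the ROC curve is computed from predicted probabilities with scikit-learn:

from sklearn.metrics import roc_auc_score

# toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75; 1.0 would mean a perfect ranking
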
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

Dataset

cust_df = pd.read_csv("input/santander-customer-satisfaction/train.csv", encoding='latin-1')
print('dataset shape:', cust_df.shape)
cust_df.head(3)
# TARGET: class label, 1 = dissatisfied, 0 = satisfied
# 371 columns in total (including ID and TARGET)
dataset shape: (76020, 371)
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
0 1 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39205.17 0
1 3 2 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 49278.03 0
2 4 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 67333.77 0

3 rows × 371 columns

cust_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
  • 111 float features and 260 integer features; all features are numerical.
# The distribution of features

cust_df.describe()
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
count 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 ... 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 7.602000e+04 76020.000000
mean 75964.050723 -1523.199277 33.212865 86.208265 72.363067 119.529632 3.559130 6.472698 0.412946 0.567352 ... 7.935824 1.365146 12.215580 8.784074 31.505324 1.858575 76.026165 56.614351 1.172358e+05 0.039569
std 43781.947379 39033.462364 12.956486 1614.757313 339.315831 546.266294 93.155749 153.737066 30.604864 36.513513 ... 455.887218 113.959637 783.207399 538.439211 2013.125393 147.786584 4040.337842 2852.579397 1.826646e+05 0.194945
min 1.000000 -999999.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.163750e+03 0.000000
25% 38104.750000 2.000000 23.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.787061e+04 0.000000
50% 76043.000000 2.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.064092e+05 0.000000
75% 113748.750000 2.000000 40.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.187563e+05 0.000000
max 151838.000000 238.000000 105.000000 210000.000000 12888.030000 21024.810000 8237.820000 11073.570000 6600.000000 6600.000000 ... 50003.880000 20385.720000 138831.630000 91778.730000 438329.220000 24650.010000 681462.900000 397884.300000 2.203474e+07 1.000000

8 rows × 371 columns

Null data check

There are no null values in the dataset.

cust_df.isnull().sum()
ID                         0
var3                       0
var15                      0
imp_ent_var16_ult1         0
imp_op_var39_comer_ult1    0
                          ..
saldo_medio_var44_hace3    0
saldo_medio_var44_ult1     0
saldo_medio_var44_ult3     0
var38                      0
TARGET                     0
Length: 371, dtype: int64

The distribution of the target label

# Target
# happy customers have TARGET==0, unhappy customers have TARGET==1
# About 4% are unhappy => unbalanced dataset

print(cust_df['TARGET'].value_counts())
unsatisfied_cnt = cust_df[cust_df['TARGET']== 1].TARGET.count()
total_cnt = cust_df.TARGET.count()
print('The proportion of unsatisfied is {0:.2f}'.format(unsatisfied_cnt/total_cnt))
0    73012
1     3008
Name: TARGET, dtype: int64
The proportion of unsatisfied is 0.04
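
Because only about 4% of the customers are unhappy, the classes are heavily imbalanced. One common option for tree boosters (not applied in this notebook) is to up-weight the rare positive class via XGBoost's scale_pos_weight parameter; a minimal sketch:

# hypothetical sketch: weight the positive class by the negative/positive ratio
neg_cnt = (cust_df['TARGET'] == 0).sum()
pos_cnt = (cust_df['TARGET'] == 1).sum()
xgb_weighted = XGBClassifier(n_estimators=500, scale_pos_weight=neg_cnt / pos_cnt, random_state=156)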

Exploratory Data Analysis (EDA)

Var 3

var3 is suspected to be the nationality of the customer; its minimum value is -999999.

  • The 116 rows with -999999 presumably mean that the nationality of the customer is unknown (NaN).

  • Replace -999999 with 2 (the most common value).

  • Drop the ID feature.

cust_df.loc[cust_df.var3==-999999].shape
(116, 371)
print(cust_df.var3.value_counts()) 
cust_df['var3'].replace(-999999,2,inplace=True)
cust_df.drop('ID',axis=1, inplace=True)
 2         74165
 8           138
-999999      116
 9           110
 3           108
           ...  
 231           1
 188           1
 168           1
 135           1
 87            1
Name: var3, Length: 208, dtype: int64

Var 38

Var38 is suspected to be the mortgage value with the bank. If the mortgage is with another bank, the national average is used. See Link.

cust_df.var38.describe()
count    7.602000e+04
mean     1.172358e+05
std      1.826646e+05
min      5.163750e+03
25%      6.787061e+04
50%      1.064092e+05
75%      1.187563e+05
max      2.203474e+07
Name: var38, dtype: float64
# How does var38 look when the customer is unhappy?
cust_df.loc[cust_df['TARGET']==1, 'var38'].describe()
count    3.008000e+03
mean     9.967828e+04
std      1.063098e+05
min      1.113663e+04
25%      5.716094e+04
50%      8.621997e+04
75%      1.173110e+05
max      3.988595e+06
Name: var38, dtype: float64
# The histogram of var38 is not normally distributed
cust_df.var38.hist(bins=1000)
<AxesSubplot:>

cust_df.var38.map(np.log).hist(bins=1000);

# What are the most common values for var38?
cust_df.var38.value_counts()  # the value 117310.979016 appears 14868 times in column var38
117310.979016    14868
451931.220000       16
463625.160000       12
288997.440000       11
104563.800000       11
                 ...  
89665.500000         1
45876.570000         1
151505.640000        1
74548.170000         1
84278.160000         1
Name: var38, Length: 57736, dtype: int64
# Look at the distribution excluding the most common value.
cust_df.loc[~np.isclose(cust_df.var38, 117310.979016), 'var38'].map(np.log).hist(bins=100);

# The above plot suggests we split var38 into two variables:
# var38mc == 1 when var38 has the most common value and 0 otherwise
# logvar38 is the log-transformed var38 when var38mc is 0, and zero otherwise
cust_df['var38mc'] = np.isclose(cust_df.var38, 117310.979016)
cust_df['logvar38'] = cust_df.loc[~cust_df['var38mc'], 'var38'].map(np.log)
cust_df.loc[cust_df['var38mc'], 'logvar38'] = 0
#Check for nan's
print('Number of nan in var38mc', cust_df['var38mc'].isnull().sum())
print('Number of nan in logvar38',cust_df['logvar38'].isnull().sum())
Number of nan in var38mc 0
Number of nan in logvar38 0

Var 15

The most important feature for XGBoost is var15. According to a Kaggle forum post, var15 is the age of the customer.

# Let's look at the density of the age of happy/unhappy customers
sns.FacetGrid(cust_df, hue="TARGET", size=6) \
   .map(sns.kdeplot, "var15") \
   .add_legend()
plt.title('Unhappy customers are slightly older');

saldo_var30

cust_df.saldo_var30.hist(bins=100)
plt.xlim(0, cust_df.saldo_var30.max());

Var 36

cust_df['var36'].value_counts()
99    30064
3     22177
1     14664
2      8704
0       411
Name: var36, dtype: int64
sns.FacetGrid(cust_df, hue="TARGET", size=6) \
   .map(sns.kdeplot, "var36") \
   .add_legend()
plt.title('If var36 is 0,1,2 or 3 => less unhappy customers');

In the above plot we see that the density of unhappy customers is lower when var36 is not 99.

Predictive Modeling

X_features = cust_df.drop("TARGET", axis=1)
y_labels = cust_df.loc[:,"TARGET"]
print('Features shape:{0}'.format(X_features.shape))
Features shape:(76020, 371)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.2, random_state=0)
train_cnt = y_train.count()
test_cnt = y_test.count()
print('Proportion of each label in the train set')
print(y_train.value_counts()/train_cnt)
print('Proportion of each label in the test set')
print(y_test.value_counts()/test_cnt)
Proportion of each label in the train set
0    0.960964
1    0.039036
Name: TARGET, dtype: float64
Proportion of each label in the test set
0    0.9583
1    0.0417
Name: TARGET, dtype: float64
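
The split above happens to give similar label proportions in the train and test sets. To guarantee identical proportions, train_test_split accepts a stratify argument (not used in this notebook); a minimal sketch:

# hypothetical sketch: a stratified split preserves the ~96/4 label ratio in both sets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_features, y_labels, test_size=0.2, random_state=0, stratify=y_labels)
print(y_tr_s.value_counts(normalize=True))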

XGBoost

xgb_clf = XGBClassifier(n_estimators=500, random_state=156)


evals = [(X_train, y_train), (X_test,y_test)]
xgb_clf.fit(X_train, y_train, early_stopping_rounds=100,
           eval_metric="auc", eval_set=evals, verbose = False)
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 0.8395

XGBoost Hyperparameter tuning

  • Since the dataset has many columns, the risk of overfitting is high.

  • Only the max_depth, min_child_weight, and colsample_bytree hyperparameters are tuned in the first pass.

  • First, combine 2-3 parameters to find their best values, then, starting from those values, tune another 1-2 parameters.

  • Because the search takes a long time, reduce to n_estimators=100 and early_stopping_rounds=30 for the test runs, then increase them again once hyperparameter tuning is finished.

  • max_depth [default=6]: setting it to 0 removes the limit on tree depth. A higher max_depth increases the risk of overfitting; typical values are 3-10.

  • min_child_weight [default=1]: the minimum sum of instance weights required to make a further split in a tree. Larger values make splits more conservative, which helps control overfitting.

  • colsample_bytree [default=1]: similar to max_features in GBM. Randomly samples the features used to build each tree; useful for controlling overfitting when there are very many features.

from sklearn.model_selection import GridSearchCV

# n_estimators=100 instead of 500
xgb_clf = XGBClassifier(n_estimators=100)

params = {'max_depth':[5, 7] , 'min_child_weight':[1,3] ,'colsample_bytree':[0.5, 0.75] }

gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)

print('Best parameters of GridSearchCV:',gridcv.best_params_) 

xgb_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
Best parameters of GridSearchCV: {'colsample_bytree': 0.75, 'max_depth': 5, 'min_child_weight': 1}
ROC AUC: 0.8420
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth= 5,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 0.8446
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

LightGBM

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=500)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=False)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
ROC AUC: 0.8410
from sklearn.model_selection import GridSearchCV

#n_estimators=200
lgbm_clf = LGBMClassifier(n_estimators=200)

params = {'num_leaves': [32, 64 ],
          'max_depth':[128, 160],
          'min_child_samples':[60, 100],
          'subsample':[0.8, 1]}


# cv = 3
gridcv = GridSearchCV(lgbm_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

print('The Best parameters of GridSearchCV:', gridcv.best_params_)
lgbm_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
The Best parameters of GridSearchCV: {'max_depth': 128, 'min_child_samples': 60, 'num_leaves': 32, 'subsample': 0.8}
ROC AUC: 0.8411
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
lgbm_clf = LGBMClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)], verbose = False)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
ROC AUC: 0.8446
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(lgbm_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

Feature Selection

X = cust_df.drop(["TARGET"], axis=1)
y = cust_df.loc[:,"TARGET"]
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_classif,chi2
from sklearn.preprocessing import Binarizer, scale

# First select features based on chi2 and f_classif
p = 3

X_bin = Binarizer().fit_transform(scale(X))
selectChi2 = SelectPercentile(chi2, percentile=p).fit(X_bin, y)
selectF_classif = SelectPercentile(f_classif, percentile=p).fit(X, y)

chi2_selected = selectChi2.get_support()
chi2_selected_features = [ f for i,f in enumerate(X.columns) if chi2_selected[i]]
print('Chi2 selected {} features {}.'.format(chi2_selected.sum(),
   chi2_selected_features))
f_classif_selected = selectF_classif.get_support()
f_classif_selected_features = [ f for i,f in enumerate(X.columns) if f_classif_selected[i]]
print('F_classif selected {} features {}.'.format(f_classif_selected.sum(),
   f_classif_selected_features))
selected = chi2_selected & f_classif_selected
print('Chi2 & F_classif selected {} features'.format(selected.sum()))
features = [ f for f,s in zip(X.columns, selected) if s]
print (features)
Chi2 selected 12 features ['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var5', 'num_var8_0', 'num_var30_0', 'num_var30', 'num_var42', 'saldo_var30', 'var36', 'num_meses_var5_ult3'].
F_classif selected 12 features ['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var4', 'num_var5', 'num_var8_0', 'num_var30', 'num_var35', 'num_var42', 'var36', 'num_meses_var5_ult3'].
Chi2 & F_classif selected 10 features
['var15', 'ind_var5', 'ind_var8_0', 'ind_var30', 'num_var5', 'num_var8_0', 'num_var30', 'num_var42', 'var36', 'num_meses_var5_ult3']
# Make a dataframe with the selected features and the target variable
X_sel = cust_df[features+['TARGET']]
sns.pairplot(X_sel, hue="TARGET", size=2, diag_kind="kde")

features
['var15',
 'ind_var5',
 'ind_var8_0',
 'ind_var30',
 'num_var5',
 'num_var8_0',
 'num_var30',
 'num_var42',
 'var36',
 'num_meses_var5_ult3']

Correlations

cor_mat = X_sel.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(cor_mat,linewidths=.5, ax=ax)
<AxesSubplot:>

# keep only strong correlations, excluding each feature's correlation with itself
threshold = 0.8
important_corrs = (cor_mat[abs(cor_mat) > threshold][cor_mat != 1.0]) \
    .unstack().dropna().to_dict()
unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key]) \
    for key in important_corrs])), columns=['attribute pair', 'correlation'])
# sorted by absolute value
unique_important_corrs = unique_important_corrs.loc[
    abs(unique_important_corrs['correlation']).argsort()[::-1]]
unique_important_corrs
attribute pair correlation
2 (ind_var8_0, num_var8_0) 0.999793
1 (ind_var5, num_var5) 0.993709
8 (ind_var5, num_meses_var5_ult3) 0.908842
12 (num_meses_var5_ult3, num_var5) 0.903272
9 (num_var30, num_var42) 0.898119
0 (ind_var30, num_var42) 0.894182
5 (ind_var30, num_var30) 0.875812
6 (ind_var30, num_meses_var5_ult3) 0.869045
3 (ind_var30, ind_var5) 0.848338
11 (ind_var30, num_var5) 0.843001
7 (num_var42, num_var5) 0.839574
10 (ind_var5, num_var42) 0.832502
4 (num_meses_var5_ult3, num_var42) 0.813847

Prediction with the new dataset

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_sel, y, test_size=0.2, random_state=0)
train_cnt2 = y_train2.count()
test_cnt2 = y_test2.count()
from sklearn.model_selection import GridSearchCV

# n_estimators=100 instead of 500
xgb_clf = XGBClassifier(n_estimators=100)

params = {'max_depth':[5, 7] , 'min_child_weight':[1,3] ,'colsample_bytree':[0.5, 0.75] }

gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train2, y_train2, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train2, y_train2), (X_test2, y_test2)], verbose=False)

print('Best parameters of GridSearchCV:',gridcv.best_params_) 

xgb_roc_score = roc_auc_score(y_test2, gridcv.predict_proba(X_test2)[:,1], average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
Best parameters of GridSearchCV: {'colsample_bytree': 0.5, 'max_depth': 5, 'min_child_weight': 1}
ROC AUC: 1.0000
# n_estimators = 1000, learning_rate=0.02, reg_alpha=0.03. 
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# early_stopping_rounds=200
xgb_clf.fit(X_train2, y_train2, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train2, y_train2), (X_test2, y_test2)], verbose = False)

xgb_roc_score = roc_auc_score(y_test2, xgb_clf.predict_proba(X_test2)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
ROC AUC: 1.0000
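
Note that X_sel was built above as cust_df[features+['TARGET']], so the TARGET column itself ends up among the training features here; the perfect 1.0000 scores reflect that label leakage rather than real predictive power. A minimal leak-free sketch using only the selected feature columns (variable names below are illustrative, not from the original run):

# hypothetical sketch: keep only the selected features, excluding TARGET
X_sel_feats = cust_df[features]
X_tr, X_te, y_tr, y_te = train_test_split(X_sel_feats, y, test_size=0.2, random_state=0)
xgb_sel = XGBClassifier(n_estimators=100, random_state=156)
xgb_sel.fit(X_tr, y_tr)
print('ROC AUC: {0:.4f}'.format(roc_auc_score(y_te, xgb_sel.predict_proba(X_te)[:, 1])))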
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
