Kaggle-Titanic Data

  • Goal: We use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

  • Metric: The percentage of passengers you correctly predict, known as accuracy.

  • Data Site

Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree

from sklearn.model_selection import train_test_split #training and testing data split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from collections import Counter



plt.style.use('seaborn')
sns.set(font_scale=1)

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
df_train = pd.read_csv('input/train.csv')
df_test = pd.read_csv('input/test.csv')
df_train.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df_train.info()
df_test.info()
print(df_train.describe())
print(df_test.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

There are null values: in the training set, the non-null counts of Age, Cabin, and Embarked are lower than that of PassengerId; in the test set, Age, Cabin, and Fare have missing values.

Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that all have and hold meaning.
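
As an aside (not part of the pipeline below, which fills Age, Fare, and Embarked by hand), scikit-learn's SimpleImputer is one generic way to make such columns usable; a minimal sketch, assuming scikit-learn >= 0.20:

from sklearn.impute import SimpleImputer

# Replace NaN in numeric columns with the column median (illustration only;
# the sections below impute Age/Fare/Embarked with hand-picked values instead,
# so nothing is overwritten here).
imputer = SimpleImputer(strategy='median')
age_fare_imputed = imputer.fit_transform(df_train[['Age', 'Fare']])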

## Null data check

#df_train = df_train.fillna(np.nan)
df_train.isnull().sum() #checking for total null values
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
#df_test = df_test.fillna(np.nan)
df_test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In the training set, the Age, Cabin, and Embarked features have null values; in the test set, Age, Cabin, and Fare do.

for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'. format(col, 100*(df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)
column: PassengerId	 Percent of NaN value: 0.00%
column:   Survived	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 19.87%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.00%
column:      Cabin	 Percent of NaN value: 77.10%
column:   Embarked	 Percent of NaN value: 0.22%
for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100*(df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)
column: PassengerId	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 20.57%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.24%
column:      Cabin	 Percent of NaN value: 78.23%
column:   Embarked	 Percent of NaN value: 0.00%
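
The same percentages can also be read off in a single line per frame (a small aside, not from the original cells):

# Percent of NaN values per column, computed directly
print((df_train.isnull().mean() * 100).round(2))
print((df_test.isnull().mean() * 100).round(2))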

## Outlier detection

  • I used the Tukey method (Tukey JW., 1977) to detect outliers.
def get_outliers(df,n,columns):
    
    outlier_indices = []
    
    for col in columns:
        q1 = np.percentile(df[col],25)
        q3 = np.percentile(df[col],75)
        iqr = q3 - q1
        iqr_weight = iqr*1.5
        
        outlier_lists = df[(df[col] < q1 - iqr_weight) | (df[col] > q3 + iqr_weight)].index
        outlier_indices.extend(outlier_lists)
        
    # select observations flagged as an outlier in more than n columns
    outlier_indices = Counter(outlier_indices)  # e.g. Counter({27: 3, 88: 3, ...})
    multiple_outliers = list(li for li, v in outlier_indices.items() if v > n)  # li = row index, v = number of outlying columns
        
    return multiple_outliers   
# detect outliers from the numerical features: Age, SibSp, Parch and Fare
outliers_final = get_outliers(df_train,2,["Age","SibSp","Parch","Fare"]) # outliers = rows with at least three outlying values
#[27, 88, 159, 180, 201, 324, 341, 792, 846, 863]
df_train.loc[outliers_final] #show the outlying rows
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
#Drop outliers
df_train = df_train.drop(outliers_final, axis=0).reset_index(drop=True)

The distribution of Target Label

f, ax = plt.subplots(ncols=2, figsize=(8,4))
f.tight_layout()

df_train['Survived'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot('Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count plot - Survived')

plt.show()

  • It is evident that not many passengers survived the accident.

  • Out of the 881 passengers remaining in the training set (after dropping the outliers), only 340 survived, i.e. about 38.6% of the training set survived the crash.

  • The distribution of the target label is therefore somewhat imbalanced, but not severely so.

Exploratory Data Analysis (EDA)

1) Analysis of the features.

2) Finding any relations or trends considering multiple features.

Types of Features

  • Features fall into categorical, ordinal, discrete, and continuous types:

    • Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) / ordinal feature

    • Sex: male, female / categorical feature

    • Age: continuous feature

    • SibSp: # of siblings / spouses aboard the Titanic / discrete feature

    • Parch: # of parents / children aboard the Titanic / discrete feature

    • Ticket: ticket number / alphanumeric string

    • Fare: passenger fare / continuous feature

    • Cabin: cabin number / alphanumeric string

    • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) / categorical feature

Analysing The Features

Pclass

  • Ordinal data

  • Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd

df_train[['Pclass','Survived']].groupby(['Pclass']).count()
#, as_index=True
Survived
Pclass
1 213
2 184
3 484
#Only the count of survived=1 for each Pclass
#as_index=True : default
df_train[['Pclass','Survived']].groupby(['Pclass']).sum()
Survived
Pclass
1 134
2 87
3 119
#combine the previous tables
# all : margins=True
pd.crosstab(df_train.Pclass, df_train.Survived, margins=True).style.background_gradient(cmap='summer_r')
Survived 0 1 All
Pclass      
1 79 134 213
2 97 87 184
3 365 119 484
All 541 340 881
df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False).plot.bar()
# The better the Pclass, the higher the survival rate.

# Countplot in Seaborn
f, ax = plt.subplots(1,2, figsize=(18,8))
df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of Passengers By Pclass')
ax[0].set_ylabel('count')
sns.countplot('Pclass',hue='Survived',data=df_train, ax=ax[1])
ax[1].set_title('Pclass: Survived vs Dead')
plt.show()

  • Passengers in Pclass 1 were given very high priority during the rescue. Even though the number of passengers in Pclass 3 was much higher, their survival rate is very low, around 25%.

  • The survival rate for Pclass 1 is around 63%, while for Pclass 2 it is around 47%.

  • We conclude that Pclass affects survival (the target y).

  • We are going to take Pclass as a feature.

Sex

  • Categorical Feature/ Binary string
df_train[['Sex','Survived']].groupby(['Sex']).mean().sort_values(by='Survived', ascending=False)
Survived
Sex
female 0.747573
male 0.190559
pd.crosstab(df_train['Sex'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')
Survived 0 1 All
Sex      
female 78 231 309
male 463 109 572
All 541 340 881
f, ax = plt.subplots(1,2, figsize=(18,8))
df_train[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
#sns.barplot(x ='Sex', y = 'Survived', data = df_train)
ax[0].set_title('Survived vs Sex')
sns.countplot('Sex',hue='Survived',data=df_train,ax=ax[1])
ax[1].set_title('Sex: Survived vs Dead')
plt.show()

  • The number of men on the ship was much larger than the number of women, yet the number of women saved is more than twice the number of men saved. The survival rate for women is around 75%, while for men it is around 19%.

  • Sex may therefore also play an important role in predicting survival.

Both Sex and Pclass

  • Let’s look into survival rate with Sex and Pclass together.
#factorplot in seaborn
sns.factorplot('Pclass','Survived',hue='Sex', data=df_train, size=4, aspect=1.5, kind="bar")

  • We can easily infer that the survival rate for women from Pclass 1 is about 95-96%, as only 3 out of 94 women from Pclass 1 died.

  • It is evident that, irrespective of Pclass, women were given first priority during the rescue. Even men from Pclass 1 have a comparatively low survival rate.

  • In every class, the survival probability of women is higher than that of men.

  • The higher a passenger's class, the higher their chance of survival.

sns.factorplot(x='Sex', y='Survived', col='Pclass',
              data=df_train, saturation=1,
               size=4.5 , aspect=1)
plt.show()

Age

  • Continuous numeric feature
print('The oldest passenger: {:.1f} Years'.format(df_train['Age'].max()))
print('The youngest passenger: {:.1f} Years'.format(df_train['Age'].min()))
print('The mean age of the passengers: {:.1f} Years'.format(df_train['Age'].mean()))
The oldest passenger: 80.0 Years
The youngest passenger: 0.4 Years
The mean age of the passengers: 29.7 Years
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.distplot(df_train[df_train['Survived'] == 1]['Age'], ax=ax)
sns.distplot(df_train[df_train['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived == 1', 'Survived == 0'])
plt.show()

#kdeplot

  • There is a peak corresponding to young passengers aged roughly 20-30.

  • Very young passengers (ages 0-5) appear to have a higher chance of surviving; there is a clear peak in that range.

#Age distribution within classes
plt.figure(figsize=(8,6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='kde')

plt.xlabel('Age')
plt.title('Age distribution within Classes')
plt.legend(['1st Class','2nd Class','3rd Class'])

  • The higher the class, the higher the proportion of older passengers.
cummulate_survival_ratio = []
for i in range(1, 80):
    cummulate_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))
    
plt.figure(figsize=(7, 7))
plt.plot(cummulate_survival_ratio)
plt.title('Survival rate change depending on range of Age', y=1.02)
plt.ylabel('Survival rate')
plt.xlabel('Range of Age(0~x)')
plt.show()

Pclass, Sex, Age

# violinplot in Seaborn : Sex, Pclass, Age, Survived
#x axis : case (Pclass, Sex)
#y axis : distribution (Age)

f, ax = plt.subplots(ncols=2, figsize=(18,8))
sns.violinplot("Pclass","Age", hue="Survived", data=df_train, scale="count", split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot("Sex","Age", hue="Survived", data=df_train, scale='count', split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

  • Women and young passengers survived more than the others.

1) The number of children increases toward Pclass 3, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of Pclass.

2) Survival chances for passengers aged 20-50 in Pclass 1 are high, and even better for women.

3) For males, the survival chances decrease with age.

  • The Age feature has 177 null values. To replace these NaN values, we could assign them the mean age of the dataset.

  • But the problem is that passengers span many different ages, so a single mean is a poor fill value.

  • Instead, we will use the Name feature: the salutation (Mr, Mrs, Master, ...) gives a better hint of a passenger's age group.

df_train['Name'].head()
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
# Extract the salutation from Name: one or more letters ([A-Za-z]+) followed by a literal dot (\.)
df_train['Initial'] = df_train.Name.str.extract('([A-Za-z]+)\.')
#https://blog.naver.com/good5229/221889604699
pd.crosstab(df_train.Initial,df_train.Sex).T.style.background_gradient(cmap='summer_r') #Checking the Initials with the Sex
| Sex | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 177 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
| male | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 39 | 0 | 0 | 0 | 513 | 0 | 0 | 6 | 1 |
  • There are several rare titles such as Mlle and Mme; I will map them to Miss, and similarly group the remaining rare titles into Mr, Mrs, Miss, Master, or Other.
df_train['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
df_train.groupby('Initial')['Age'].mean() #lets check the average age by Initials
Initial
Master     4.574167
Miss      21.837838
Mr        32.773284
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64
g =sns.factorplot(x="Initial",y="Survived",data=df_train,kind="bar", size=4)
g.set_xticklabels(["Master","Miss","Mr","Mrs", "Other"])
#g = g.set_ylabels("survival probability")

  • Women and children first.
## Assigning the NaN Values with the Ceil values of the mean ages
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'),'Age']=33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'),'Age']=36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'),'Age']=5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'),'Age']=22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'),'Age']=46 
df_train.Age.isnull().sum()
df_train.Age.isnull().any() 
False
f, ax = plt.subplots(ncols=2, figsize=(18,8))
sns.distplot(df_train[df_train.Survived==0].Age, color='r', ax=ax[0])
ax[0].set_title('Survived=0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)

sns.distplot(df_train[df_train.Survived==1].Age, color='g', ax=ax[1])
ax[1].set_title('Survived=1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)

f,ax=plt.subplots(1,2,figsize=(20,10))
df_train[df_train['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
df_train[df_train['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

Observations:

1) Toddlers (age < 5) were saved in large numbers (the "women and children first" policy).

2) The oldest passenger (80 years old) was saved.

3) The largest number of deaths was in the 30-40 age group.

sns.factorplot('Pclass','Survived',col='Initial',data=df_train)
plt.show()

The "women and children first" policy thus holds true irrespective of class.

Embarked

  • Categorical Value
f, ax = plt.subplots(1,1, figsize=(7,7))
df_train[['Embarked','Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax)

f,ax=plt.subplots(2, 2, figsize=(20,15))
sns.countplot('Embarked', data=df_train, ax=ax[0,0])
ax[0,0].set_title('(1) No. Of Passengers Boarded')
sns.countplot('Embarked', hue='Sex', data=df_train, ax=ax[0,1])
ax[0,1].set_title('(2) Male-Female Split for Embarked')
sns.countplot('Embarked', hue='Survived', data=df_train, ax=ax[1,0])
ax[1,0].set_title('(3) Embarked vs Survived')
sns.countplot('Embarked', hue='Pclass', data=df_train, ax=ax[1,1])
ax[1,1].set_title('(4) Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()
# S has by far the highest passenger count.
# Survival was low for S largely because of the large number of Pclass 3 passengers who boarded there.
# The chance of survival is highest for port C (around 0.55) and lowest for S.

  • Most passengers boarded at S, the majority of them in Pclass 3.

  • Passengers who boarded at Cherbourg (C) had a better chance of surviving, probably because a large share of them were Pclass 1 and Pclass 2 passengers.

  • Southampton (S) looks like the port from which the majority of the rich passengers boarded, yet the chance of survival there is still low, because around 81% of the many Pclass 3 passengers who boarded at S did not survive.

  • Almost 95% of the passengers who boarded at Port Q were from Pclass 3.

sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=df_train)
plt.show()

  • Indeed, third class is the most frequent class for passengers coming from Southampton (S) and Queenstown (Q), whereas Cherbourg passengers are mostly in first class, which has the highest survival rate.

Filling Embarked NaN

Since most passengers boarded at Port S, we replace the NaN values in Embarked with S.

df_train['Embarked'].fillna('S',inplace=True)
df_train.Embarked.isnull().sum()
0

Family: SibSp + Parch

  • Discrete features

These features indicate whether a passenger was alone or travelling with family members.

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife

  • SibSp: discrete integer count

  • Parch: discrete integer count

pd.crosstab(df_train.SibSp,df_train.Survived).style.background_gradient(cmap='summer_r')
Survived 0 1
SibSp    
0 398 210
1 97 112
2 15 13
3 11 2
4 15 3
5 5 0
f,ax=plt.subplots(ncols=2, figsize=(14,7))
sns.barplot('SibSp','Survived',data=df_train,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
sns.factorplot('SibSp','Survived',data=df_train,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.show()

pd.crosstab(df_train.SibSp,df_train.Pclass).style.background_gradient(cmap='summer_r')
Pclass 1 2 3
SibSp      
0 137 120 351
1 71 55 83
2 5 8 15
3 0 1 12
4 0 0 18
5 0 0 5

The barplot and factorplot show that a passenger on board alone, with no siblings or spouse, has about a 34.5% survival rate. The survival rate roughly decreases as the number of siblings increases. Small families had a better chance of surviving than passengers travelling alone.

This makes sense: a passenger with family on board would try to save them rather than escape alone. Surprisingly, the survival rate for families with 5-8 members is 0%. Could the reason be Pclass?

It is indeed Pclass: the crosstab shows that passengers with SibSp > 3 were all in Pclass 3, and all of these large Pclass 3 families died.

  • Figure (1) - Family size ranges from 1 to 11.

  • Figures (2), (3) - Survival rate depends on family size; it is highest when the family consists of four people.

Fare

  • Continuous feature > Histogram
print('Highest Fare was:',df_train.Fare.max())
print('Lowest Fare was:',df_train.Fare.min())
print('Average Fare was:',df_train.Fare.mean())
Highest Fare was: 512.3292
Lowest Fare was: 0.0
Average Fare was: 31.121565607264436
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

The distribution of the Fare feature is heavily right-skewed. To reduce this skew (and help the model), we transform the data by taking log(x) for x > 0.

f,ax=plt.subplots(1,3,figsize=(20,8))
sns.distplot(df_train[df_train['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(df_train[df_train['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(df_train[df_train['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()

  • As Fare is also continuous, we can convert it into discrete values by binning.

  • All distributions are right-skewed.

df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()
df_train['Fare'] = df_train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

  • After the log transformation, the skewness of the data is significantly reduced.

  • This is an example of feature engineering.

Cabin

  • We are not going to include this feature, because roughly 77-78% of its values are NaN.

Ticket

  • Ticket numbers are heterogeneous alphanumeric strings with no obvious structure, so this feature is not used either.

Correlation Between The Features

df_train.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Initial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 1.981001 | NaN | S | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 4.266662 | C85 | C | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 2.070022 | NaN | S | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 3.972177 | C123 | S | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 2.085672 | NaN | S | Mr |
sns.heatmap(df_train.corr(),annot=True,cmap='RdBu',linewidths=0.2) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
#(+) correlation : Blue/ (-) correlation: Red

From the above heatmap, we notice that the features are not strongly correlated. The highest correlation is 0.41, between SibSp and Parch, so we can keep all the features.

Feature Engineering and Data Cleaning

  • Not all features are necessarily important; there may be redundant features that should be eliminated. We can also obtain new features by observing or extracting information from the existing ones.

Converting features into suitable form for modeling

Age: a continuous feature

If I ask you to group passengers by their age, how would you do it? If there are 30 people, there may be 30 different age values, which is problematic.

We need to convert these continuous values into categorical values, by either binning or normalization. I will use binning, i.e. grouping a range of ages into a single bin and assigning each bin a single value.

The maximum age of a passenger is 80, so let's divide the 0-80 range into 5 bins of size 80/5 = 16 (an equivalent pandas.cut one-liner is sketched after the plots below).

data = df_train
data['Age_band']=0
data.loc[data['Age']<=16, 'Age_band']=0
data.loc[(data['Age']>16) & (data['Age']<=32), 'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Initial | Age_band |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 1.981001 | NaN | S | Mr | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 4.266662 | C85 | C | Mrs | 2 |
data['Age_band'].value_counts() #checking the number of passenegers in each band
1    376
2    322
0    103
3     69
4     11
Name: Age_band, dtype: int64
sns.factorplot('Age_band','Survived',data=data,col='Pclass')
plt.show()

The survival rate decreases as the age increases irrespective of the Pclass.
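
As an aside, the same bands can be produced more compactly with pandas.cut; a minimal sketch (not part of the original cells), which should agree with the manual Age_band above:

# Five equal-width bins over (0, 80]; labels=False returns the bin index 0-4
age_band_cut = pd.cut(data['Age'], bins=[0, 16, 32, 48, 64, 80], labels=False)
print((age_band_cut == data['Age_band']).all())  # expected: True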

Fare: a continuous feature

Since Fare is also a continuous feature, we need to convert it into ordinal values. For this we will use pandas.qcut.

qcut splits the values into the requested number of quantile-based bins, so each bin contains approximately the same number of samples. Passing 4 gives four bins whose edges are the quartiles (here, of the log-transformed Fare).

data['Fare_Range']=pd.qcut(data['Fare'],4)
data.Fare_Range.value_counts()
(2.066, 2.671]     224
(-0.001, 2.066]    223
(2.671, 3.418]     217
(3.418, 6.239]     217
Name: Fare_Range, dtype: int64
data.groupby(['Fare_Range'])['Survived'].mean()
Fare_Range
(-0.001, 2.066]    0.197309
(2.066, 2.671]     0.303571
(2.671, 3.418]     0.456221
(3.418, 6.239]     0.594470
Name: Survived, dtype: float64

We can clearly see that as Fare_Range increases, the chance of survival increases.

We cannot pass the Fare_Range values as they are; we should convert them into single integer labels (using the quartile edges of the log-transformed Fare shown above), just as we did with Age_band.

data['Fare_cat']=0
# Fare was log-transformed above, so the category boundaries are the quartile
# edges of the log-transformed Fare (the Fare_Range edges shown above),
# not the raw-fare quartiles.
data.loc[data['Fare']<=2.066,'Fare_cat']=0
data.loc[(data['Fare']>2.066)&(data['Fare']<=2.671),'Fare_cat']=1
data.loc[(data['Fare']>2.671)&(data['Fare']<=3.418),'Fare_cat']=2
data.loc[data['Fare']>3.418,'Fare_cat']=3
sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()

Clearly, as Fare_cat increases, the chance of survival increases. This feature may become important during modeling, along with Sex.

Converting String Values into Numeric

We need to convert features such as Sex, Embarked, and Initial into numeric values.

#data['Sex'].replace(['male','female'],[0,1],inplace=True)
#data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
#data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)

# Label Encoding

def label_encoder(df, features):
    for f in features:
        le = LabelEncoder()
        le = le.fit(df[f])
        df[f] = le.transform(df[f])
    return df

features = ['Sex', 'Embarked', 'Initial'] #female=0, male=1 #'C', 'Q', 'S' #'Master', 'Miss', 'Mr', 'Mrs', 'Other'
data = label_encoder(data, features) 
data.head()
|   | Survived | Pclass | Sex | SibSp | Parch | Embarked | Initial | Age_band | Fare_cat | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 1 | 0 | 2 | 2 | 1 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 3 | 2 | 0 | 2 |
| 2 | 1 | 3 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 1 |
| 3 | 1 | 1 | 0 | 1 | 0 | 2 | 3 | 2 | 0 | 2 |
| 4 | 0 | 3 | 1 | 0 | 0 | 2 | 2 | 2 | 0 | 1 |

Adding a few new features

Family_Size and Alone

  • Family_Size is the sum of Parch, SibSp, and 1 (the passenger themselves). It gives a combined value so that we can check whether the survival rate has anything to do with the family size of the passengers.

  • We can imagine that large families may have had more difficulty evacuating while looking for their sisters, brothers, or parents.

  • Alone will denote whether a passenger is alone or not (a sketch for deriving it from Family_Size is given after the plots below).

# Family is composed of sibSp, parch, and me (1).

data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']+1 #family size
f,ax=plt.subplots(1, 3, figsize=(40,10))
sns.countplot('Family_Size', data=data, ax=ax[0])
ax[0].set_title('(1) No. Of Passengers Boarded', y=1.02)

sns.countplot('Family_Size', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('(2) Survived countplot depending on FamilySize', y=1.02)

data[['Family_Size', 'Survived']].groupby(['Family_Size'], as_index=True).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[2])
ax[2].set_title('(3) Survived rate depending on FamilySize', y=1.02)

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

  • Family_Size = 1 means the passenger is alone. Clearly, if you are alone (Family_Size = 1), the chance of survival is very low.

  • Family size seems to play an important role: survival rates are poor for large families (family size > 4).
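
The Alone flag described above does not appear in the cells shown here, although the correlation plot below refers to it; a minimal sketch of how it can be derived from Family_Size (my reconstruction, matching the description above):

# Alone = 1 when the passenger has no family on board (Family_Size == 1), else 0
data['Alone'] = 0
data.loc[data['Family_Size'] == 1, 'Alone'] = 1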

Removing redundant features

data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'], axis=1, inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

From the above correlation plot, we can see some positively related features, such as SibSp with Family_Size and Parch with Family_Size, and some negatively related ones, such as Alone with Family_Size.

Predictive Modeling

We have gained some insights from the EDA part, but insights alone cannot accurately tell whether a passenger will survive or die. So now we will predict whether a passenger survives using several classification algorithms. The following are the algorithms I will use to build the models:

1) Logistic Regression

2) Support Vector Machines (linear and radial)

3) Random Forest

4) K-Nearest Neighbours

5) Naive Bayes

6) Decision Tree

Running Basic Algorithms

train, test=train_test_split(data, test_size=0.3, random_state=121, stratify=data['Survived'])

The data is split in a stratified fashion: each set contains approximately the same percentage of samples of each target class.

train_X = train.iloc[:,1:]
train_Y = train.iloc[:,0]

test_X = test.iloc[:,1:]
test_Y = test.iloc[:,0]

X = data.iloc[:,1:]
Y = data.iloc[:,0] #data.Survived

Radial Support Vector Machines(rbf-SVM)

Tuning Hyperparameters

  • Kernel: The kernel transforms the input data into the required form. There are various kernel functions, such as linear, polynomial, and radial basis function (RBF). Polynomial and RBF kernels are useful for non-linear decision boundaries: they compute the separating surface in a higher-dimensional space. For classes that are curved or otherwise not linearly separable, a more complex kernel can lead to a more accurate classifier.

  • Regularization: In scikit-learn, regularization is controlled by the C parameter. C is the penalty on the misclassification (error) term, and it tells the SVM optimization how much error is bearable; this is how you control the trade-off between the decision boundary and the misclassification term. A smaller value of C produces a wider-margin hyperplane that tolerates more misclassifications, while a larger value of C produces a narrower-margin hyperplane that penalizes misclassification more heavily.

  • Gamma: A lower value of gamma fits the training dataset loosely, whereas a higher value fits it very closely, which can cause over-fitting. In other words, with a high value of gamma only nearby points influence the decision boundary, while with a low value of gamma points farther away are also taken into account.

model=svm.SVC(kernel='rbf',C=1,gamma=0.1) # default C=1
model.fit(train_X,train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is ',metrics.accuracy_score(prediction1,test_Y))
Accuracy for rbf SVM is  0.8188679245283019

Linear Support Vector Machine(linear-SVM)

model=svm.SVC(kernel='linear',C=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
Accuracy for linear SVM is 0.7811320754716982

Logistic Regression

model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction3,test_Y))
The accuracy of the Logistic Regression is 0.7584905660377359

Decision Tree

model=DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction4=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction4,test_Y))
The accuracy of the Decision Tree is 0.7924528301886793

K-Nearest Neighbours(KNN)

model=KNeighborsClassifier() 
model.fit(train_X,train_Y)
prediction5=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction5,test_Y))
The accuracy of the KNN is 0.7849056603773585

The accuracy of the KNN model changes as we change the value of the n_neighbors parameter. The default value is 5. Let's check the accuracy for various values of n_neighbors.

a_index=list(range(1,11)) #less than 11. 
a=pd.Series() # Creating empty Series.
x=[0,1,2,3,4,5,6,7,8,9,10]

for i in list(range(1,11)):
    model=KNeighborsClassifier(n_neighbors=i) 
    model.fit(train_X,train_Y)
    prediction=model.predict(test_X)
    a=a.append(pd.Series(metrics.accuracy_score(prediction,test_Y)))

plt.plot(a_index, a)
plt.xticks(x)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
print('Accuracies for different values of n are:',a.values,'with the max value as ',a.values.max())

Accuracies for different values of n are: [0.76603774 0.75849057 0.76981132 0.78113208 0.78490566 0.78113208
 0.79622642 0.79245283 0.79622642 0.77735849] with the max value as  0.7962264150943397

Gaussian Naive Bayes

model=GaussianNB()
model.fit(train_X,train_Y)
prediction6=model.predict(test_X)
print('The accuracy of the NaiveBayes is',metrics.accuracy_score(prediction6,test_Y))
The accuracy of the NaiveBayes is 0.8

Random Forests

model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_Y)
prediction7=model.predict(test_X)
print('The accuracy of the Random Forests is',metrics.accuracy_score(prediction7,test_Y))
The accuracy of the Random Forests is 0.8150943396226416

Cross Validation

Can we be confident that these accuracies will hold for every new test set? No, because we cannot determine in advance which instances the classifier will be trained on. As the training and testing data change, the accuracy also changes; it may increase or decrease. This is known as model variance.

To reduce this effect and obtain a more general estimate of performance, we use cross-validation.

from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction

mean=[]
accuracy=[]
std=[]
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models=[svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]
for i in models:
    model = i
    cv_result = cross_val_score(model, X, Y, cv = 5, scoring = "accuracy")
    mean.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
    
new_models_dataframe2=pd.DataFrame({'CV Mean':mean,'Std':std}, index=classifiers)       
new_models_dataframe2
| Classifier | CV Mean | Std |
|---|---|---|
| Linear Svm | 0.787725 | 0.020357 |
| Radial Svm | 0.830849 | 0.021551 |
| Logistic Regression | 0.787757 | 0.014389 |
| KNN | 0.801374 | 0.023998 |
| Decision Tree | 0.792309 | 0.023999 |
| Naive Bayes | 0.805919 | 0.015594 |
| Random Forest | 0.794581 | 0.023920 |
plt.subplots(figsize=(12,6))
box=pd.DataFrame(accuracy,index=[classifiers])
box.T.boxplot()

accuracy
[array([0.80225989, 0.80681818, 0.78409091, 0.75      , 0.79545455]),
 array([0.85310734, 0.82386364, 0.81818182, 0.80113636, 0.85795455]),
 array([0.7740113 , 0.80113636, 0.79545455, 0.76704545, 0.80113636]),
 array([0.79096045, 0.76704545, 0.80113636, 0.80681818, 0.84090909]),
 array([0.76836158, 0.80113636, 0.77840909, 0.77840909, 0.83522727]),
 array([0.79096045, 0.80681818, 0.80113636, 0.79545455, 0.83522727]),
 array([0.76836158, 0.80113636, 0.79545455, 0.77272727, 0.83522727])]

Classification accuracy can sometimes be misleading, for example when the classes are imbalanced. We can get a more detailed summary with a confusion matrix, which shows where the model went wrong, i.e. which class it predicted incorrectly.

Confusion Matrix

It gives the number of correct and incorrect classifications made by the classifier.

f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()

Interpreting Confusion Matrix

The main diagonal shows the number of correct predictions for each class, while the off-diagonal entries show the wrong predictions. Let's consider the first plot, for rbf-SVM:

1) The number of correct predictions is 491 (for dead) + 247 (for survived), which corresponds to the roughly 83% cross-validated accuracy we obtained earlier.

2) Errors: 58 dead passengers were wrongly classified as survived, and 95 survivors as dead. Thus the model makes more mistakes by predicting survivors as dead.

By looking at all the matrices, we can say that rbf-SVM has a higher chance of correctly predicting dead passengers, while Naive Bayes has a higher chance of correctly predicting passengers who survived.
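
To complement the confusion matrices, the per-class precision and recall can be summarized in one call; a small aside (not part of the original notebook) for the rbf-SVM:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 from the same 10-fold cross-validated predictions
y_pred = cross_val_predict(svm.SVC(kernel='rbf'), X, Y, cv=10)
print(classification_report(Y, y_pred, target_names=['Dead', 'Survived']))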

Hyper-Parameters Tuning

Machine learning models can feel like a black box with default parameter values. Parameters such as C and gamma for the SVM, and their counterparts for other classifiers, are called hyper-parameters; adjusting them to get a better model is known as hyper-parameter tuning.

We will tune the hyper-parameters for two of the better classifiers, i.e. the SVM and Random Forests.

### SVM
from sklearn.model_selection import GridSearchCV
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
hyper={'kernel':kernel,'C':C,'gamma':gamma}
gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,cv=5, verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
0.8308487416538265
SVC(C=0.6, gamma=0.1)
### Random Forests

n_estimators=range(100,1000,100)
hyper={'n_estimators':n_estimators}
gd=GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.7957113507960966
RandomForestClassifier(n_estimators=200, random_state=0)
  • The best score for rbf-SVM is about 83.08% with C=0.6 and gamma=0.1. For Random Forest, the best score is about 79.57% with n_estimators=200.

Ensembling

Ensembling is a good way to increase the accuracy or performance of a model. In simple words, it is the combination of various simple models to create a single powerful model.

Let's say we want to buy a phone and ask many people about it, each judging it on different criteria. We can then make a strong judgement about the product after combining all their opinions. This is ensembling, and it improves the stability of a model. Ensembling can be done in several ways:

1) Voting Classifier

2) Bagging

3) Boosting.

Voting Classifier

It is the simplest way of combining the predictions of many different machine learning models. It gives an averaged prediction based on the predictions of all the submodels. The submodels or base models are all of different types.

from sklearn.ensemble import VotingClassifier
ensemble_lin_rbf=VotingClassifier(estimators=[('KNN',KNeighborsClassifier(n_neighbors=10)),
                                              ('RBF',svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
                                              ('RFor',RandomForestClassifier(n_estimators=500,random_state=0)),
                                              ('LR',LogisticRegression(C=0.05)),
                                              ('DT',DecisionTreeClassifier(random_state=0)),
                                              ('NB',GaussianNB()),
                                              ('svm',svm.SVC(kernel='linear',probability=True))
                                             ], 
                       voting='soft').fit(train_X,train_Y)
print('The accuracy for ensembled model is:',ensemble_lin_rbf.score(test_X,test_Y))
cross=cross_val_score(ensemble_lin_rbf,X,Y, cv = 10,scoring = "accuracy")
print('The cross validated score is',cross.mean())
The accuracy for ensembled model is: 0.8150943396226416
The cross validated score is 0.8195097037793667

Bagging

Bagging is a general ensemble method. It works by applying similar classifiers to small partitions (bootstrap samples) of the dataset and then averaging all the predictions. Due to the averaging, there is a reduction in variance. Unlike the voting classifier, bagging uses copies of the same type of classifier.

Bagged KNN

Bagging works best with high-variance models, such as decision trees. We can also use KNN with a small value of n_neighbors, since a small n_neighbors gives a high-variance model.

from sklearn.ensemble import BaggingClassifier
model=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())
The accuracy for bagged KNN is: 0.8150943396226416
The cross validated score for bagged KNN is: 0.8070863125638408

Bagged DecisionTree

model=BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
The accuracy for bagged Decision Tree is: 0.8113207547169812
The cross validated score for bagged Decision Tree is: 0.7991445352400408

Boosting

Boosting is an ensembling technique that uses sequential learning of classifiers: a step-by-step enhancement of a weak model. Boosting works as follows:

A model is first trained on the complete dataset. It will get some instances right and some wrong. In the next iteration, the learner focuses more on the wrongly predicted instances by giving them more weight, trying to predict them correctly. This iterative process continues, adding new classifiers to the ensemble, until a limit on the number of estimators (or on the accuracy) is reached.

AdaBoost(Adaptive Boosting)

The weak learner or estimator in this case is a decision tree (a depth-1 stump by default), but we can change the default base_estimator to any algorithm of our choice, as sketched after the cell below.

from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
The cross validated score for AdaBoost is: 0.825191521961185
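
As noted above, the base estimator can be swapped; a minimal sketch (an illustrative, untuned choice, not from the original notebook) using logistic regression as the weak learner:

# AdaBoost with logistic regression as the weak learner instead of the default stump
# (in newer scikit-learn versions the argument is named estimator instead of base_estimator)
ada_lr = AdaBoostClassifier(base_estimator=LogisticRegression(), n_estimators=200,
                            learning_rate=0.1, random_state=0)
result = cross_val_score(ada_lr, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for AdaBoost with an LR base is:', result.mean())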

Stochastic Gradient Boosting

Here too the weak learner is a Decision Tree.

from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
The cross validated score for Gradient Boosting is: 0.8104954034729316

XGBoost

import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result=cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for XGBoost is:',result.mean())
The cross validated score for XGBoost is: 0.8002808988764045

We got the highest cross-validated accuracy among the boosting methods with AdaBoost. We will try to increase it with hyper-parameter tuning.

Hyper-Parameter Tuning for AdaBoost

n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(), param_grid=hyper, verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 120 candidates, totalling 600 fits
0.8308487416538265
AdaBoostClassifier(learning_rate=0.05, n_estimators=200)

The maximum accuracy we can get with AdaBoost here is about 83.08%, with n_estimators=200 and learning_rate=0.05.

Confusion Matrix for the Best Model

ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
result=cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap='RdYlGn',annot=True,fmt='2.0f')
plt.show()

Feature Importance

f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()

We can see the important features for the various classifiers, such as Random Forests and AdaBoost.

Observations:

1) Some of the common important features are Initial, Fare_cat, Pclass, and Family_Size.

2) The Sex feature does not seem to carry much importance on its own, which is surprising since we saw earlier that Sex combined with Pclass was a very good differentiating factor. Sex looks important only in Random Forests.

However, the Initial feature is near the top in many classifiers, and we had already seen the strong correlation between Sex and Initial; both essentially encode gender.

3) Similarly, Pclass and Fare_cat both reflect the status of the passengers, and Family_Size overlaps with Alone, Parch, and SibSp.
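
The notebook stops at cross-validated scores. To actually submit to Kaggle, the same feature engineering has to be applied to df_test before predicting; the sketch below is my own completion of that step, mirroring the choices made above (the title 'Dona' appears only in the test set and is mapped to 'Mrs'; the output file name submission.csv is an assumption):

# --- Sketch: preprocess df_test the same way as the training data and write a submission ---

# Title extraction and grouping (note the extra 'Dona' title in the test set)
df_test['Initial'] = df_test.Name.str.extract('([A-Za-z]+)\.')
df_test['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don','Dona'],
                           ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr','Mrs'],inplace=True)

# Age imputation with the same per-title fill values used for the training set
for title, age in {'Mr': 33, 'Mrs': 36, 'Master': 5, 'Miss': 22, 'Other': 46}.items():
    df_test.loc[df_test.Age.isnull() & (df_test.Initial == title), 'Age'] = age

# Same bands/categories as above (df_test['Fare'] was already log-transformed earlier;
# the Fare_cat edges are the quartiles of the log-transformed training Fare)
df_test['Age_band'] = pd.cut(df_test['Age'], bins=[0, 16, 32, 48, 64, 80], labels=False)
df_test['Fare_cat'] = pd.cut(df_test['Fare'], bins=[-0.001, 2.066, 2.671, 3.418, 10], labels=False)
df_test['Family_Size'] = df_test['Parch'] + df_test['SibSp'] + 1
df_test['Alone'] = 0
df_test.loc[df_test['Family_Size'] == 1, 'Alone'] = 1

# Encode the string features; the categories coincide with the training set here,
# so the codes match (fitting the encoders on the training data would be the safer pattern)
df_test = label_encoder(df_test, ['Sex', 'Embarked', 'Initial'])

# Fit the tuned AdaBoost model on all training data and predict on the test set
X_test = df_test[X.columns]
final_model = AdaBoostClassifier(n_estimators=200, learning_rate=0.05, random_state=0)
final_model.fit(X, Y)
submission = pd.DataFrame({'PassengerId': df_test['PassengerId'],
                           'Survived': final_model.predict(X_test)})
submission.to_csv('submission.csv', index=False)  # file name is an assumption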

I hope all of you gained some insights into machine learning from this notebook.
