Kaggle-Titanic Data
- Goal: We use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
- Metric: The percentage of passengers you correctly predict, known as accuracy.
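As a small illustration of my own (the labels below are made up), accuracy is simply the fraction of predictions that match the true labels, which is what scikit-learn's accuracy_score computes:
from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0, 1] #hypothetical true labels
y_pred = [0, 1, 0, 0, 1] #hypothetical predictions
print(accuracy_score(y_true, y_pred)) #4 of the 5 predictions match -> 0.8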
Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from collections import Counter
plt.style.use('seaborn')
sns.set(font_scale=1)
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('input/train.csv')
df_test = pd.read_csv('input/test.csv')
df_train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df_train.info()
df_test.info()
print(df_train.describe())
print(df_test.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200
There are null values: the non-null counts of Age, Cabin, and Embarked in the training set (Age, Cabin, and Fare in the test set) are lower than that of PassengerId.
Such missing values are incompatible with scikit-learn estimators, which assume that every value in an array is numerical and meaningful, so we will have to handle them.
## Null data check
#df_train = df_train.fillna(np.nan)
df_train.isnull().sum() #checking for total null values
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
#df_test = df_test.fillna(np.nan)
df_test.isnull().sum()
PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 1 Cabin 327 Embarked 0 dtype: int64
Age, Cabin and Embarked features have null values.
for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100*(df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)
column: PassengerId Percent of NaN value: 0.00% column: Survived Percent of NaN value: 0.00% column: Pclass Percent of NaN value: 0.00% column: Name Percent of NaN value: 0.00% column: Sex Percent of NaN value: 0.00% column: Age Percent of NaN value: 19.87% column: SibSp Percent of NaN value: 0.00% column: Parch Percent of NaN value: 0.00% column: Ticket Percent of NaN value: 0.00% column: Fare Percent of NaN value: 0.00% column: Cabin Percent of NaN value: 77.10% column: Embarked Percent of NaN value: 0.22%
for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100*(df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)
column: PassengerId Percent of NaN value: 0.00% column: Pclass Percent of NaN value: 0.00% column: Name Percent of NaN value: 0.00% column: Sex Percent of NaN value: 0.00% column: Age Percent of NaN value: 20.57% column: SibSp Percent of NaN value: 0.00% column: Parch Percent of NaN value: 0.00% column: Ticket Percent of NaN value: 0.00% column: Fare Percent of NaN value: 0.24% column: Cabin Percent of NaN value: 78.23% column: Embarked Percent of NaN value: 0.00%
## Outlier detection
- I used the Tukey method (Tukey JW., 1977) to detect outliers.
def get_outliers(df, n, columns):
    outlier_indices = []
    for col in columns:
        q1 = np.percentile(df[col], 25)
        q3 = np.percentile(df[col], 75)
        iqr = q3 - q1
        iqr_weight = iqr*1.5
        outlier_lists = df[(df[col] < q1 - iqr_weight) | (df[col] > q3 + iqr_weight)].index
        outlier_indices.extend(outlier_lists)
    # select observations containing more than n outlying values
    outlier_indices = Counter(outlier_indices) #e.g. Counter({row_index: number of outlying columns, ...})
    multiple_outliers = list(li for li, v in outlier_indices.items() if v > n) #li = row index, v = count of outlying values
    return multiple_outliers
# detect outliers from the numerical features: Age, SibSp, Parch and Fare
outliers_final = get_outliers(df_train, 2, ["Age","SibSp","Parch","Fare"]) # rows with at least three outlying values
#[27, 88, 159, 180, 201, 324, 341, 792, 846, 863]
df_train.loc[outliers_final] #show the outlier rows
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.00 | C23 C25 C27 | S |
| 792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S |
#Drop outliers
df_train = df_train.drop(outliers_final, axis=0).reset_index(drop=True)
The Distribution of the Target Label
f, ax = plt.subplots(ncols=2, figsize=(8,4))
f.tight_layout()
df_train['Survived'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot('Survived', data=df_train, ax=ax[1])
ax[1].set_title('Count plot - Survived')
plt.show()
- It is evident that not many passengers survived the accident.
- Of the 891 passengers in the original training set, only about 342 (38.4%) survived the crash.
- The distribution of the target label is only mildly imbalanced (roughly 38% vs 62%).
Exploratory Data Analysis (EDA)
1) Analysis of the features.
2) Finding any relations or trends considering multiple features.
Types of Features
- Feature types: categorical, ordinal, discrete, and continuous
- Pclass: 1 = 1st, 2 = 2nd, 3 = 3rd / ordinal feature
- Sex: male, female / categorical feature
- Age: continuous feature
- SibSp: # of siblings / spouses aboard the Titanic / discrete feature
- Parch: # of parents / children aboard the Titanic / discrete feature
- Ticket: ticket number / alphanumeric string
- Fare: passenger fare / continuous feature
- Cabin: cabin number / alphanumeric string
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) / categorical feature
Analysing The Features
Pclass
- Ordinal feature
- Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
df_train[['Pclass','Survived']].groupby(['Pclass']).count()
#, as_index=True
| Pclass | Survived |
|---|---|
| 1 | 213 |
| 2 | 184 |
| 3 | 484 |
#The sum of Survived gives the number of survivors (Survived=1) in each Pclass
#as_index=True is the default
df_train[['Pclass','Survived']].groupby(['Pclass']).sum()
| Pclass | Survived |
|---|---|
| 1 | 134 |
| 2 | 87 |
| 3 | 119 |
#combine the previous tables
# all : margins=True
pd.crosstab(df_train.Pclass, df_train.Survived, margins=True).style.background_gradient(cmap='summer_r')
| Pclass \ Survived | 0 | 1 | All |
|---|---|---|---|
| 1 | 79 | 134 | 213 |
| 2 | 97 | 87 | 184 |
| 3 | 365 | 119 | 484 |
| All | 541 | 340 | 881 |
df_train[['Pclass', 'Survived']].groupby(['Pclass']).mean().sort_values(by='Survived', ascending=False).plot.bar()
# The better Pclass is, the higher survival rate is.
# Countplot in Seaborn
f, ax = plt.subplots(1,2, figsize=(18,8))
df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of Passengers By Pclass')
ax[0].set_ylabel('count')
sns.countplot('Pclass',hue='Survived',data=df_train, ax=ax[1])
ax[1].set_title('Pclass: Survived vs Dead')
plt.show()
- Passengers of Pclass 1 were given very high priority during the rescue. Even though the number of passengers in Pclass 3 was much higher, their survival rate is very low, around 25%.
- The survival rate for Pclass 1 is around 63%, while for Pclass 2 it is around 47%.
- We conclude that Pclass affects survival, our target y.
- We are going to keep Pclass as a feature.
Sex
- Categorical Feature/ Binary string
df_train[['Sex','Survived']].groupby(['Sex']).mean().sort_values(by='Survived', ascending=False)
| Sex | Survived |
|---|---|
| female | 0.747573 |
| male | 0.190559 |
pd.crosstab(df_train['Sex'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')
| Sex \ Survived | 0 | 1 | All |
|---|---|---|---|
| female | 78 | 231 | 309 |
| male | 463 | 109 | 572 |
| All | 541 | 340 | 881 |
f, ax = plt.subplots(1,2, figsize=(18,8))
df_train[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
#sns.barplot(x ='Sex', y = 'Survived', data = df_train)
ax[0].set_title('Survived vs Sex')
sns.countplot('Sex',hue='Survived',data=df_train,ax=ax[1])
ax[1].set_title('Sex: Survived vs Dead')
plt.show()
- There were far more men than women on the ship, yet the number of women saved is roughly twice the number of men saved. The survival rate for women is around 75%, while for men it is around 19%.
- Sex therefore also looks like an important feature for predicting survival.
Both Sex and Pclass
- Let’s look into survival rate with Sex and Pclass together.
#factorplot in seaborn
sns.factorplot('Pclass','Survived',hue='Sex', data=df_train, size=4, aspect=1.5, kind="bar")
- We can easily infer that survival for women from Pclass 1 is about 95-96%, as only 3 out of 94 women from Pclass 1 died.
- It is evident that, irrespective of Pclass, women were given first priority during the rescue; even men from Pclass 1 have a fairly low survival rate.
- In every class, the survival probability of women is higher than that of men.
- The higher a passenger's class, the higher their chance of survival.
sns.factorplot(x='Sex', y='Survived', col='Pclass',
               data=df_train, saturation=1,
               size=4.5, aspect=1)
plt.show()
Age
- Continuous feature
print('The oldest passenger: {:.1f} Years'.format(df_train['Age'].max()))
print('The youngest passenger: {:.1f} Years'.format(df_train['Age'].min()))
print('The mean age of the passengers: {:.1f} Years'.format(df_train['Age'].mean()))
The oldest passenger: 80.0 Years The youngest passenger: 0.4 Years The mean age of the passengers: 29.7 Years
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.distplot(df_train[df_train['Survived'] == 1]['Age'], ax=ax)
sns.distplot(df_train[df_train['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived == 1', 'Survived == 0'])
plt.show()
#kdeplot
- There is a peak corresponding to young passengers between 20 and 30.
- Very young passengers between 0 and 5 years clearly had a better chance of surviving (there is a distinct peak).
#Age distribution within classes
plt.figure(figsize=(8,6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='kde')
plt.xlabel('Age')
plt.title('Age distribution within Classes')
plt.legend(['1st Class','2nd Class','3rd Class'])
- The higher the class, the larger the proportion of older passengers.
cumulative_survival_ratio = []
for i in range(1, 80):
    cumulative_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))
plt.figure(figsize=(7, 7))
plt.plot(cumulative_survival_ratio)
plt.title('Survival rate change depending on range of Age', y=1.02)
plt.ylabel('Survival rate')
plt.xlabel('Range of Age(0~x)')
plt.show()
Pclass, Sex, Age
# violinplot in Seaborn : Sex, Pclass, Age, Survived
#x axis : case (Pclass, Sex)
#y axis : distribution (Age)
f, ax = plt.subplots(ncols=2, figsize=(18,8))
sns.violinplot("Pclass","Age", hue="Survived", data=df_train, scale="count", split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age", hue="Survived", data=df_train, scale='count', split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()
- Women and young people survived more than the others.
1) The number of children increases with Pclass, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of Pclass.
2) Survival chances for passengers aged 20-50 from Pclass 1 are high, and even better for women.
3) For males, the survival chances decrease with increasing age.
- The Age feature has 177 null values. To replace these NaN values, we could assign them the mean age of the dataset.
- But the problem is that the passengers span many different ages, so a single mean would be a crude estimate.
- We will therefore look at the Name feature and use the title it contains.
df_train['Name'].head()
0 Braund, Mr. Owen Harris 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 Heikkinen, Miss. Laina 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 Allen, Mr. William Henry Name: Name, dtype: object
df_train['Initial']=0
df_train['Initial']=df_train.Name.str.extract('([A-Za-z]+)\.')
#[A-Za-z]+ matches a run of letters and \. requires a literal dot right after it, so this captures the title (Mr, Mrs, Miss, ...) from the name
#https://blog.naver.com/good5229/221889604699
pd.crosstab(df_train.Initial,df_train.Sex).T.style.background_gradient(cmap='summer_r') #Checking the Initials with the Sex
| Sex \ Initial | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 177 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
| male | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 39 | 0 | 0 | 0 | 513 | 0 | 0 | 6 | 1 |
- There are some rare titles such as Mlle and Mme, which we will treat as Miss, and we will similarly map the other rare titles onto Mr, Mrs, Miss, or Other.
df_train['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
df_train.groupby('Initial')['Age'].mean() #lets check the average age by Initials
Initial Master 4.574167 Miss 21.837838 Mr 32.773284 Mrs 35.981818 Other 45.888889 Name: Age, dtype: float64
g =sns.factorplot(x="Initial",y="Survived",data=df_train,kind="bar", size=4)
g.set_xticklabels(["Master","Miss","Mr","Mrs", "Other"])
#g = g.set_ylabels("survival probability")
- Women and children first.
## Assigning the NaN ages the ceiling of the mean age for each title
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'),'Age']=33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'),'Age']=36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'),'Age']=5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'),'Age']=22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'),'Age']=46
df_train.Age.isnull().sum()
df_train.Age.isnull().any()
False
f, ax = plt.subplots(ncols=2, figsize=(18,8))
sns.distplot(df_train[df_train.Survived==0].Age, color='r', ax=ax[0])
ax[0].set_title('Survived=0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
sns.distplot(df_train[df_train.Survived==1].Age, color='g', ax=ax[1])
ax[1].set_title('Survived=1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
f,ax=plt.subplots(1,2,figsize=(20,10))
df_train[df_train['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
df_train[df_train['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()
Observations:
1) Toddlers (age < 5) were saved in large numbers (the "women and children first" policy).
2) The oldest passenger (80 years) was saved.
3) The maximum number of deaths was in the 30-40 age group.
sns.factorplot('Pclass','Survived',col='Initial',data=df_train)
plt.show()
The "women and children first" policy thus holds true irrespective of the class.
Embarked
- Categorical Value
f, ax = plt.subplots(1,1, figsize=(7,7))
df_train[['Embarked','Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax)
f,ax=plt.subplots(2, 2, figsize=(20,15))
sns.countplot('Embarked', data=df_train, ax=ax[0,0])
ax[0,0].set_title('(1) No. Of Passengers Boarded')
sns.countplot('Embarked', hue='Sex', data=df_train, ax=ax[0,1])
ax[0,1].set_title('(2) Male-Female Split for Embarked')
sns.countplot('Embarked', hue='Survived', data=df_train, ax=ax[1,0])
ax[1,0].set_title('(3) Embarked vs Survived')
sns.countplot('Embarked', hue='Pclass', data=df_train, ax=ax[1,1])
ax[1,1].set_title('(4) Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()
# The count of S has the highest value.
# The reason why the survival was low in S is the number of people in the 3rd class.
#The chances for survival for Port C is highest around 0.55 while it is lowest for S.
- The maximum number of passengers boarded at S, the majority of them from Pclass 3.
- Passengers who boarded at Cherbourg (C) had a better chance of surviving, perhaps because a large share of them were Pclass 1 and Pclass 2 passengers.
- Southampton (S) looks to be the port from which the majority of the rich boarded, yet the chance of survival there is still low, because most S passengers were from Pclass 3, around 81% of whom did not survive.
- At Port Q, almost 95% of the passengers were from Pclass 3.
sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=df_train)
plt.show()
- Indeed, third class is the most frequent class for passengers coming from Southampton (S) and Queenstown (Q), whereas Cherbourg (C) passengers are mostly in first class, which has the highest survival rate.
Filling Embarked NaN
Since we saw that the maximum number of passengers boarded at Port S, we replace the two missing Embarked values with S.
df_train['Embarked'].fillna('S',inplace=True)
df_train.Embarked.isnull().sum()
0
Family - SibSp + Parch
- Discrete features
- Together, these features tell us whether a person was alone or travelling with family members.
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife
- SibSp: number of siblings/spouses aboard the Titanic (non-negative integer)
- Parch: number of parents/children aboard the Titanic (non-negative integer)
pd.crosstab(df_train.SibSp,df_train.Survived).style.background_gradient(cmap='summer_r')
| Survived | 0 | 1 |
|---|---|---|
| SibSp | ||
| 0 | 398 | 210 |
| 1 | 97 | 112 |
| 2 | 15 | 13 |
| 3 | 11 | 2 |
| 4 | 15 | 3 |
| 5 | 5 | 0 |
f,ax=plt.subplots(ncols=2, figsize=(14,7))
sns.barplot('SibSp','Survived',data=df_train,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
sns.factorplot('SibSp','Survived',data=df_train,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.show()
pd.crosstab(df_train.SibSp,df_train.Pclass).style.background_gradient(cmap='summer_r')
| Pclass | 1 | 2 | 3 |
|---|---|---|---|
| SibSp | |||
| 0 | 137 | 120 | 351 |
| 1 | 71 | 55 | 83 |
| 2 | 5 | 8 | 15 |
| 3 | 0 | 1 | 12 |
| 4 | 0 | 0 | 18 |
| 5 | 0 | 0 | 5 |
The barplot and factorplot show that a passenger who is alone on board, with no siblings or spouse, has a 34.5% survival rate. The rate roughly decreases as the number of siblings increases; small families had a better chance of surviving than passengers travelling alone.
This makes sense: a passenger with family on board would try to save them rather than only themselves. Surprisingly, the survival rate for families with 5-8 members is 0%. Could the reason be Pclass?
The reason is indeed Pclass. The crosstab shows that passengers with SibSp > 3 were all in Pclass 3, and all of these large Pclass 3 families died.
- Figure (1) - The family size ranges from 1 to 11.
- Figures (2), (3) - Survival rate depending on family size; families of four have the highest survival rate.
Fare
- Continuous feature, visualized with a histogram
print('Highest Fare was:',df_train.Fare.max())
print('Lowest Fare was:',df_train.Fare.min())
print('Average Fare was:',df_train.Fare.mean())
Highest Fare was: 512.3292 Lowest Fare was: 0.0 Average Fare was: 31.121565607264436
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')
The distribution of the Fare feature is heavily right-skewed. To reduce this skew and help the model, we will apply a log transform (log(x) for x > 0, leaving x = 0 unchanged).
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.distplot(df_train[df_train['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(df_train[df_train['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(df_train[df_train['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()
- Since Fare is also continuous, we can later convert it into discrete values by binning.
- The fare distributions of all three classes are right-skewed.
df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean() #fill the single missing test-set Fare with the mean fare
df_train['Fare'] = df_train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')
- After the log transformation, the skewness of the data is significantly reduced.
- This is a simple example of feature engineering.
Cabin
- We are not going to use this feature because roughly 80% of its values are NaN.
Ticket
- Ticket numbers come in many different formats, so we will not use this feature either.
Correlation Between The Features
df_train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Initial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 1.981001 | NaN | S | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 4.266662 | C85 | C | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 2.070022 | NaN | S | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 3.972177 | C123 | S | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 2.085672 | NaN | S | Mr |
sns.heatmap(df_train.corr(),annot=True,cmap='RdBu',linewidths=0.2) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()
#(+) correlation : Blue/ (-) correlation: Red
From the heatmap above we see that the features are not strongly correlated with each other; the highest correlation is 0.41, between SibSp and Parch. So we can keep all of the features.
Feature Engineering and Data Cleaning
- Not every feature will be important, and redundant ones should be eliminated. We can also derive new features by observing or extracting information from the existing ones.
Converting features into suitable form for modeling
Age: a continuous feature
If we group passengers by their exact age, 30 people could easily have 30 different age values, which is problematic for a model.
We need to convert these continuous values into categorical values, either by binning or by normalization. I will use binning, i.e. group a range of ages into a single bin and assign each bin a single value.
The maximum age of a passenger is 80, so let's divide the range 0-80 into 5 bins of size 80/5 = 16 each.
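For reference, a minimal sketch of my own (not part of the original notebook) showing that pandas can do the same equal-width binning in a single pd.cut call; age_band_alt is a hypothetical name, and the loc assignments below produce the same bands:
age_band_alt = pd.cut(df_train['Age'], bins=[0, 16, 32, 48, 64, 80], labels=[0, 1, 2, 3, 4]) #five equal-width bins of size 16
print(age_band_alt.value_counts()) #should match the Age_band counts computed below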
data = df_train
data['Age_band']=0
data.loc[data['Age']<=16, 'Age_band']=0
data.loc[(data['Age']>16) & (data['Age']<=32), 'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Initial | Age_band |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 1.981001 | NaN | S | Mr | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 4.266662 | C85 | C | Mrs | 2 |
data['Age_band'].value_counts() #checking the number of passenegers in each band
1 376 2 322 0 103 3 69 4 11 Name: Age_band, dtype: int64
sns.factorplot('Age_band','Survived',data=data,col='Pclass')
plt.show()
The survival rate decreases as the age increases irrespective of the Pclass.
Fare: a continuous feature
Since Fare is also a continuous feature, we need to convert it into an ordinal value. For this we will use pandas.qcut.
qcut splits the values into quantile-based bins: if we ask for 4 bins, each bin holds roughly the same number of observations, rather than covering an equal range of values.
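To make the distinction concrete, a tiny toy example of my own (the values are made up): qcut produces equal-frequency bins, whereas cut produces equal-width bins.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100]) #hypothetical skewed values
print(pd.qcut(s, 3).value_counts()) #three bins with three values each
print(pd.cut(s, 3).value_counts()) #three equal-width bins; almost all values land in the first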
data['Fare_Range']=pd.qcut(data['Fare'],4)
data.Fare_Range.value_counts()
(2.066, 2.671] 224 (-0.001, 2.066] 223 (2.671, 3.418] 217 (3.418, 6.239] 217 Name: Fare_Range, dtype: int64
data.groupby(['Fare_Range'])['Survived'].mean()
Fare_Range (-0.001, 2.066] 0.197309 (2.066, 2.671] 0.303571 (2.671, 3.418] 0.456221 (3.418, 6.239] 0.594470 Name: Survived, dtype: float64
We can clearly see that as Fare_Range increases, the chance of survival increases.
We cannot pass the Fare_Range intervals to a model as they are, so we convert them into single integer codes, just as we did for Age_band.
data['Fare_cat']=0
#Note: Fare was log-transformed above, so the cut points are the quartile edges of the log fare (taken from Fare_Range)
data.loc[data['Fare']<=2.066,'Fare_cat']=0
data.loc[(data['Fare']>2.066)&(data['Fare']<=2.671),'Fare_cat']=1
data.loc[(data['Fare']>2.671)&(data['Fare']<=3.418),'Fare_cat']=2
data.loc[data['Fare']>3.418,'Fare_cat']=3
sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()
Clearly, as Fare_cat increases, the survival chances increase. This feature may become important during modeling, along with Sex.
Converting String Values into Numeric
We need to convert features such as Sex, Embarked, and Initial into numeric values.
#data['Sex'].replace(['male','female'],[0,1],inplace=True)
#data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
#data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
# Label Encoding
def label_encoder(df, features):
    for f in features:
        le = LabelEncoder()
        le = le.fit(df[f])
        df[f] = le.transform(df[f])
    return df
features = ['Sex', 'Embarked', 'Initial'] #female=0, male=1 #'C', 'Q', 'S' #'Master', 'Miss', 'Mr', 'Mrs', 'Other'
data = label_encoder(data, features)
data.head()
| | Survived | Pclass | Sex | SibSp | Parch | Embarked | Initial | Age_band | Fare_cat | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 1 | 0 | 2 | 2 | 1 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 3 | 2 | 0 | 2 |
| 2 | 1 | 3 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 1 |
| 3 | 1 | 1 | 0 | 1 | 0 | 2 | 3 | 2 | 0 | 2 |
| 4 | 0 | 3 | 1 | 0 | 0 | 2 | 2 | 2 | 0 | 1 |
Adding a few new features
Family_Size and Alone
- Family_Size is the sum of Parch, SibSp and 1 (the passenger themselves). It gives us a combined value with which we can check whether the survival rate has anything to do with the family size of the passengers.
- We can imagine that large families had more difficulty evacuating, looking for their sisters/brothers/parents during the evacuation.
- Alone will denote whether a passenger is alone or not (a sketch for creating it follows the Family_Size code below).
# Family is composed of sibSp, parch, and me (1).
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']+1 #family size
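The Alone flag described above is never actually constructed in the code shown (it does not appear in the outputs either); if you want it, here is a minimal sketch of my own, assuming a new column simply named Alone:
data['Alone'] = 0
data.loc[data['Family_Size'] == 1, 'Alone'] = 1 #1 = travelling alone, 0 = travelling with family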
f,ax=plt.subplots(1, 3, figsize=(40,10))
sns.countplot('Family_Size', data=data, ax=ax[0])
ax[0].set_title('(1) No. Of Passengers Boarded', y=1.02)
sns.countplot('Family_Size', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('(2) Survived countplot depending on FamilySize', y=1.02)
data[['Family_Size', 'Survived']].groupby(['Family_Size'], as_index=True).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[2])
ax[2].set_title('(3) Survived rate depending on FamilySize', y=1.02)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()
- Family_Size = 1 means the passenger is alone. Clearly, if you are alone (Family_Size = 1), your chance of survival is very low.
- Family size seems to play an important role; survival rates are bad for large families (family size > 4).
Removing redundant features
data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'], axis=1, inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
In the correlation plot above we can see some positively related features, such as SibSp with Family_Size and Parch with Family_Size, and some negatively related ones, such as Alone with Family_Size.
Predictive Modeling
We have gained some insights from the EDA part, but with those alone we cannot accurately predict whether a passenger will survive or die. So now we will predict whether a passenger survives using some classic classification algorithms. The algorithms I will use to build the models are:
1) Logistic Regression
2) Support Vector Machines (linear and radial)
3) Random Forest
4) K-Nearest Neighbours
5) Naive Bayes
6) Decision Tree
Running Basic Algorithms
train, test=train_test_split(data, test_size=0.3, random_state=121, stratify=data['Survived'])
The data is split in a stratified fashion, so each subset contains approximately the same percentage of samples of each target class as the full set.
train_X = train.iloc[:,1:]
train_Y = train.iloc[:,0]
test_X = test.iloc[:,1:]
test_Y = test.iloc[:,0]
X = data.iloc[:,1:]
Y = data.iloc[:,0] #data.Survived
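As a quick sanity check (my own sketch, not part of the original notebook), the stratify argument should leave both splits with nearly identical class proportions:
print(train_Y.value_counts(normalize=True)) #share of each class in the training split
print(test_Y.value_counts(normalize=True)) #should be almost identical to the training proportions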
Radial Support Vector Machines(rbf-SVM)
Tuning Hyperparameters
- Kernel: The kernel transforms the input data into the form needed to separate the classes. Common choices are linear, polynomial, and the radial basis function (RBF). Polynomial and RBF kernels are useful when the classes are not linearly separable, because they effectively compute the separating surface in a higher-dimensional space, which can lead to more accurate classifiers.
- Regularization: In scikit-learn the regularization strength is controlled by the C parameter, the penalty on misclassification. It tells the SVM optimization how much error is tolerable and thereby controls the trade-off between a smooth decision boundary and classifying the training points correctly: a small C allows a wider margin at the cost of more training errors, while a large C penalizes misclassification heavily and produces a narrower margin.
- Gamma: A low value of gamma fits the training set loosely, whereas a very high value fits it exactly and causes over-fitting. In other words, with a high gamma only nearby points influence the decision boundary, while with a low gamma points farther away also contribute (a short gamma sweep follows this list).
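To see the effect of gamma in practice, a small sketch of my own (the grid of gamma values is arbitrary) that refits the rbf-SVM for several gamma values and prints the test accuracy of each:
for g in [0.01, 0.1, 0.5, 1.0]: #arbitrary gamma values, for illustration only
    m = svm.SVC(kernel='rbf', C=1, gamma=g)
    m.fit(train_X, train_Y)
    print('gamma = {:<4}  test accuracy = {:.4f}'.format(g, metrics.accuracy_score(m.predict(test_X), test_Y)))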
model=svm.SVC(kernel='rbf',C=1,gamma=0.1) # default C=1
model.fit(train_X,train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is ',metrics.accuracy_score(prediction1,test_Y))
Accuracy for rbf SVM is 0.8188679245283019
Linear Support Vector Machine(linear-SVM)
model=svm.SVC(kernel='linear',C=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
Accuracy for linear SVM is 0.7811320754716982
Logistic Regression
model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction3,test_Y))
The accuracy of the Logistic Regression is 0.7584905660377359
Decision Tree
model=DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction4=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction4,test_Y))
The accuracy of the Decision Tree is 0.7924528301886793
K-Nearest Neighbours(KNN)
model=KNeighborsClassifier()
model.fit(train_X,train_Y)
prediction5=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction5,test_Y))
The accuracy of the KNN is 0.7849056603773585
The accuracy of the KNN model changes as we change the value of the n_neighbors parameter (the default is 5). Let's check the accuracy over several values of n_neighbors.
a_index=list(range(1,11)) #values of n_neighbors to try
a=pd.Series() #empty Series to collect the accuracies
x=[0,1,2,3,4,5,6,7,8,9,10]
for i in a_index:
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_Y)
    prediction=model.predict(test_X)
    a=a.append(pd.Series(metrics.accuracy_score(prediction,test_Y)))
plt.plot(a_index, a)
plt.xticks(x)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
print('Accuracies for different values of n are:',a.values,'with the max value as ',a.values.max())
Accuracies for different values of n are: [0.76603774 0.75849057 0.76981132 0.78113208 0.78490566 0.78113208 0.79622642 0.79245283 0.79622642 0.77735849] with the max value as 0.7962264150943397
Gaussian Naive Bayes
model=GaussianNB()
model.fit(train_X,train_Y)
prediction6=model.predict(test_X)
print('The accuracy of the NaiveBayes is',metrics.accuracy_score(prediction6,test_Y))
The accuracy of the NaiveBayes is 0.8
Random Forests
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_Y)
prediction7=model.predict(test_X)
print('The accuracy of the Random Forests is',metrics.accuracy_score(prediction7,test_Y))
The accuracy of the Random Forests is 0.8150943396226416
Cross Validation
Can we be sure that the accuracies above will hold for every new test set? No: we cannot control which instances the classifier happens to train on, and as the training and testing data change, the accuracy also changes, up or down. This is known as model variance.
To overcome this and get a more reliable estimate of how the model generalizes, we use cross validation.
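As an illustration of what cross_val_score does under the hood, a minimal sketch of my own of 5-fold cross-validation written out by hand with StratifiedKFold (the random_state is arbitrary):
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) #keeps the class ratio in every fold
scores = []
for tr_idx, va_idx in skf.split(X, Y):
    clf = svm.SVC(kernel='rbf')
    clf.fit(X.iloc[tr_idx], Y.iloc[tr_idx]) #train on 4 folds
    scores.append(metrics.accuracy_score(Y.iloc[va_idx], clf.predict(X.iloc[va_idx]))) #evaluate on the held-out fold
print('Mean accuracy over the 5 folds:', np.mean(scores))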
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
mean=[]
accuracy=[]
std=[]
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models=[svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]
for i in models:
    model = i
    cv_result = cross_val_score(model, X, Y, cv = 5, scoring = "accuracy")
    mean.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
new_models_dataframe2=pd.DataFrame({'CV Mean':mean,'Std':std}, index=classifiers)
new_models_dataframe2
| CV Mean | Std | |
|---|---|---|
| Linear Svm | 0.787725 | 0.020357 |
| Radial Svm | 0.830849 | 0.021551 |
| Logistic Regression | 0.787757 | 0.014389 |
| KNN | 0.801374 | 0.023998 |
| Decision Tree | 0.792309 | 0.023999 |
| Naive Bayes | 0.805919 | 0.015594 |
| Random Forest | 0.794581 | 0.023920 |
plt.subplots(figsize=(12,6))
box=pd.DataFrame(accuracy,index=[classifiers])
box.T.boxplot()
accuracy
[array([0.80225989, 0.80681818, 0.78409091, 0.75 , 0.79545455]), array([0.85310734, 0.82386364, 0.81818182, 0.80113636, 0.85795455]), array([0.7740113 , 0.80113636, 0.79545455, 0.76704545, 0.80113636]), array([0.79096045, 0.76704545, 0.80113636, 0.80681818, 0.84090909]), array([0.76836158, 0.80113636, 0.77840909, 0.77840909, 0.83522727]), array([0.79096045, 0.80681818, 0.80113636, 0.79545455, 0.83522727]), array([0.76836158, 0.80113636, 0.79545455, 0.77272727, 0.83522727])]
Classification accuracy alone can be misleading when the classes are imbalanced. A confusion matrix gives a more detailed summary, showing where the model went wrong, i.e. which class it predicted incorrectly.
Confusion Matrix
It gives the number of correct and incorrect classifications made by the classifier.
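As a toy illustration of my own (the labels are made up): the rows of the matrix are the true classes and the columns the predicted classes, so the diagonal holds the correct predictions.
print(confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
#[[1 1]   row for true class 0: one correct, one wrongly predicted as class 1
# [1 2]]  row for true class 1: one wrongly predicted as class 0, two correct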
f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()
Interpreting Confusion Matrix
The main diagonal shows the number of correct predictions for each class, while the off-diagonal entries show the wrong predictions. Consider the first plot, for the rbf-SVM:
1) The number of correct predictions is 491 (dead) + 247 (survived), which corresponds to the roughly 83% cross-validated accuracy we obtained earlier.
2) Errors: it wrongly classified 58 dead passengers as survived and 95 survivors as dead, so it makes more of its mistakes by predicting survivors as dead.
Looking at all the matrices, we can say that rbf-SVM has a better chance of correctly predicting dead passengers, while Naive Bayes has a better chance of correctly predicting the passengers who survived.
Hyper-Parameter Tuning
A machine learning model is like a black box that comes with some default parameter values, which we can tune to get a better model. Parameters such as C and gamma in the SVM, and the analogous settings of the other classifiers, are called hyper-parameters; tuning them changes how the algorithm learns and can yield a better model. This is known as hyper-parameter tuning.
We will tune the hyper-parameters of the two best classifiers, the SVM and Random Forest.
### SVM
from sklearn.model_selection import GridSearchCV
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
hyper={'kernel':kernel,'C':C,'gamma':gamma}
gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,cv=5, verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits 0.8308487416538265 SVC(C=0.6, gamma=0.1)
### Random Forests
n_estimators=range(100,1000,100)
hyper={'n_estimators':n_estimators}
gd=GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 9 candidates, totalling 45 fits 0.7957113507960966 RandomForestClassifier(n_estimators=200, random_state=0)
- The best score for the rbf-SVM is 83.08% with C=0.6 and gamma=0.1. For Random Forest, the best score is about 79.57% with n_estimators=200.
Ensembling
Ensembling is a good way to increase the accuracy or performance of a model. In simple words, it is the combination of several simple models into a single, more powerful model.
Say we want to buy a phone and ask many people about it, each judging it on different criteria; after combining all of those opinions we can make a much stronger judgement about the product. That is ensembling, and it improves the stability of the model. Ensembling can be done in several ways:
1) Voting Classifier
2) Bagging
3) Boosting.
Voting Classifier
It is the simplest way of combining predictions from many different machine learning models. It gives an aggregated prediction based on the predictions of all the submodels, where the submodels (base models) are of different types.
from sklearn.ensemble import VotingClassifier
ensemble_lin_rbf=VotingClassifier(estimators=[('KNN',KNeighborsClassifier(n_neighbors=10)),
('RBF',svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
('RFor',RandomForestClassifier(n_estimators=500,random_state=0)),
('LR',LogisticRegression(C=0.05)),
('DT',DecisionTreeClassifier(random_state=0)),
('NB',GaussianNB()),
('svm',svm.SVC(kernel='linear',probability=True))
],
voting='soft').fit(train_X,train_Y)
print('The accuracy for ensembled model is:',ensemble_lin_rbf.score(test_X,test_Y))
cross=cross_val_score(ensemble_lin_rbf,X,Y, cv = 10,scoring = "accuracy")
print('The cross validated score is',cross.mean())
The accuracy for ensembled model is: 0.8150943396226416 The cross validated score is 0.8195097037793667
Bagging
Bagging is a general ensemble method. It works by fitting the same kind of classifier on multiple bootstrap samples of the dataset (random samples drawn with replacement) and then averaging the predictions. The averaging reduces variance. Unlike the Voting Classifier, bagging uses copies of a single type of classifier.
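To make the bootstrap-and-average idea concrete, a minimal hand-rolled sketch of my own (not the library implementation; the number of rounds is arbitrary) that fits a decision tree on several bootstrap samples and takes a majority vote:
n_bags = 10 #number of bootstrap rounds, chosen arbitrarily for illustration
rng = np.random.RandomState(0)
votes = np.zeros((n_bags, len(test_X)))
for b in range(n_bags):
    idx = rng.choice(len(train_X), size=len(train_X), replace=True) #bootstrap sample, drawn with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(train_X.iloc[idx], train_Y.iloc[idx])
    votes[b] = tree.predict(test_X)
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int) #majority vote over the bootstrap models
print('Hand-rolled bagging accuracy:', metrics.accuracy_score(majority_vote, test_Y))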
Bagged KNN
Bagging works best with models that have high variance, such as decision trees. Here we use KNN with a small value of n_neighbors, since a small n_neighbors makes KNN a high-variance model.
from sklearn.ensemble import BaggingClassifier
model=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())
The accuracy for bagged KNN is: 0.8150943396226416 The cross validated score for bagged KNN is: 0.8070863125638408
Bagged DecisionTree
model=BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
The accuracy for bagged Decision Tree is: 0.8113207547169812 The cross validated score for bagged Decision Tree is: 0.7991445352400408
Boosting
Boosting is an ensembling technique that trains classifiers sequentially; it is a step-by-step enhancement of a weak model. Boosting works as follows:
A model is first trained on the complete dataset, getting some instances right and some wrong. In the next iteration, the learner focuses more on the wrongly predicted instances by giving them more weight, so that it tries to get them right. This iterative process continues, adding new classifiers, until the limit on the number of estimators (or the desired accuracy) is reached.
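To illustrate the re-weighting step, here is a minimal sketch of my own of a single AdaBoost round (labels assumed to be coded as -1/+1; this is not the exact scikit-learn implementation):
def adaboost_round(X_mat, y_pm, sample_weight):
    #y_pm must be coded as -1/+1; assumes the weighted error is strictly between 0 and 1
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_mat, y_pm, sample_weight=sample_weight)
    pred = stump.predict(X_mat)
    err = np.sum(sample_weight * (pred != y_pm)) / np.sum(sample_weight) #weighted error of this stump
    alpha = 0.5 * np.log((1 - err) / err) #the stump's weight in the final vote
    new_weight = sample_weight * np.exp(-alpha * y_pm * pred) #misclassified samples get larger weights
    return stump, alpha, new_weight / new_weight.sum()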
AdaBoost(Adaptive Boosting)
The default weak learner (base estimator) here is a decision tree, but we can change the base_estimator to any algorithm of our choice.
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
The cross validated score for AdaBoost is: 0.825191521961185
Stochastic Gradient Boosting
Here too the weak learner is a Decision Tree.
from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
The cross validated score for Gradient Boosting is: 0.8104954034729316
XGBoost
import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result=cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for XGBoost is:',result.mean())
The cross validated score for XGBoost is: 0.8002808988764045
AdaBoost gives the highest cross-validated accuracy of the three boosting methods. We will try to increase it further with hyper-parameter tuning.
Hyper-Parameter Tuning for AdaBoost
n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(), param_grid=hyper, verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 5 folds for each of 120 candidates, totalling 600 fits 0.8308487416538265 AdaBoostClassifier(learning_rate=0.05, n_estimators=200)
The maximum accuracy we can get with AdaBoost here is 83.08%, with n_estimators=200 and learning_rate=0.05.
Confusion Matrix for the Best Model
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
result=cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap='RdYlGn',annot=True,fmt='2.0f')
plt.show()
Feature Importance
f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_, X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()
We can see which features matter for the various classifiers, such as Random Forest and AdaBoost.
Observations:
1) Some of the commonly important features are Initial, Fare_cat, Pclass and Family_Size.
2) The Sex feature doesn't appear to matter much, which is surprising given that earlier we saw Sex combined with Pclass was a very good differentiating factor; Sex looks important only in Random Forest.
However, the Initial feature sits at or near the top for many classifiers, and we have already seen that Initial is strongly related to Sex, so both effectively encode gender.
3) Similarly, Pclass and Fare_cat both reflect the status of the passengers, and Family_Size summarizes Alone, Parch and SibSp.
I hope you gained some insights into machine learning along the way. Some other great notebooks on this dataset are:
- For Python: Pytanic by Heads and Tails
- For Python: Introduction to Ensembling/Stacking by Anisotropic