Classification

Naive Bayes Classifiers

Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models, but they tend to be even faster in training.

  • The reason naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature.

  • GaussianNB can be applied to any continuous data, while BernoulliNB assumes binary data and MultinomialNB assumes count data (that is, that each feature represents an integer count of something, like how often a word appears in a sentence).

  • MultinomialNB takes into account the average value of each feature for each class, while GaussianNB stores the average value as well as the standard deviation of each feature for each class.

  • MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity. A large alpha means more smoothing, resulting in less complex models (illustrated in the sketch after this list).

  • GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text.

  • The naive Bayes models share many of the strengths and weaknesses of the linear models. They are very fast to train and to predict, and the training procedure is easy to understand. The models work very well with high-dimensional sparse data and are relatively robust to the parameters.
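
To make these points concrete, here is a minimal sketch (a made-up toy example, not taken from the discussion above) that fits GaussianNB to continuous data and MultinomialNB to small count data; the data and the alpha value are purely illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# continuous features -> GaussianNB stores the per-class mean and variance of each feature
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.1], [3.0, 0.3]])
y = np.array([0, 0, 1, 1])
gnb = GaussianNB().fit(X_cont, y)
print(gnb.theta_)                      # per-class feature means
print(gnb.predict([[1.1, 2.0]]))

# count features (e.g. word counts) -> MultinomialNB; alpha controls the amount of smoothing
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y)
print(mnb.predict([[1, 0, 0], [0, 2, 1]]))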

Decision trees

Decision trees are widely used models for classification and regression tasks. Essentially,

they learn a hierarchy of if/else questions, leading to a decision.

  • Root node: the whole dataset.

  • Decision node: each internal node represents a question (a test on a feature).

  • Leaf node: a terminal node that holds the final prediction.

Learning a decision tree means learning the sequence of if/else questions that gets us to the true answer most quickly.

Pros and Cons

  • Advantage? :

    • The resulting model can easily be visualized and understood by nonexperts.

    • The algorithms are completely invariant to scaling of the data.

      • As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.
  • Disadvantage? : Overfitting

    • The presence of pure leaves means that a tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct majority class.

    • Even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. In most applications, ensemble methods are therefore used instead.

  • Strategies to prevent overfitting

    • Pre-pruning: stopping the creation of the tree early

      • limiting the maximum depth of the tree, limiting the maximum number of leaves, requiring a minimum number of points in a node to keep splitting it (max_depth, max_leaf_nodes, or min_samples_leaf).

      • If we don’t restrict the depth of a decision tree, the tree can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well to new data.

    • Post-pruning: building the tree but then removing or collapsing nodes that contain little information.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mglearn

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.937

Let’s apply pre-pruning to the tree, which will stop developing the tree before we perfectly fit to the training data. Limiting the depth of the tree decreases overfitting.

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 0.988
Accuracy on test set: 0.951
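
Post-pruning was listed above but is not demonstrated here; below is a minimal sketch using scikit-learn's cost-complexity pruning (the ccp_alpha parameter). The alpha picked here is arbitrary, purely for illustration; in practice you would choose it with cross-validation.

from sklearn.tree import DecisionTreeClassifier

# compute the effective alphas of the pruning path on the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # an arbitrary mid-range alpha
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print("Pruned tree depth: {}".format(pruned.get_depth()))
print("Accuracy on test set: {:.3f}".format(pruned.score(X_test, y_test)))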

Feature importance in trees

  • It is a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means “perfectly predicts the target.”

  • The feature importances always sum to 1.

print("Feature importances:\n{}".format(tree.feature_importances_))
Feature importances:
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.01019737 0.04839825
 0.         0.         0.0024156  0.         0.         0.
 0.         0.         0.72682851 0.0458159  0.         0.
 0.0141577  0.         0.018188   0.1221132  0.01188548 0.        ]
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
    
plot_feature_importances_cancer(tree)

The feature used in the top split (“worst radius”) is by far the most important feature. This confirms our observation in analyzing the tree that the first level already separates the two classes fairly well.

However, if a feature has a low value in feature_importances_, it doesn’t mean that this feature is uninformative.

It only means that the feature was not picked by the tree, likely because another feature encodes the same information.

Ensembles of Decision Trees

Ensembles are methods that combine multiple machine learning models to create more powerful models.

Random forests

Random forests are one way to address the overfitting problem of decision trees. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others.

The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown using rigorous mathematics.

Pros and Cons

  • Advantage? :

    • They are very powerful, often work well without heavy tuning of the parameters, and don’t require scaling of the data.

    • While building random forests on large datasets might be somewhat time consuming, it can be parallelized across multiple CPU cores within a computer easily. You can set n_jobs=-1 to use all the cores in your computer.

  • Disadvantage?

    • Random forests usually work well even on very large datasets, and training can easily be parallelized over many CPU cores within a powerful computer. However, random forests require more memory and are slower to train and to predict than linear models.

Random forests don’t tend to perform well on very high dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate.

Process: Bagging (Bootstrap Aggregating)

  • To build a random forest model, you need to decide on the number of trees to build (the n_estimators parameter). Let’s say we build 10 trees.

  • To build a tree, we first take what is called a bootstrap sample of our data. From our n_samples data points, we repeatedly draw an example randomly with replacement (meaning the same sample can be picked multiple times), n_samples times (a small sketch of this step appears below).

  • A decision tree is then built on this newly created dataset.

  • Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and it looks for the best possible test involving one of these features. The number of features that are selected is controlled by the max_features parameter.

A critical parameter in this process is max_features. If we set max_features to n_features, that means that each split can look at all features in the dataset, and no randomness will be injected in the feature selection (the randomness due to the bootstrapping remains, though).

  • A high max_features means that the trees in the random forest will be quite similar, and they will be able to fit the data easily, using the most distinctive features.

  • A low max_features means that the trees in the random forest will be quite different, and that each tree might need to be very deep in order to fit the data well.
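
Below is a minimal sketch of the bootstrap-sampling step in plain NumPy (for illustration only; scikit-learn performs this internally when building each tree), reusing the breast-cancer training split from above.

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = X_train.shape

# bootstrap sample: draw n_samples indices with replacement
boot_idx = rng.randint(0, n_samples, size=n_samples)
X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]
print("unique samples in the bootstrap: {}/{}".format(len(np.unique(boot_idx)), n_samples))

# at each split, only a random subset of the features would be considered
max_features = int(np.sqrt(n_features))
print("candidate features for one split:", rng.choice(n_features, size=max_features, replace=False))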

Prediction?

  • To make a prediction using the random forest, the algorithm first makes a prediction for every tree in the forest.

  • For regression, we can average these results to get our final prediction.

  • For classification, a soft voting strategy is used. Each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest average probability is predicted (this averaging is checked by hand in the sketch below).
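
A small check of that soft-voting step: averaging the per-tree class probabilities reproduces the forest's own predict_proba. The 10-tree forest and the reuse of the earlier breast-cancer split are purely illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# average the per-tree class probabilities ("soft" predictions) ...
tree_probs = np.array([t.predict_proba(X_test) for t in rf.estimators_])
avg_probs = tree_probs.mean(axis=0)

# ... which is exactly what the forest's predict_proba / predict return
print(np.allclose(avg_probs, rf.predict_proba(X_test)))
print(np.array_equal(avg_probs.argmax(axis=1), rf.predict(X_test)))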

Parameters

  • n_estimators: larger is always better. Averaging more trees will yield a more robust ensemble by reducing overfitting. However, there are diminishing returns, and more trees need more memory and more time to train.

    • A common rule of thumb is to build “as many as you have time/memory for.”
  • max_features: determines how random each tree is. A smaller max_features reduces overfitting.

    • A good rule of thumb is to use the default values: max_features=sqrt(n_features) for classification and max_features=n_features for regression.
  • Adding max_features or max_leaf_nodes might sometimes improve performance. It can also drastically reduce space and time requirements for training and prediction.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
random_state=42)
forest = RandomForestClassifier(n_estimators=5, random_state=2, n_jobs=-1)
forest.fit(X_train, y_train)
RandomForestClassifier(n_estimators=5, n_jobs=-1, random_state=2)
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
    ax.set_title("Tree {}".format(i))
    mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)

mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1],
alpha=.4)
axes[-1, -1].set_title("Random Forest")
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
[<matplotlib.lines.Line2D at 0x7f957940cf10>,
 <matplotlib.lines.Line2D at 0x7f957940ca60>]

  • The decision boundaries learned by the five trees are quite different.

  • Each of them makes some mistakes, as some of the training points that are plotted here were not actually included in the training sets of the trees, due to the bootstrap sampling.

  • The random forest overfits less than any of the trees individually, and provides a much more intuitive decision boundary.

X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))
plot_feature_importances_cancer(forest)
Accuracy on training set: 1.000
Accuracy on test set: 0.972

Similarly to the single decision tree, the random forest also gives a lot of importance to the “worst radius” feature, but it actually chooses “worst perimeter” to be the most informative feature overall. The randomness in building the random forest forces the algorithm to consider many possible explanations, the result being that the random forest captures a much broader picture of the data than a single tree.

GBM (Gradient boosting machines)

The gradient boosted regression tree is another ensemble method that combines multiple decision trees to create a more powerful model.

In contrast to the random forest approach, gradient boosting works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one.

The main idea behind gradient boosting is to combine many simple models (in this context known as weak learners), like shallow trees.

Each tree can only provide good predictions on part of the data, and so more and more trees are added to iteratively improve performance.

They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
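
To make the serial correction idea concrete, here is a toy sketch for least-squares regression: each new shallow tree is fit to what the current ensemble still gets wrong. (GradientBoostingClassifier uses a different loss, and the data and settings below are made up for illustration.)

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.zeros_like(y)                      # start from a constant (zero) prediction
for _ in range(100):
    residual = y - pred                      # the mistakes of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)    # a weak learner
    pred += learning_rate * tree.predict(X)  # each tree corrects the previous mistakes a little

print("training MSE after boosting: {:.4f}".format(np.mean((y - pred) ** 2)))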

Pros and Cons

  • Advantage? :

    • Gradient boosted decision trees are among the most powerful and widely used models for supervised learning.

    • Similarly to other tree-based models, the algorithm works well without scaling and on a mixture of binary and continuous features.

  • Disadvantage?

    • They require careful tuning of the parameters and may take a long time to train.

    • As with other tree-based models, it also often does not work well on high-dimensional sparse data.

Process

  • As both gradient boosting and random forests perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly.

  • If random forests work well but prediction time is at a premium, or it is important to squeeze out the last percentage of accuracy from the machine learning model, moving to gradient boosting often helps.

  • If you want to apply gradient boosting to a large-scale problem, it might be worth looking into the xgboost package and its Python interface which at the time of writing is faster (and sometimes easier to tune) than the scikit-learn implementation of gradient boosting on many datasets.

Parameter

  • Another important parameter of gradient boosting is the learning_rate, which controls how strongly each tree tries to correct the mistakes of the previous trees.

    • A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
  • n_estimators (The number of trees) also increases the model complexity, as the model has more chances to correct mistakes on the training set.

  • These two parameters are highly interconnected, as a lower learning_rate means that more trees are needed to build a model of similar complexity.

  • In contrast to random forests, where a higher n_estimators value is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting.

  • A common practice is to fit n_estimators depending on the time and memory budget, and then search over different learning_rates.

  • Another important parameter is max_depth (or alternatively max_leaf_nodes), which reduces the complexity of each tree (max_depth is usually not deeper than five splits).

from sklearn.ensemble import GradientBoostingClassifier
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.965

As the training set accuracy is 100%, we are likely to be overfitting. To reduce overfitting, we could either apply stronger pre-pruning by limiting the maximum depth or lower the learning rate:

gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 0.991
Accuracy on test set: 0.972
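
The alternative mentioned next, lowering the learning rate, is not run above; a minimal sketch of that variant follows (learning_rate=0.01 is just an illustrative value, and the exact scores will depend on your scikit-learn version).

gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))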

In this case, lowering the maximum depth of the trees provided a significant improvement of the model, while lowering the learning rate only increased the generalization performance slightly. As we used 100 trees, it is impractical to inspect them all, even if they are all of depth 1.

gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
plot_feature_importances_cancer(gbrt)

The feature importances of the gradient boosted trees are somewhat similar to the feature importances of the random forests, though gradient boosting completely ignored some of the features.

XGBoost

XGBoost (eXtreme Gradient Boosting) optimizes the performance of algorithms, primarily decision trees, in a gradient boosting framework while minimizing overfitting and bias through regularization.

It is arguably one of the most powerful algorithms and is increasingly being used across industries and problem domains. It is also a winning algorithm in many machine learning competitions; in fact, XGBoost was used in 17 out of 29 winning solutions in Kaggle data science competitions.

A key to its performance is its hyperparameters. While XGBoost is extremely easy to implement, the hard part is tuning the hyperparameters.

Important features of XGBoost include:

  • parallel processing capabilities for large datasets

  • can handle missing values

  • can handle imbalanced datasets

  • allows for regularization to prevent overfitting

  • has built-in cross-validation

  • Early stopping: an approach to training complex machine learning models that helps avoid overfitting. It works by stopping the training procedure once performance on a held-out evaluation dataset has not improved for a fixed number of training iterations.

Pros and Cons

Pros:

A large and growing community of data scientists globally who actively contribute to XGBoost open source development.

Usable on a wide range of applications, including regression, classification, and user-defined prediction problems.

A library that was built from the ground up to be efficient, flexible, and portable.

Trees are grown greedily and then pruned back automatically using the regularization terms, which makes severe overfitting less common than with plain GBM.

Cons:

Although the speed problem that is a disadvantage of GBM has been addressed to some extent, training can still be slow.

Hyperparameter tuning with GridSearchCV can be very slow.

Parameters

Three types: general parameters (guide the overall functioning), booster parameters (guide the individual booster, tree or linear, at each step), and task parameters (guide the optimization performed).

Frequently tuned hyperparameters: you will almost always tune the following parameters to optimize model performance.

  • n_estimators: the number of decision trees to be boosted. If n_estimators = 1, only one tree is generated, so no boosting is at work. The default value is 100, but you can play with this number for optimal performance.

  • subsample: the subsample ratio of the training sample (for each tree). subsample = 0.5 means that 50% of the training data is used prior to growing a tree. The value can be any fraction; the default value is 1.

  • max_depth (default: 6): it limits how deep each tree can grow. The default value is 6, but you can try other values if overfitting is an issue in your model.

  • learning_rate (alias: eta): it is a regularization parameter that shrinks feature weights in each boosting step. The default value is 0.3 but people generally tune with values such as 0.01, 0.1, 0.2 etc.

  • gamma (alias: min_split_loss): another regularization parameter for tree pruning. It specifies the minimum loss reduction required to make a further split. The default value is 0.

  • reg_alpha (alias: alpha): L1 regularization parameter. Default is 0.

  • reg_lambda (alias: lambda): L2 regularization parameter. Default is 1.

Special use hyperparameters

  • scale_pos_weight: this parameter is useful when you have an imbalanced dataset, particularly in classification problems where the proportion of one class is a small fraction of total observations (e.g. credit card fraud). The default value is 1, but you can use the ratio: total negative instances (e.g. no-fraud) / total positive instances (e.g. fraud), as in the sketch after this list.

  • monotone_constraints: you can use this parameter to constrain the relationship between a predictor and the prediction to be monotonic, for example a non-decreasing likelihood of credit-loan approval with a higher credit score.

  • booster: you can choose which kind of booster to use. You have three options: ‘dart’, ‘gbtree’ (tree-based), and ‘gblinear’ (regularized linear models).

  • missing: it’s not missing-value treatment exactly; rather, it specifies under what circumstances the algorithm should treat a value as missing (e.g. a negative value for a customer’s age is impossible, so such a value can be declared missing).

  • eval_metric: it specifies the evaluation metric to monitor, e.g. MAE, MSE, RMSE for regression and log loss for classification.
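
As a hedged sketch of how a few of these parameters might be passed through the scikit-learn wrapper XGBClassifier (the values are illustrative, X_train/y_train refer to the breast-cancer split used earlier, and the class ratio follows the rule given above for scale_pos_weight):

import numpy as np
from xgboost import XGBClassifier

neg, pos = np.bincount(y_train)       # counts of the 0 and 1 labels
clf = XGBClassifier(n_estimators=200,
                    max_depth=4,
                    learning_rate=0.1,
                    subsample=0.8,
                    reg_lambda=1.0,
                    scale_pos_weight=neg / pos,   # negative/positive ratio
                    eval_metric='logloss')
clf.fit(X_train, y_train)
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))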

Visualization about XGBoost

  • xgboost.plot_importance(model): plots the importance of each feature

  • xgboost.plot_tree(model): plots an individual boosted decision tree


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import xgboost as xgb
from xgboost import plot_importance
from xgboost import XGBClassifier

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score
import warnings
warnings.filterwarnings('ignore')
# check the installed XGBoost version
print(xgb.__version__)
1.7.5
dataset = load_breast_cancer()
X_features = dataset.data
y_label = dataset.target
cancer_df = pd.DataFrame(data=X_features,columns=dataset.feature_names)
cancer_df['target'] = y_label  # add the target label as the last column
cancer_df.head(3)
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0

3 rows × 31 columns

print(dataset.target_names)
print(cancer_df['target'].value_counts())  # benign = 1, malignant = 0
['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64
# Split data into Train (80%)/Test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)
print(X_train.shape, X_test.shape)
(455, 30) (114, 30)

As an extra step, we need to store the data in a DMatrix object, the data structure required by the native XGBoost API.

dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

XGBoost supports a suite of evaluation metrics not limited to:

  • rmse for root mean squared error.

  • mae for mean absolute error.

  • logloss for binary logarithmic loss and “mlogloss” for multi-class log loss (cross entropy).

  • error for classification error.

  • auc for area under ROC curve.

params = {'max_depth': 3,
         'eta': 0.1,
         'objective': 'binary:logistic',
         'eval_metric': 'logloss',
         'early_stoppings': 100  # not a recognized XGBoost parameter (see the warning below); early stopping is actually requested via early_stopping_rounds in xgb.train()
         }
num_rounds = 400

We define our test set as ‘eval’ to use early stopping.

wlist = [(dtrain,'train'),(dtest,'eval')]
# passed as parameters to the xgb.train() function
xgb_model = xgb.train(params = params, dtrain=dtrain, num_boost_round=num_rounds, \
                     early_stopping_rounds=100, evals=wlist)
[16:10:09] WARNING: /Users/runner/work/xgboost/xgboost/python-package/build/temp.macosx-10.9-x86_64-cpython-38/xgboost/src/learner.cc:767: 
Parameters: { "early_stoppings" } are not used.

[0]	train-logloss:0.60969	eval-logloss:0.61352
[1]	train-logloss:0.54080	eval-logloss:0.54784
[2]	train-logloss:0.48375	eval-logloss:0.49425
[3]	train-logloss:0.43446	eval-logloss:0.44799
[4]	train-logloss:0.39055	eval-logloss:0.40911
[5]	train-logloss:0.35415	eval-logloss:0.37498
... (rounds [6]-[308] omitted: eval-logloss reaches its minimum of 0.08559 at round [211], and early stopping ends the run at round [310])
[309]	train-logloss:0.00546	eval-logloss:0.08588
[310]	train-logloss:0.00546	eval-logloss:0.08592
# sklearn-style predict() returns class labels (0 or 1)
# the native xgboost Booster.predict() returns probabilities, so threshold them at 0.5
pred_probs = xgb_model.predict(dtest)
preds = [1 if x > 0.5 else 0 for x in pred_probs]
print(preds[:10])
[1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
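
For comparison, below is a sketch of the same run through the scikit-learn wrapper XGBClassifier that was imported above; passing early_stopping_rounds in the constructor is supported in XGBoost 1.6+ (the version printed earlier is 1.7.5), and the scores may differ slightly from the native-API run.

xgb_clf = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3,
                        objective='binary:logistic', eval_metric='logloss',
                        early_stopping_rounds=100)
xgb_clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
wrapper_preds = xgb_clf.predict(X_test)               # class labels directly (0 or 1)
wrapper_probs = xgb_clf.predict_proba(X_test)[:, 1]   # positive-class probabilities
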
def get_clf_eval(y_test, pred=None, pred_prob=None):
    confusion = confusion_matrix(y_test,pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test,pred)
    roc_auc = roc_auc_score(y_test, pred_prob)
    
    print('confusion matrix')
    print(confusion)
    print('accuracy: {0:.4f}, precision: {1:.4f}, recall(sensitivity): {2:.4f}, f1_score: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
get_clf_eval(y_test, preds, pred_probs)
confusion matrix
[[35  2]
 [ 1 76]]
accuracy: 0.9737, precision: 0.9744, recall(sensitivity): 0.9870, f1_score: 0.9806, AUC: 0.9951
# the native API names features f0, f1, ... (f0 = the first feature, f1 = the second feature)
fig, ax = plt.subplots(figsize=(10,12))
plot_importance(xgb_model, ax=ax)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

LightGBM

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.

Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise (level-wise) rather than leaf-wise.

So, holding the number of leaves fixed, the leaf-wise algorithm can reduce more loss than the level-wise algorithm.

XGBoost has become a de facto algorithm for winning competitions at Kaggle, simply because it is extremely powerful. However, XGBoost takes a long time to train (its execution time is slower compared to that of LightGBM).

Pros and Cons

Pros:

Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.

Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.

Often better accuracy than other boosting algorithms: it produces more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.

Compatibility with large datasets: it is capable of performing equally well on large datasets, with a significant reduction in training time compared to XGBoost.

Cons:

It is prone to overfitting if we have only a little data (fewer than roughly 10,000 samples).

Structural Differences in LightGBM & XGBoost

LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances used for finding a split value, while XGBoost uses pre-sorted and histogram-based algorithms for computing the best split. Here, instances are observations/samples.

Parameters

Tuning for overfitting

num_iterations: Same as n_estimators in XGBoost.

num_leaves: the number of leaf nodes to use. Having a large number of leaves will improve accuracy, but will also lead to overfitting.

min_child_samples: the minimum number of samples (data) to group into a leaf. The parameter can greatly assist with overfitting: larger sample sizes per leaf will reduce overfitting (but may lead to under-fitting).

max_depth (default: -1): controls the depth of the tree explicitly. Shallower trees reduce overfitting. -1 means unlimited. Since LightGBM splits the tree leaf-wise with the best fit, its trees tend to be deeper than those of other algorithms.

The other tunings

max_bin: the maximum number of bins that feature values are bucketed into. A smaller max_bin reduces overfitting.

min_child_weight: the minimum sum of hessians for a leaf. In conjunction with min_child_samples, larger values reduce overfitting.

bagging_fraction and bagging_freq: enables bagging (subsampling) of the training data. Both values need to be set for bagging to be used. The frequency controls how often (iteration) bagging is used. Smaller fractions and frequencies reduce overfitting.

feature_fraction: controls the subsampling of features used for training (as opposed to subsampling the actual training data in the case of bagging). Smaller fractions reduce overfitting.

lambda_l1 and lambda_l2: controls L1 and L2 regularization.

Tuning for imbalanced data

scale_pos_weight: the weight can be calculated based on the number of negative and positive examples: scale_pos_weight = number of negative samples / number of positive samples.

Tuning for accuracy

max_bin: a larger max_bin increases accuracy.

learning_rate: using a smaller learning rate and increasing the number of iterations may improve accuracy.

num_leaves: increasing the number of leaves increases accuracy with a high risk of overfitting.
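
Before the default-parameter run below, here is a hedged sketch of how a few of these knobs might be passed to LGBMClassifier; the values are illustrative only, and the native LightGBM names (feature_fraction, bagging_fraction, bagging_freq) may trigger alias warnings next to their scikit-learn-style equivalents.

from lightgbm import LGBMClassifier

lgbm_tuned = LGBMClassifier(n_estimators=400,       # num_iterations
                            num_leaves=31,
                            min_child_samples=20,
                            max_depth=-1,            # unlimited; control complexity via num_leaves
                            learning_rate=0.05,
                            feature_fraction=0.8,
                            bagging_fraction=0.8,
                            bagging_freq=1,
                            reg_lambda=1.0)          # lambda_l2
# lgbm_tuned.fit(...) would then be called exactly as in the example below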

from lightgbm import LGBMClassifier

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
ftr = dataset.data
target = dataset.target
X_train, X_test, y_train, y_test = train_test_split(ftr, target, test_size=0.2, random_state=140)

lgbm_wrapper = LGBMClassifier(n_estimators = 400)

evals = [(X_test, y_test)]
lgbm_wrapper.fit(X_train, y_train, early_stopping_rounds=100, eval_metric='logloss',
                eval_set=evals, verbose=True)
preds = lgbm_wrapper.predict(X_test)
pred_proba = lgbm_wrapper.predict_proba(X_test)[:,1]
[1]	valid_0's binary_logloss: 0.584137
[2]	valid_0's binary_logloss: 0.531865
[3]	valid_0's binary_logloss: 0.489679
[4]	valid_0's binary_logloss: 0.445948
[5]	valid_0's binary_logloss: 0.412324
... (rounds [6]-[132] omitted: valid_0's binary_logloss reaches its minimum of 0.15636 at round [34], and early stopping ends the run at round [134])
[133]	valid_0's binary_logloss: 0.306694
[134]	valid_0's binary_logloss: 0.307242
get_clf_eval(y_test, preds, pred_proba)
confusion matrix
[[36  4]
 [ 2 72]]
accuracy: 0.9474, precision: 0.9474, recall(sensitivity): 0.9730, f1_score: 0.9600, AUC: 0.9807
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize = (10, 12))
plot_importance(lgbm_wrapper, ax = ax)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

Kernelized Support Vector Machines

Kernelized support vector machines (often just referred to as SVMs) are an extension of linear support vector machines that allows for more complex models that are not defined simply by hyperplanes in the input space.

  • Classification case implemented in SVC / Regression case implemented in SVR.

The kernel trick

Adding nonlinear features to the representation of our data can make linear models much more powerful. However, often we don’t know which features to add, and adding many features (like all possible interactions in a 100-dimensional feature space) might make computation very expensive.

The kernel trick works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion.

Two ways of the kernel

  • the polynomial kernel: computes all possible polynomials up to a certain degree of the original features (like feature1^2 * feature2^5); see the numerical sketch after this list.

  • the radial basis function (RBF) kernel (Gaussian kernel): Considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.
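
A small numerical check of the kernel-trick idea for a degree-2 polynomial kernel: the kernel value (1 + x·z)^2 equals the dot product of an explicit feature expansion that never has to be built. The feature map phi below is a standard textbook construction, written out purely for illustration.

import numpy as np

def phi(x):
    # explicit degree-2 feature map for a 2-dimensional input
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # the kernel computes the same inner product without ever building phi
    return (1.0 + np.dot(x, z)) ** 2

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.dot(phi(x), phi(z)))   # expansion, then dot product
print(poly_kernel(x, z))        # kernel trick: the same number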

Pros and Cons

  • Advantage? :

    • SVMs allow for complex decision boundaries, even if the data has only a few features.

    • They work well on low-dimensional and high-dimensional data (i.e., few and many features).

  • Disadvantage?

    • Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no preprocessing) in many applications.

    • SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.

Understanding SVMs

During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes.

Only a subset of the training points matters for defining the decision boundary: the ones that lie on the border between the classes (the support vectors).

  • To make a prediction for a new point, the distance to each of the support vectors is measured.

  • A classification decision is made based on the distances to the support vectors and the importance of the support vectors that was learned during training (stored in the dual_coef_ attribute of SVC).

  • The distance between data points is measured by the Gaussian kernel:

\[k_{rbf}(x_1, x_2) = \exp(-\gamma \| x_1 - x_2 \|^2)\]

where $\gamma$ is a parameter that controls the width of the Gaussian kernel.
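
A quick numerical check of this formula against scikit-learn's rbf_kernel (gamma=0.1 is just an example value):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.5, 0.5]])
gamma = 0.1

manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))   # the formula above
print(manual)
print(rbf_kernel(x1, x2, gamma=gamma)[0, 0])       # same value from scikit-learn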

from sklearn.svm import SVC
X, y = mglearn.tools.make_handcrafted_dataset()
svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
mglearn.plots.plot_2d_separator(svm, X, eps=.5)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
# plot support vectors
sv = svm.support_vectors_
# class labels of support vectors are given by the sign of the dual coefficients
sv_labels = svm.dual_coef_.ravel() > 0
mglearn.discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
Text(0, 0.5, 'Feature 1')

  • The decision boundary is shown in black, and the support vectors are larger points with the wide outline.

  • In this case, the SVM yields a very smooth and nonlinear (not a straight line) boundary.

  • We adjusted two parameters here: the C parameter and the gamma parameter.

Parameters

  • Gamma: the inverse of the width of the Gaussian kernel.

    • The wider the radius of the Gaussian kernel, the further the influence of each training example.

    • A small gamma means a large radius for the Gaussian kernel, which means that many points are considered close by. This is reflected in very smooth decision boundaries.

  • C: A regularization parameter similar to that used in the linear models. It limits the importance of each point.

    • As with the linear models, a small C means a very restricted model, where each data point can only have very limited influence.

    • Small C: the decision boundary looks nearly linear, with the misclassified points barely having any influence on the line.

    • Large C: it allows these points to have a stronger influence on the model and makes the decision boundary bend to correctly classify them.

Gamma and C both control the complexity of the model, with large values in either resulting in a more complex model.

fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for ax, C in zip(axes, [-1, 0, 3]):
    for a, gamma in zip(ax, range(-1, 2)):
        mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"],
ncol=4, loc=(.9, 1.2))
<matplotlib.legend.Legend at 0x7f9598677d90>

X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
Accuracy on training set: 0.90
Accuracy on test set: 0.94
plt.boxplot(X_train)
plt.yscale("symlog")
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
Text(0, 0.5, 'Feature magnitude')

From this plot we can determine that the features in the Breast Cancer dataset are of completely different orders of magnitude. This can have devastating effects for the kernel SVM.

Preprocessing data for SVMs

A common rescaling method for kernel SVMs is to scale the data such that all features are between 0 and 1: the MinMaxScaler preprocessing method.

# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))

# use THE SAME transformation on the test set,
# using min and range of the training set 
X_test_scaled = (X_test - min_on_training) / range_on_training

svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Minimum for each feature
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
Maximum for each feature
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Accuracy on training set: 0.984
Accuracy on test set: 0.972

Scaling the data made a huge difference!
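
The same rescaling can be written more compactly with MinMaxScaler; a short sketch that should reproduce the manual computation above:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)        # learn min and range on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse the same transformation on the test set

svc = SVC().fit(X_train_scaled, y_train)
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))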

svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
# In this case, increasing C does not improve the model; test accuracy drops to 95.8%.
Accuracy on training set: 1.000
Accuracy on test set: 0.958
