Classification
Naive Bayes Classifiers
Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models, but they tend to be even faster in training.
The reason naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collect simple per-class statistics from each feature.
- GaussianNB can be applied to any continuous data, while BernoulliNB assumes binary data and MultinomialNB assumes count data (that is, that each feature represents an integer count of something, like how often a word appears in a sentence).
- MultinomialNB takes into account the average value of each feature for each class, while GaussianNB stores the average value as well as the standard deviation of each feature for each class.
- MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity. A large alpha means more smoothing, resulting in less complex models.
- GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text.
The naive Bayes models share many of the strengths and weaknesses of the linear models. They are very fast to train and to predict, and the training procedure is easy to understand. The models work very well with high-dimensional sparse data and are relatively robust to the parameters.
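To make these distinctions concrete, here is a minimal usage sketch of the three variants. The data is random, so the scores mean nothing; the point is only which estimator goes with which kind of feature, with alpha left at its default.
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
rng = np.random.RandomState(0)
y_toy = rng.randint(2, size=100)
X_cont = rng.normal(size=(100, 4))       # continuous features -> GaussianNB
X_bin = rng.randint(2, size=(100, 4))    # binary features     -> BernoulliNB
X_count = rng.poisson(3, size=(100, 4))  # count features      -> MultinomialNB
for model, X_toy in [(GaussianNB(), X_cont), (BernoulliNB(), X_bin), (MultinomialNB(), X_count)]:
    print(type(model).__name__, "train accuracy: {:.2f}".format(model.fit(X_toy, y_toy).score(X_toy, y_toy)))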
Decision trees
Decision trees are widely used models for classification and regression tasks. Essentially,
they learn a hierarchy of if/else questions, leading to a decision.
- Root node: the whole dataset.
- Decision node: each internal node in the tree represents a question (a test on a feature).
- Leaf node: a terminal node.
Learning a decision tree means learning the sequence of if/else questions that gets us to the true answer most quickly.
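Conceptually, a learned tree is nothing more than nested if/else tests on the features. The hand-written tree below is purely hypothetical; the feature indices and thresholds are made up to illustrate the structure, not learned from data.
# Hypothetical two-level decision tree: made-up thresholds, for illustration only.
def predict_one(x):
    if x[0] <= 2.5:           # question at the root node
        if x[1] <= 1.0:       # question at an internal decision node
            return "class A"  # leaf node
        return "class B"      # leaf node
    return "class B"          # leaf node
print(predict_one([1.7, 0.4]))  # -> "class A"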
Pros and Cons
Advantages:
- The resulting model can easily be visualized and understood by nonexperts.
- The algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.
Disadvantage: overfitting
- The presence of pure leaves means that a tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct majority class.
- Even with the use of pre-pruning, decision trees tend to overfit and provide poor generalization performance. In most applications, ensemble methods are used instead.
Strategies to prevent overfitting
- Pre-pruning: stopping the creation of the tree early, for example by limiting the maximum depth of the tree, limiting the maximum number of leaves, or requiring a minimum number of points in a node to keep splitting it (max_depth, max_leaf_nodes, or min_samples_leaf).
- If we don't restrict the depth of a decision tree, it can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and do not generalize well to new data.
- Post-pruning: building the tree and then removing or collapsing nodes that contain little information.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.937
Let’s apply pre-pruning to the tree, which will stop developing
the tree before we perfectly fit to the training data. Limiting the
depth of the tree decreases overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
Accuracy on training set: 0.988
Accuracy on test set: 0.951
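scikit-learn also supports post-pruning through cost-complexity pruning (the ccp_alpha parameter, available since version 0.22). A minimal sketch on the same data; the alpha value is illustrative, not tuned.
# Post-pruning via cost-complexity pruning: larger ccp_alpha prunes more aggressively.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned_tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(pruned_tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(pruned_tree.score(X_test, y_test)))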
Feature importance in trees
- Feature importance is a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means “perfectly predicts the target.”
- The feature importances always sum to 1.
print("Feature importances:\n{}".format(tree.feature_importances_))
Feature importances: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01019737 0.04839825 0. 0. 0.0024156 0. 0. 0. 0. 0. 0.72682851 0.0458159 0. 0. 0.0141577 0. 0.018188 0.1221132 0.01188548 0. ]
def plot_feature_importances_cancer(model):
n_features = cancer.data.shape[1]
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features)
plot_feature_importances_cancer(tree)
The feature used in the top split (“worst radius”) is by far **the most
important feature.** This confirms our observation in analyzing the tree that the first level already separates the two classes fairly well.
However, if a feature has a low value in feature_importances_, it doesn't mean that this feature is uninformative.
It only means that the feature was not picked by the tree, likely because another feature encodes the same information.
Ensembles of Decision Trees
Ensembles are methods that combine multiple machine learning models to create more powerful models.
Random forests
Random forests are one way to address the overfitting problem of decision trees. A random forest
is essentially a collection of decision trees, where each tree is slightly different from the others.
The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown using rigorous mathematics.
Pros and Cons
Advantages:
- They are very powerful, often work well without heavy tuning of the parameters, and don't require scaling of the data.
- While building random forests on large datasets might be somewhat time consuming, training can easily be parallelized across multiple CPU cores within a computer; you can set n_jobs=-1 to use all the cores in your computer.

Disadvantages:
- Random forests require more memory and are slower to train and to predict than linear models.
- Random forests don't tend to perform well on very high-dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate.
Process: Bagging (Bootstrap Aggregating)
- To build a random forest model, you need to decide on the number of trees to build (the n_estimators parameter). Say we build 10 trees.
- To build a tree, we first take what is called a bootstrap sample of our data: from our n_samples data points, we repeatedly draw an example randomly with replacement (meaning the same sample can be picked multiple times) until we have n_samples points.
- A decision tree is built on this newly created dataset.
- Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and it looks for the best possible test involving one of these features. The number of features that are selected is controlled by the max_features parameter.
A critical parameter in this process is max_features. If we set max_features to n_features, each split can look at all features in the dataset, and no randomness will be injected in the feature selection (the randomness due to the bootstrapping remains, though).
- A high max_features means that the trees in the random forest will be quite similar, and they will be able to fit the data easily, using the most distinctive features.
- A low max_features means that the trees in the random forest will be quite different, and that each tree might need to be very deep in order to fit the data well.
Prediction
- To make a prediction with the random forest, the algorithm first makes a prediction with every tree in the forest.
- For regression, these results are averaged to get the final prediction.
- For classification, a soft voting strategy is used: each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest average probability is predicted (see the sketch below).
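A minimal sketch of this soft-voting step, written against scikit-learn's public API rather than its internals; the forest here is just a toy model on make_moons.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
X_toy, y_toy = make_moons(n_samples=100, noise=0.25, random_state=3)
toy_forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_toy, y_toy)
# Average the per-class probabilities predicted by every tree in the forest.
avg_proba = np.mean([t.predict_proba(X_toy) for t in toy_forest.estimators_], axis=0)
# Predict the class with the highest average probability.
manual_pred = toy_forest.classes_[np.argmax(avg_proba, axis=1)]
# This should normally agree with the forest's own prediction.
print(np.array_equal(manual_pred, toy_forest.predict(X_toy)))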
Parameters
- n_estimators: larger is always better. Averaging more trees will yield a more robust ensemble by reducing overfitting. However, there are diminishing returns, and more trees need more memory and more time to train. A common rule of thumb is to build "as many as you have time/memory for."
- max_features: determines how random each tree is. A smaller max_features reduces overfitting. A good rule of thumb is to use the default values: max_features=sqrt(n_features) for classification and max_features=n_features for regression.
- Adding max_features or max_leaf_nodes might sometimes improve performance. It can also drastically reduce space and time requirements for training and prediction.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
random_state=42)
forest = RandomForestClassifier(n_estimators=5, random_state=2, n_jobs=-1)
forest.fit(X_train, y_train)
RandomForestClassifier(n_estimators=5, n_jobs=-1, random_state=2)
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
ax.set_title("Tree {}".format(i))
mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1],
alpha=.4)
axes[-1, -1].set_title("Random Forest")
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
[<matplotlib.lines.Line2D at 0x7f957940cf10>, <matplotlib.lines.Line2D at 0x7f957940ca60>]
- The decision boundaries learned by the five trees are quite different.
- Each of them makes some mistakes, as some of the training points plotted here were not actually included in the training set of the corresponding tree, due to the bootstrap sampling.
- The random forest overfits less than any of the trees individually and provides a much more intuitive decision boundary.
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))
plot_feature_importances_cancer(forest)
Accuracy on training set: 1.000
Accuracy on test set: 0.972
Similarly to the single decision tree, the random forest also gives
a lot of importance to the “worst radius” feature, but it actually chooses “worst perimeter” to be the most informative feature overall. The randomness in building the random forest forces the algorithm to consider many possible explanations, the result being that the random forest captures a much broader picture of the data than a single tree.
GBM (Gradient boosting machines)
The gradient boosted regression tree is another ensemble method that combines multiple decision trees to create a more powerful model.
In contrast to the
random forest approach, gradient boosting works by building trees in a serial manner,
where each tree tries to correct the mistakes of the previous one.
The main idea behind gradient boosting is to combine many simple models (in this context known as weak learners), like shallow trees.
Each tree can only provide good predictions on part of the data, and so more and more trees are added to iteratively improve performance.
They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters
are set correctly.
Pros and Cons
Advantages:
- Gradient boosted decision trees are among the most powerful and widely used models for supervised learning.
- Similarly to other tree-based models, the algorithm works well without scaling and on a mixture of binary and continuous features.

Disadvantages:
- They require careful tuning of the parameters and may take a long time to train.
- As with other tree-based models, gradient boosting often does not work well on high-dimensional sparse data.
Process
- As both gradient boosting and random forests perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly.
- If random forests work well but prediction time is at a premium, or it is important to squeeze out the last percentage of accuracy from the machine learning model, moving to gradient boosting often helps.
- If you want to apply gradient boosting to a large-scale problem, it might be worth looking into the xgboost package and its Python interface, which at the time of writing is faster (and sometimes easier to tune) than the scikit-learn implementation of gradient boosting on many datasets.
Parameters
- learning_rate: controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
- n_estimators (the number of trees) also increases the model complexity, as the model has more chances to correct mistakes on the training set.
- These two parameters are highly interconnected, as a lower learning_rate means that more trees are needed to build a model of similar complexity.
- In contrast to random forests, where a higher n_estimators value is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting.
- A common practice is to fit n_estimators depending on the time and memory budget, and then search over different values of learning_rate.
- Another important parameter is max_depth (or alternatively max_leaf_nodes), which reduces the complexity of each tree. max_depth is usually set very low, often not deeper than five splits.
from sklearn.ensemble import GradientBoostingClassifier
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.965
As the training set accuracy is 100%, we are likely to be overfitting. To reduce overfitting, we could either apply stronger pre-pruning by limiting the maximum depth or lower the learning rate:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 0.991 Accuracy on test set: 0.972
In this case, lowering the maximum depth of the trees provided a significant
improvement of the model, while lowering the learning rate only increased the
generalization performance slightly. As we used 100 trees, it
is impractical to inspect them all, even if they are all of depth 1.
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
plot_feature_importances_cancer(gbrt)
The feature importances of the gradient boosted trees are somewhat
similar to the feature importances of the random forests, though the gradient boosting
completely ignored some of the features.
XGBoost
XGBoost (or eXtreme Gradient Boost) optimizes the performance of algorithms, primarily decision trees, in a gradient boosting framework while minimizing overfitting/bias through regularization.
It is arguably one of the most powerful algorithms and is increasingly being used in all industries and in all problem domains. It is also a winning algorithm in many machine learning competitions; in fact, XGBoost was used in 17 of the 29 winning solutions published on Kaggle in 2015.
A key to its performance is its hyperparameters. While XGBoost is extremely easy to implement, the hard part is tuning the hyperparameters.
Important features of XGBoost include:
- parallel processing capabilities for large datasets
- can handle missing values
- can handle imbalanced datasets
- allows for regularization to prevent overfitting
- has built-in cross-validation
- supports early stopping: an approach to training complex machine learning models that avoids overfitting by stopping the training procedure once performance on a held-out evaluation set has not improved after a fixed number of training iterations.
Pros and Cons
Pros:
- A large and growing community of data scientists globally who actively contribute to XGBoost open source development.
- Usable for a wide range of applications, including regression, classification and user-defined prediction problems.
- A library that was built from the ground up to be efficient, flexible and portable.
- Automatic tree pruning is built in, so overfitting is less of a problem than with plain gradient boosting.

Cons:
- Although it alleviates the speed problem of plain GBM to some extent, training can still be slow.
- Hyperparameter tuning with GridSearchCV can be very slow.
Parameters
There are three types of parameters: general parameters (guide the overall functioning), booster parameters (guide the individual booster, tree or linear, at each step), and learning task parameters (guide the optimization performed).
Frequently tuned hyperparameters: you will usually tune the following parameters to optimize model performance.
- n_estimators: the number of decision trees to be boosted. If n_estimators=1, only one tree is generated, so no boosting is at work. The default value is 100, but you can play with this number for optimal performance.
- subsample: the subsample ratio of the training sample for each tree. subsample=0.5 means that 50% of the training data is used prior to growing a tree. The value can be any fraction; the default is 1.
- max_depth (default: 6): limits how deep each tree can grow. You can try other values if overfitting is an issue in your model.
- learning_rate (alias: eta): a regularization parameter that shrinks feature weights in each boosting step. The default value is 0.3, but people generally tune with values such as 0.01, 0.1, or 0.2.
- gamma (alias: min_split_loss): another regularization parameter for tree pruning. It specifies the minimum loss reduction required to make a further split. The default is 0.
- reg_alpha (alias: alpha): L1 regularization parameter. Default is 0.
- reg_lambda (alias: lambda): L2 regularization parameter. Default is 1.
Special use hyperparameters
- scale_pos_weight: useful when you have an imbalanced dataset, particularly in classification problems where one class is a small fraction of total observations (e.g. credit card fraud). The default value is 1, but you can use the ratio: total negative instances (e.g. no-fraud) / total positive instances (e.g. fraud).
- monotone_constraints: activate this parameter if you want to enforce a monotonic constraint on a predictor, for example a non-decreasing likelihood of credit-loan approval with a higher credit score.
- booster: the kind of booster to use. There are three options: 'dart', 'gbtree' (tree-based) and 'gblinear' (linear models).
- missing: not missing-value treatment exactly; rather, it specifies which value the algorithm should treat as missing (e.g. a negative value for a customer's age is impossible, so the algorithm can treat it as a missing value).
- eval_metric: specifies what evaluation metric to use, e.g. MAE, MSE, RMSE for regression and log loss for classification.
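As a rough illustration, these parameters can be passed through the scikit-learn-style XGBClassifier wrapper. The values below are placeholders rather than tuned recommendations, and passing eval_metric in the constructor assumes xgboost >= 1.6.
# Illustrative only: parameter values are placeholders, not tuned.
from xgboost import XGBClassifier
clf = XGBClassifier(
    n_estimators=200,       # number of boosted trees
    learning_rate=0.1,      # alias: eta
    max_depth=4,            # depth of each tree
    subsample=0.8,          # row subsampling per tree
    gamma=0.1,              # minimum loss reduction required to split
    reg_alpha=0.0,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    scale_pos_weight=1.0,   # raise for imbalanced data: n_negative / n_positive
    eval_metric='logloss',  # assumes xgboost >= 1.6
)
# clf.fit(X_train, y_train)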
Visualization with XGBoost
- xgboost.plot_importance(model): plots the importance of each feature.
- xgboost.plot_tree(model): plots an individual decision tree from the model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import xgboost as xgb
from xgboost import plot_importance
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score
import warnings
warnings.filterwarnings('ignore')
# check the installed xgboost version
print(xgb.__version__)
1.7.5
dataset = load_breast_cancer()
X_features = dataset.data
y_label = dataset.target
cancer_df = pd.DataFrame(data=X_features,columns=dataset.feature_names)
cancer_df['target'] = y_label  # add the target label as the last column
cancer_df.head(3)
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 rows × 31 columns
print(dataset.target_names)
print(cancer_df['target'].value_counts())  # malignant = 0, benign = 1
['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64
# Split data into Train (80%)/Test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)
print(X_train.shape, X_test.shape)
(455, 30) (114, 30)
As an extra step, we need to store the data in a DMatrix object, the data structure that XGBoost's native API works with.
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)
XGBoost supports a suite of evaluation metrics, including but not limited to:
- rmse for root mean squared error.
- mae for mean absolute error.
- logloss for binary logarithmic loss and mlogloss for multi-class log loss (cross entropy).
- error for classification error.
- auc for area under the ROC curve.
params = {'max_depth': 3,
          'eta': 0.1,
          'objective': 'binary:logistic',
          'eval_metric': 'logloss'
          }
# early stopping is passed to xgb.train() below via early_stopping_rounds
num_rounds = 400
We define our test set as ‘eval’ to use early stopping.
wlist = [(dtrain,'train'),(dtest,'eval')]
# passed as parameters to the xgb.train() function
xgb_model = xgb.train(params = params, dtrain=dtrain, num_boost_round=num_rounds, \
early_stopping_rounds=100, evals=wlist)
[0] train-logloss:0.60969 eval-logloss:0.61352
[1] train-logloss:0.54080 eval-logloss:0.54784
[2] train-logloss:0.48375 eval-logloss:0.49425
[3] train-logloss:0.43446 eval-logloss:0.44799
[4] train-logloss:0.39055 eval-logloss:0.40911
[5] train-logloss:0.35415 eval-logloss:0.37498
[6] train-logloss:0.32122 eval-logloss:0.34571
[7] train-logloss:0.29259 eval-logloss:0.32053
[8] train-logloss:0.26747 eval-logloss:0.29721
[9] train-logloss:0.24515 eval-logloss:0.27799
[10] train-logloss:0.22569 eval-logloss:0.26030
[11] train-logloss:0.20794 eval-logloss:0.24604
[12] train-logloss:0.19218 eval-logloss:0.23156
[13] train-logloss:0.17792 eval-logloss:0.22005
[14] train-logloss:0.16522 eval-logloss:0.20857
[15] train-logloss:0.15362 eval-logloss:0.19999
[16] train-logloss:0.14333 eval-logloss:0.19012
[17] train-logloss:0.13398 eval-logloss:0.18182
[18] train-logloss:0.12560 eval-logloss:0.17473
[19] train-logloss:0.11729 eval-logloss:0.16766
[20] train-logloss:0.10969 eval-logloss:0.15820
[21] train-logloss:0.10297 eval-logloss:0.15472
[22] train-logloss:0.09707 eval-logloss:0.14895
[23] train-logloss:0.09143 eval-logloss:0.14331
[24] train-logloss:0.08634 eval-logloss:0.13634
[25] train-logloss:0.08131 eval-logloss:0.13278
[26] train-logloss:0.07686 eval-logloss:0.12791
[27] train-logloss:0.07284 eval-logloss:0.12526
[28] train-logloss:0.06925 eval-logloss:0.11998
[29] train-logloss:0.06555 eval-logloss:0.11641
[30] train-logloss:0.06241 eval-logloss:0.11450
[31] train-logloss:0.05959 eval-logloss:0.11257
[32] train-logloss:0.05710 eval-logloss:0.11154
[33] train-logloss:0.05441 eval-logloss:0.10868
[34] train-logloss:0.05204 eval-logloss:0.10668
[35] train-logloss:0.04975 eval-logloss:0.10421
[36] train-logloss:0.04775 eval-logloss:0.10296
[37] train-logloss:0.04585 eval-logloss:0.10058
[38] train-logloss:0.04401 eval-logloss:0.09868
[39] train-logloss:0.04226 eval-logloss:0.09644
[40] train-logloss:0.04065 eval-logloss:0.09587
[41] train-logloss:0.03913 eval-logloss:0.09424
[42] train-logloss:0.03738 eval-logloss:0.09471
[43] train-logloss:0.03611 eval-logloss:0.09427
[44] train-logloss:0.03494 eval-logloss:0.09389
[45] train-logloss:0.03365 eval-logloss:0.09418
[46] train-logloss:0.03253 eval-logloss:0.09402
[47] train-logloss:0.03148 eval-logloss:0.09236
[48] train-logloss:0.03039 eval-logloss:0.09301
[49] train-logloss:0.02947 eval-logloss:0.09127
[50] train-logloss:0.02854 eval-logloss:0.09005
[51] train-logloss:0.02753 eval-logloss:0.08961
[52] train-logloss:0.02656 eval-logloss:0.08958
[53] train-logloss:0.02568 eval-logloss:0.09070
[54] train-logloss:0.02500 eval-logloss:0.08958
[55] train-logloss:0.02430 eval-logloss:0.09036
[56] train-logloss:0.02357 eval-logloss:0.09159
[57] train-logloss:0.02296 eval-logloss:0.09153
[58] train-logloss:0.02249 eval-logloss:0.09199
[59] train-logloss:0.02185 eval-logloss:0.09195
[60] train-logloss:0.02132 eval-logloss:0.09194
[61] train-logloss:0.02079 eval-logloss:0.09146
[62] train-logloss:0.02022 eval-logloss:0.09031
[63] train-logloss:0.01970 eval-logloss:0.08941
[64] train-logloss:0.01918 eval-logloss:0.08972
[65] train-logloss:0.01872 eval-logloss:0.08974
[66] train-logloss:0.01833 eval-logloss:0.08962
[67] train-logloss:0.01787 eval-logloss:0.08873
[68] train-logloss:0.01760 eval-logloss:0.08862
[69] train-logloss:0.01724 eval-logloss:0.08974
[70] train-logloss:0.01688 eval-logloss:0.08998
[71] train-logloss:0.01664 eval-logloss:0.08978
[72] train-logloss:0.01629 eval-logloss:0.08958
[73] train-logloss:0.01598 eval-logloss:0.08953
[74] train-logloss:0.01566 eval-logloss:0.08875
[75] train-logloss:0.01539 eval-logloss:0.08860
[76] train-logloss:0.01515 eval-logloss:0.08812
[77] train-logloss:0.01488 eval-logloss:0.08840
[78] train-logloss:0.01464 eval-logloss:0.08874
[79] train-logloss:0.01449 eval-logloss:0.08815
[80] train-logloss:0.01418 eval-logloss:0.08758
[81] train-logloss:0.01400 eval-logloss:0.08741
[82] train-logloss:0.01377 eval-logloss:0.08849
[83] train-logloss:0.01357 eval-logloss:0.08857
[84] train-logloss:0.01341 eval-logloss:0.08807
[85] train-logloss:0.01325 eval-logloss:0.08764
[86] train-logloss:0.01311 eval-logloss:0.08742
[87] train-logloss:0.01293 eval-logloss:0.08761
[88] train-logloss:0.01271 eval-logloss:0.08707
[89] train-logloss:0.01254 eval-logloss:0.08727
[90] train-logloss:0.01235 eval-logloss:0.08716
[91] train-logloss:0.01223 eval-logloss:0.08696
[92] train-logloss:0.01206 eval-logloss:0.08717
[93] train-logloss:0.01193 eval-logloss:0.08707
[94] train-logloss:0.01182 eval-logloss:0.08659
[95] train-logloss:0.01165 eval-logloss:0.08612
[96] train-logloss:0.01148 eval-logloss:0.08714
[97] train-logloss:0.01136 eval-logloss:0.08677
[98] train-logloss:0.01124 eval-logloss:0.08669
[99] train-logloss:0.01113 eval-logloss:0.08655
[100] train-logloss:0.01100 eval-logloss:0.08650
[101] train-logloss:0.01085 eval-logloss:0.08641
[102] train-logloss:0.01075 eval-logloss:0.08629
[103] train-logloss:0.01064 eval-logloss:0.08626
[104] train-logloss:0.01050 eval-logloss:0.08683
[105] train-logloss:0.01040 eval-logloss:0.08677
[106] train-logloss:0.01030 eval-logloss:0.08732
[107] train-logloss:0.01020 eval-logloss:0.08730
[108] train-logloss:0.01007 eval-logloss:0.08728
[109] train-logloss:0.01000 eval-logloss:0.08730
[110] train-logloss:0.00991 eval-logloss:0.08729
[111] train-logloss:0.00980 eval-logloss:0.08800
[112] train-logloss:0.00971 eval-logloss:0.08794
[113] train-logloss:0.00963 eval-logloss:0.08784
[114] train-logloss:0.00956 eval-logloss:0.08807
[115] train-logloss:0.00948 eval-logloss:0.08765
[116] train-logloss:0.00942 eval-logloss:0.08730
[117] train-logloss:0.00931 eval-logloss:0.08780
[118] train-logloss:0.00923 eval-logloss:0.08775
[119] train-logloss:0.00915 eval-logloss:0.08768
[120] train-logloss:0.00912 eval-logloss:0.08763
[121] train-logloss:0.00902 eval-logloss:0.08757
[122] train-logloss:0.00897 eval-logloss:0.08755
[123] train-logloss:0.00890 eval-logloss:0.08716
[124] train-logloss:0.00884 eval-logloss:0.08767
[125] train-logloss:0.00880 eval-logloss:0.08774
[126] train-logloss:0.00871 eval-logloss:0.08827
[127] train-logloss:0.00865 eval-logloss:0.08831
[128] train-logloss:0.00861 eval-logloss:0.08827
[129] train-logloss:0.00856 eval-logloss:0.08789
[130] train-logloss:0.00846 eval-logloss:0.08886
[131] train-logloss:0.00842 eval-logloss:0.08868
[132] train-logloss:0.00839 eval-logloss:0.08874
[133] train-logloss:0.00830 eval-logloss:0.08922
[134] train-logloss:0.00827 eval-logloss:0.08918
[135] train-logloss:0.00822 eval-logloss:0.08882
[136] train-logloss:0.00816 eval-logloss:0.08851
[137] train-logloss:0.00808 eval-logloss:0.08848
[138] train-logloss:0.00805 eval-logloss:0.08839
[139] train-logloss:0.00797 eval-logloss:0.08915
[140] train-logloss:0.00795 eval-logloss:0.08911
[141] train-logloss:0.00790 eval-logloss:0.08876
[142] train-logloss:0.00787 eval-logloss:0.08868
[143] train-logloss:0.00785 eval-logloss:0.08839
[144] train-logloss:0.00778 eval-logloss:0.08927
[145] train-logloss:0.00775 eval-logloss:0.08924
[146] train-logloss:0.00773 eval-logloss:0.08914
[147] train-logloss:0.00769 eval-logloss:0.08891
[148] train-logloss:0.00762 eval-logloss:0.08942
[149] train-logloss:0.00760 eval-logloss:0.08939
[150] train-logloss:0.00757 eval-logloss:0.08911
[151] train-logloss:0.00752 eval-logloss:0.08873
[152] train-logloss:0.00750 eval-logloss:0.08872
[153] train-logloss:0.00746 eval-logloss:0.08848
[154] train-logloss:0.00741 eval-logloss:0.08847
[155] train-logloss:0.00739 eval-logloss:0.08855
[156] train-logloss:0.00737 eval-logloss:0.08852
[157] train-logloss:0.00735 eval-logloss:0.08855
[158] train-logloss:0.00732 eval-logloss:0.08827
[159] train-logloss:0.00730 eval-logloss:0.08830
[160] train-logloss:0.00728 eval-logloss:0.08828
[161] train-logloss:0.00726 eval-logloss:0.08801
[162] train-logloss:0.00724 eval-logloss:0.08776
[163] train-logloss:0.00722 eval-logloss:0.08778
[164] train-logloss:0.00720 eval-logloss:0.08778
[165] train-logloss:0.00718 eval-logloss:0.08752
[166] train-logloss:0.00716 eval-logloss:0.08754
[167] train-logloss:0.00714 eval-logloss:0.08764
[168] train-logloss:0.00712 eval-logloss:0.08739
[169] train-logloss:0.00710 eval-logloss:0.08738
[170] train-logloss:0.00708 eval-logloss:0.08730
[171] train-logloss:0.00707 eval-logloss:0.08737
[172] train-logloss:0.00705 eval-logloss:0.08740
[173] train-logloss:0.00703 eval-logloss:0.08739
[174] train-logloss:0.00701 eval-logloss:0.08713
[175] train-logloss:0.00699 eval-logloss:0.08716
[176] train-logloss:0.00697 eval-logloss:0.08695
[177] train-logloss:0.00695 eval-logloss:0.08705
[178] train-logloss:0.00694 eval-logloss:0.08697
[179] train-logloss:0.00692 eval-logloss:0.08697
[180] train-logloss:0.00690 eval-logloss:0.08704
[181] train-logloss:0.00688 eval-logloss:0.08680
[182] train-logloss:0.00687 eval-logloss:0.08683
[183] train-logloss:0.00685 eval-logloss:0.08658
[184] train-logloss:0.00683 eval-logloss:0.08659
[185] train-logloss:0.00681 eval-logloss:0.08661
[186] train-logloss:0.00680 eval-logloss:0.08637
[187] train-logloss:0.00678 eval-logloss:0.08637
[188] train-logloss:0.00676 eval-logloss:0.08630
[189] train-logloss:0.00675 eval-logloss:0.08610
[190] train-logloss:0.00673 eval-logloss:0.08602
[191] train-logloss:0.00671 eval-logloss:0.08605
[192] train-logloss:0.00670 eval-logloss:0.08615
[193] train-logloss:0.00668 eval-logloss:0.08592
[194] train-logloss:0.00667 eval-logloss:0.08591
[195] train-logloss:0.00665 eval-logloss:0.08598
[196] train-logloss:0.00663 eval-logloss:0.08601
[197] train-logloss:0.00662 eval-logloss:0.08592
[198] train-logloss:0.00660 eval-logloss:0.08585
[199] train-logloss:0.00659 eval-logloss:0.08587
[200] train-logloss:0.00657 eval-logloss:0.08589
[201] train-logloss:0.00656 eval-logloss:0.08595
[202] train-logloss:0.00654 eval-logloss:0.08573
[203] train-logloss:0.00653 eval-logloss:0.08573
[204] train-logloss:0.00651 eval-logloss:0.08575
[205] train-logloss:0.00650 eval-logloss:0.08582
[206] train-logloss:0.00648 eval-logloss:0.08584
[207] train-logloss:0.00647 eval-logloss:0.08578
[208] train-logloss:0.00645 eval-logloss:0.08569
[209] train-logloss:0.00644 eval-logloss:0.08571
[210] train-logloss:0.00643 eval-logloss:0.08581
[211] train-logloss:0.00641 eval-logloss:0.08559
[212] train-logloss:0.00640 eval-logloss:0.08580
[213] train-logloss:0.00639 eval-logloss:0.08581
[214] train-logloss:0.00637 eval-logloss:0.08574
[215] train-logloss:0.00636 eval-logloss:0.08566
[216] train-logloss:0.00635 eval-logloss:0.08584
[217] train-logloss:0.00633 eval-logloss:0.08563
[218] train-logloss:0.00632 eval-logloss:0.08573
[219] train-logloss:0.00631 eval-logloss:0.08578
[220] train-logloss:0.00629 eval-logloss:0.08579
[221] train-logloss:0.00628 eval-logloss:0.08582
[222] train-logloss:0.00627 eval-logloss:0.08576
[223] train-logloss:0.00626 eval-logloss:0.08567
[224] train-logloss:0.00624 eval-logloss:0.08586
[225] train-logloss:0.00623 eval-logloss:0.08587
[226] train-logloss:0.00622 eval-logloss:0.08593
[227] train-logloss:0.00621 eval-logloss:0.08595
[228] train-logloss:0.00619 eval-logloss:0.08587
[229] train-logloss:0.00618 eval-logloss:0.08606
[230] train-logloss:0.00617 eval-logloss:0.08600
[231] train-logloss:0.00616 eval-logloss:0.08592
[232] train-logloss:0.00615 eval-logloss:0.08610
[233] train-logloss:0.00614 eval-logloss:0.08611
[234] train-logloss:0.00612 eval-logloss:0.08617
[235] train-logloss:0.00611 eval-logloss:0.08626
[236] train-logloss:0.00610 eval-logloss:0.08629
[237] train-logloss:0.00609 eval-logloss:0.08622
[238] train-logloss:0.00608 eval-logloss:0.08639
[239] train-logloss:0.00607 eval-logloss:0.08634
[240] train-logloss:0.00606 eval-logloss:0.08618
[241] train-logloss:0.00605 eval-logloss:0.08620
[242] train-logloss:0.00604 eval-logloss:0.08625
[243] train-logloss:0.00602 eval-logloss:0.08626
[244] train-logloss:0.00601 eval-logloss:0.08629
[245] train-logloss:0.00600 eval-logloss:0.08622
[246] train-logloss:0.00599 eval-logloss:0.08640
[247] train-logloss:0.00598 eval-logloss:0.08635
[248] train-logloss:0.00597 eval-logloss:0.08628
[249] train-logloss:0.00596 eval-logloss:0.08645
[250] train-logloss:0.00595 eval-logloss:0.08629
[251] train-logloss:0.00594 eval-logloss:0.08631
[252] train-logloss:0.00593 eval-logloss:0.08636
[253] train-logloss:0.00592 eval-logloss:0.08639
[254] train-logloss:0.00591 eval-logloss:0.08649
[255] train-logloss:0.00590 eval-logloss:0.08644
[256] train-logloss:0.00589 eval-logloss:0.08629
[257] train-logloss:0.00588 eval-logloss:0.08646
[258] train-logloss:0.00587 eval-logloss:0.08639
[259] train-logloss:0.00586 eval-logloss:0.08644
[260] train-logloss:0.00585 eval-logloss:0.08646
[261] train-logloss:0.00585 eval-logloss:0.08649
[262] train-logloss:0.00584 eval-logloss:0.08645
[263] train-logloss:0.00583 eval-logloss:0.08647
[264] train-logloss:0.00582 eval-logloss:0.08632
[265] train-logloss:0.00581 eval-logloss:0.08649
[266] train-logloss:0.00580 eval-logloss:0.08654
[267] train-logloss:0.00579 eval-logloss:0.08647
[268] train-logloss:0.00578 eval-logloss:0.08650
[269] train-logloss:0.00577 eval-logloss:0.08652
[270] train-logloss:0.00576 eval-logloss:0.08669
[271] train-logloss:0.00576 eval-logloss:0.08674
[272] train-logloss:0.00575 eval-logloss:0.08683
[273] train-logloss:0.00574 eval-logloss:0.08668
[274] train-logloss:0.00573 eval-logloss:0.08664
[275] train-logloss:0.00572 eval-logloss:0.08650
[276] train-logloss:0.00571 eval-logloss:0.08635
[277] train-logloss:0.00570 eval-logloss:0.08652
[278] train-logloss:0.00570 eval-logloss:0.08657
[279] train-logloss:0.00569 eval-logloss:0.08659
[280] train-logloss:0.00568 eval-logloss:0.08668
[281] train-logloss:0.00567 eval-logloss:0.08664
[282] train-logloss:0.00566 eval-logloss:0.08650
[283] train-logloss:0.00565 eval-logloss:0.08636
[284] train-logloss:0.00565 eval-logloss:0.08640
[285] train-logloss:0.00564 eval-logloss:0.08643
[286] train-logloss:0.00563 eval-logloss:0.08646
[287] train-logloss:0.00562 eval-logloss:0.08650
[288] train-logloss:0.00562 eval-logloss:0.08637
[289] train-logloss:0.00561 eval-logloss:0.08646
[290] train-logloss:0.00560 eval-logloss:0.08645
[291] train-logloss:0.00559 eval-logloss:0.08632
[292] train-logloss:0.00558 eval-logloss:0.08628
[293] train-logloss:0.00558 eval-logloss:0.08615
[294] train-logloss:0.00557 eval-logloss:0.08620
[295] train-logloss:0.00556 eval-logloss:0.08622
[296] train-logloss:0.00556 eval-logloss:0.08631
[297] train-logloss:0.00555 eval-logloss:0.08618
[298] train-logloss:0.00554 eval-logloss:0.08626
[299] train-logloss:0.00553 eval-logloss:0.08613
[300] train-logloss:0.00553 eval-logloss:0.08618
[301] train-logloss:0.00552 eval-logloss:0.08605
[302] train-logloss:0.00551 eval-logloss:0.08602
[303] train-logloss:0.00551 eval-logloss:0.08610
[304] train-logloss:0.00550 eval-logloss:0.08598
[305] train-logloss:0.00549 eval-logloss:0.08606
[306] train-logloss:0.00548 eval-logloss:0.08597
[307] train-logloss:0.00548 eval-logloss:0.08600
[308] train-logloss:0.00547 eval-logloss:0.08600
[309] train-logloss:0.00546 eval-logloss:0.08588
[310] train-logloss:0.00546 eval-logloss:0.08592
# sklearn-style predict returns 0 or 1 directly;
# xgboost's native predict returns probabilities, which we threshold at 0.5
pred_probs = xgb_model.predict(dtest)
preds = [1 if x > 0.5 else 0 for x in pred_probs]
print(preds[:10])
[1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
def get_clf_eval(y_test, pred=None, pred_prob=None):
confusion = confusion_matrix(y_test,pred)
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test,pred)
roc_auc = roc_auc_score(y_test, pred_prob)
print('confusion matrix')
print(confusion)
print('accuracy: {0:.4f}, precision: {1:.4f}, recall(sensitivity): {2:.4f}, f1_score: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
get_clf_eval(y_test, preds, pred_probs)
confusion matrix
[[35  2]
 [ 1 76]]
accuracy: 0.9737, precision: 0.9744, recall(sensitivity): 0.9870, f1_score: 0.9806, AUC: 0.9951
#f0 = the first feature, f1= the second feature
fig, ax = plt.subplots(figsize=(10,12))
plot_importance(xgb_model, ax=ax)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
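To draw an individual tree, xgboost also provides plot_tree; a minimal sketch (it additionally requires the graphviz package to be installed).
# Sketch: visualize the first boosted tree (requires graphviz).
from xgboost import plot_tree
fig, ax = plt.subplots(figsize=(30, 20))
plot_tree(xgb_model, num_trees=0, ax=ax)  # num_trees selects which tree to draw
plt.show()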
LightGBM
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise (level-wise).
So when growing on the same leaf, the leaf-wise algorithm in LightGBM can reduce more loss than the level-wise algorithm.
XGBoost has become a de facto algorithm for winning competitions at Kaggle, simply because it is extremely powerful. However, XGBoost takes a long time to train; its execution time is slower than LightGBM's.
Pros and Cons
Pros:
- Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training.
- Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.
- Better accuracy than many other boosting algorithms: it produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.
- Compatibility with large datasets: it performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.

Cons:
- On small datasets (fewer than roughly 10,000 samples), LightGBM is prone to overfitting.
Structural Differences in LightGBM & XGBoost
LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter out the data instances for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split (here, instances are observations/samples).
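The histogram idea itself is easy to illustrate. The toy sketch below only shows the bucketing of a continuous feature into discrete bins; it is not LightGBM's actual implementation.
# Toy illustration of histogram binning: split points are then searched over
# bin boundaries instead of raw feature values.
import numpy as np
rng = np.random.RandomState(0)
feature = rng.normal(size=1000)     # one continuous feature
n_bins = 16                         # plays the role of max_bin
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
binned = np.digitize(feature, bin_edges[1:-1])  # integer bin index per sample
print(binned.min(), binned.max())   # values now live in [0, n_bins - 1]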
Parameters
Tuning for overfitting
num_iterations: Same as n_estimators in XGBoost.
num_leaves: the number of leaf nodes to use. Having a large number of leaves will improve accuracy, but will also lead to overfitting.
min_child_samples: the minimum number of samples (data) to group into a leaf. The parameter can greatly assist with overfitting: larger sample sizes per leaf will reduce overfitting (but may lead to under-fitting).
max_depth (default = -1): controls the depth of the tree explicitly. Shallower trees reduce overfitting; -1 means unlimited. Since LightGBM splits the tree leaf-wise, its trees tend to be deeper than those of other algorithms.
Other tuning parameters
max_bin: the maximum number of bins that feature values are bucketed into. A smaller max_bin reduces overfitting.
min_child_weight: the minimum sum of hessians for a leaf. In conjunction with min_child_samples, larger values reduce overfitting.
bagging_fraction and bagging_freq: enables bagging (subsampling) of the training data. Both values need to be set for bagging to be used. The frequency controls how often (iteration) bagging is used. Smaller fractions and frequencies reduce overfitting.
feature_fraction: controls the subsampling of features used for training (as opposed to subsampling the actual training data in the case of bagging). Smaller fractions reduce overfitting.
lambda_l1 and lambda_l2: controls L1 and L2 regularization.
Tuning for imbalanced data
scale_pos_weight: the weight can be calculated from the number of negative and positive examples: scale_pos_weight = number of negative samples / number of positive samples.
Tuning for accuracy
max_bin: a larger max_bin increases accuracy.
learning_rate: using a smaller learning rate and increasing the number of iterations may improve accuracy.
num_leaves: increasing the number of leaves increases accuracy with a high risk of overfitting.
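As a quick illustration of how these knobs are passed to the scikit-learn-style wrapper (the values are placeholders, not recommendations; max_bin is forwarded to LightGBM as an extra keyword argument):
# Illustrative only: parameter values below are placeholders, not tuned.
from lightgbm import LGBMClassifier
clf = LGBMClassifier(
    n_estimators=400,       # num_iterations
    num_leaves=31,          # main complexity control
    min_child_samples=20,   # minimum data per leaf
    max_depth=-1,           # -1 = unlimited depth
    learning_rate=0.1,
    reg_lambda=0.0,         # lambda_l2
    max_bin=255,            # histogram bins per feature (passed through as a kwarg)
)
# clf.fit(X_train, y_train)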
from lightgbm import LGBMClassifier
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
dataset = load_breast_cancer()
ftr = dataset.data
target = dataset.target
X_train, X_test, y_train, y_test = train_test_split(ftr, target, test_size=0.2, random_state=140)
lgbm_wrapper = LGBMClassifier(n_estimators = 400)
evals = [(X_test, y_test)]
lgbm_wrapper.fit(X_train, y_train, early_stopping_rounds=100, eval_metric='logloss',
eval_set=evals, verbose=True)
preds = lgbm_wrapper.predict(X_test)
pred_proba = lgbm_wrapper.predict_proba(X_test)[:,1]
[1] valid_0's binary_logloss: 0.584137 [2] valid_0's binary_logloss: 0.531865 [3] valid_0's binary_logloss: 0.489679 [4] valid_0's binary_logloss: 0.445948 [5] valid_0's binary_logloss: 0.412324 [6] valid_0's binary_logloss: 0.38592 [7] valid_0's binary_logloss: 0.360361 [8] valid_0's binary_logloss: 0.337779 [9] valid_0's binary_logloss: 0.313687 [10] valid_0's binary_logloss: 0.297034 [11] valid_0's binary_logloss: 0.280963 [12] valid_0's binary_logloss: 0.269252 [13] valid_0's binary_logloss: 0.259373 [14] valid_0's binary_logloss: 0.24568 [15] valid_0's binary_logloss: 0.234193 [16] valid_0's binary_logloss: 0.224921 [17] valid_0's binary_logloss: 0.216948 [18] valid_0's binary_logloss: 0.208808 [19] valid_0's binary_logloss: 0.203406 [20] valid_0's binary_logloss: 0.197933 [21] valid_0's binary_logloss: 0.191554 [22] valid_0's binary_logloss: 0.187306 [23] valid_0's binary_logloss: 0.182242 [24] valid_0's binary_logloss: 0.179387 [25] valid_0's binary_logloss: 0.176955 [26] valid_0's binary_logloss: 0.170377 [27] valid_0's binary_logloss: 0.168472 [28] valid_0's binary_logloss: 0.167712 [29] valid_0's binary_logloss: 0.16591 [30] valid_0's binary_logloss: 0.163481 [31] valid_0's binary_logloss: 0.160653 [32] valid_0's binary_logloss: 0.159134 [33] valid_0's binary_logloss: 0.158466 [34] valid_0's binary_logloss: 0.15636 [35] valid_0's binary_logloss: 0.157044 [36] valid_0's binary_logloss: 0.157827 [37] valid_0's binary_logloss: 0.157982 [38] valid_0's binary_logloss: 0.159488 [39] valid_0's binary_logloss: 0.157836 [40] valid_0's binary_logloss: 0.15951 [41] valid_0's binary_logloss: 0.161029 [42] valid_0's binary_logloss: 0.161577 [43] valid_0's binary_logloss: 0.163691 [44] valid_0's binary_logloss: 0.164406 [45] valid_0's binary_logloss: 0.167129 [46] valid_0's binary_logloss: 0.1673 [47] valid_0's binary_logloss: 0.167575 [48] valid_0's binary_logloss: 0.16816 [49] valid_0's binary_logloss: 0.170114 [50] valid_0's binary_logloss: 0.170016 [51] valid_0's binary_logloss: 0.173092 [52] valid_0's binary_logloss: 0.174613 [53] valid_0's binary_logloss: 0.174627 [54] valid_0's binary_logloss: 0.175125 [55] valid_0's binary_logloss: 0.174675 [56] valid_0's binary_logloss: 0.174531 [57] valid_0's binary_logloss: 0.174217 [58] valid_0's binary_logloss: 0.175717 [59] valid_0's binary_logloss: 0.178332 [60] valid_0's binary_logloss: 0.180662 [61] valid_0's binary_logloss: 0.18416 [62] valid_0's binary_logloss: 0.18666 [63] valid_0's binary_logloss: 0.188004 [64] valid_0's binary_logloss: 0.190403 [65] valid_0's binary_logloss: 0.191344 [66] valid_0's binary_logloss: 0.192358 [67] valid_0's binary_logloss: 0.195783 [68] valid_0's binary_logloss: 0.195829 [69] valid_0's binary_logloss: 0.19694 [70] valid_0's binary_logloss: 0.200485 [71] valid_0's binary_logloss: 0.201875 [72] valid_0's binary_logloss: 0.201646 [73] valid_0's binary_logloss: 0.206463 [74] valid_0's binary_logloss: 0.207705 [75] valid_0's binary_logloss: 0.212814 [76] valid_0's binary_logloss: 0.21387 [77] valid_0's binary_logloss: 0.216174 [78] valid_0's binary_logloss: 0.219169 [79] valid_0's binary_logloss: 0.2235 [80] valid_0's binary_logloss: 0.226936 [81] valid_0's binary_logloss: 0.226004 [82] valid_0's binary_logloss: 0.225387 [83] valid_0's binary_logloss: 0.225136 [84] valid_0's binary_logloss: 0.225719 [85] valid_0's binary_logloss: 0.225678 [86] valid_0's binary_logloss: 0.227301 [87] valid_0's binary_logloss: 0.229565 [88] valid_0's binary_logloss: 0.231288 [89] valid_0's binary_logloss: 0.233607 [90] valid_0's 
binary_logloss: 0.235624 [91] valid_0's binary_logloss: 0.238434 [92] valid_0's binary_logloss: 0.24202 [93] valid_0's binary_logloss: 0.242375 [94] valid_0's binary_logloss: 0.245213 [95] valid_0's binary_logloss: 0.243981 [96] valid_0's binary_logloss: 0.244532 [97] valid_0's binary_logloss: 0.249277 [98] valid_0's binary_logloss: 0.249782 [99] valid_0's binary_logloss: 0.252167 [100] valid_0's binary_logloss: 0.253593 [101] valid_0's binary_logloss: 0.254262 [102] valid_0's binary_logloss: 0.255576 [103] valid_0's binary_logloss: 0.256773 [104] valid_0's binary_logloss: 0.258766 [105] valid_0's binary_logloss: 0.263502 [106] valid_0's binary_logloss: 0.265942 [107] valid_0's binary_logloss: 0.268441 [108] valid_0's binary_logloss: 0.268973 [109] valid_0's binary_logloss: 0.270916 [110] valid_0's binary_logloss: 0.273558 [111] valid_0's binary_logloss: 0.272805 [112] valid_0's binary_logloss: 0.273632 [113] valid_0's binary_logloss: 0.274197 [114] valid_0's binary_logloss: 0.274282 [115] valid_0's binary_logloss: 0.27571 [116] valid_0's binary_logloss: 0.277853 [117] valid_0's binary_logloss: 0.278483 [118] valid_0's binary_logloss: 0.279719 [119] valid_0's binary_logloss: 0.2815 [120] valid_0's binary_logloss: 0.283669 [121] valid_0's binary_logloss: 0.284815 [122] valid_0's binary_logloss: 0.287406 [123] valid_0's binary_logloss: 0.289447 [124] valid_0's binary_logloss: 0.290595 [125] valid_0's binary_logloss: 0.293048 [126] valid_0's binary_logloss: 0.298274 [127] valid_0's binary_logloss: 0.299845 [128] valid_0's binary_logloss: 0.300318 [129] valid_0's binary_logloss: 0.301592 [130] valid_0's binary_logloss: 0.305201 [131] valid_0's binary_logloss: 0.305358 [132] valid_0's binary_logloss: 0.306347 [133] valid_0's binary_logloss: 0.306694 [134] valid_0's binary_logloss: 0.307242
get_clf_eval(y_test, preds, pred_proba)
confusion matrix
[[36  4]
 [ 2 72]]
accuracy: 0.9474, precision: 0.9474, recall(sensitivity): 0.9730, f1_score: 0.9600, AUC: 0.9807
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(figsize = (10, 12))
plot_importance(lgbm_wrapper, ax = ax)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>
Kernelized Support Vector Machines
Kernelized support vector machines (often just referred to as SVMs) are an extension of linear support vector machines that allows for more complex models that are not defined simply by hyperplanes in the input space.
- The classification case is implemented in SVC, and the regression case in SVR.
The kernel trick
Adding nonlinear features to the representation of our data can
make linear models much more powerful. However, often we don’t know which features to add, and adding many features (like all possible interactions in a 100-dimensional feature space) might make computation very expensive.
The kernel trick works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion.
Two common kernels
- The polynomial kernel computes all possible polynomials up to a certain degree of the original features (like feature1^2 * feature2^5).
- The radial basis function (RBF) kernel, also known as the Gaussian kernel, considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.
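A small sketch contrasting the two kernels on a toy dataset; the parameters are illustrative, not tuned.
from sklearn.datasets import make_moons
from sklearn.svm import SVC
X_toy, y_toy = make_moons(n_samples=200, noise=0.25, random_state=0)
for kernel_params in ({"kernel": "poly", "degree": 3}, {"kernel": "rbf", "gamma": 1.0}):
    toy_svm = SVC(C=1.0, **kernel_params).fit(X_toy, y_toy)
    print(kernel_params, "training accuracy: {:.3f}".format(toy_svm.score(X_toy, y_toy)))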
Pros and Cons
Advantages:
- SVMs allow for complex decision boundaries, even if the data has only a few features.
- They work well on low-dimensional and high-dimensional data (i.e., few and many features).

Disadvantages:
- Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no preprocessing) in many applications.
- SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.
Understanding SVMs
During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes.
Only a subset of the training points matter for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors.
- To make a prediction for a new point, the distance to each of the support vectors is measured.
- A classification decision is made based on the distances to the support vectors and the importance of the support vectors that was learned during training (stored in the dual_coef_ attribute of SVC).
- The distance between data points is measured by the Gaussian kernel:

$$k_{\mathrm{rbf}}(x_1, x_2) = \exp\left(-\gamma \lVert x_1 - x_2 \rVert^2\right)$$

where $\gamma$ is a parameter that controls the width of the Gaussian kernel.
from sklearn.svm import SVC
X, y = mglearn.tools.make_handcrafted_dataset()
svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
mglearn.plots.plot_2d_separator(svm, X, eps=.5)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
# plot support vectors
sv = svm.support_vectors_
# class labels of support vectors are given by the sign of the dual coefficients
sv_labels = svm.dual_coef_.ravel() > 0
mglearn.discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
Text(0, 0.5, 'Feature 1')
- The decision boundary is shown in black, and the support vectors are the larger points with the wide outline.
- In this case, the SVM yields a very smooth and nonlinear (not a straight line) boundary.
- We adjusted two parameters here: the C parameter and the gamma parameter.
Parameters
- gamma: the inverse of the width of the Gaussian kernel.
- The wider the radius of the Gaussian kernel, the further the influence of each training example reaches.
- A small gamma means a large radius for the Gaussian kernel, which means that many points are considered close by, resulting in very smooth decision boundaries.
- C: a regularization parameter, similar to that used in the linear models. It limits the importance of each point.
- As with the linear models, a small C means a very restricted model, where each data point can only have very limited influence.
- Small C: the decision boundary looks nearly linear, with the misclassified points barely having any influence on the line.
- Large C: these points are allowed to have a stronger influence on the model, making the decision boundary bend to correctly classify them.
- gamma and C both control the complexity of the model, with large values in either resulting in a more complex model.
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for ax, C in zip(axes, [-1, 0, 3]):
for a, gamma in zip(ax, range(-1, 2)):
mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"],
ncol=4, loc=(.9, 1.2))
<matplotlib.legend.Legend at 0x7f9598677d90>
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
Accuracy on training set: 0.90
Accuracy on test set: 0.94
plt.boxplot(X_train)
plt.yscale("symlog")
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
Text(0, 0.5, 'Feature magnitude')
From this plot we can determine that features in the Breast Cancer dataset are of completely different orders of magnitude. This can have devastating effects for the kernel SVM.
Preprocessing data for SVMs
A common rescaling method for kernel SVMs is to scale the data such that all features are between 0 and 1: the MinMaxScaler preprocessing method.
# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))
# use THE SAME transformation on the test set,
# using min and range of the training set
X_test_scaled = (X_test - min_on_training) / range_on_training
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Minimum for each feature
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Maximum for each feature
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Accuracy on training set: 0.984
Accuracy on test set: 0.972
Scaling the data made a huge difference!
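The same rescaling can be done more conveniently with scikit-learn's MinMaxScaler; a minimal sketch that should give essentially the same result as the manual version above.
# Equivalent rescaling with MinMaxScaler: fit on the training set only,
# then apply the same transformation to both training and test sets.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
svc = SVC().fit(X_train_scaled, y_train)
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))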
svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
# Here, increasing C does not improve the model; test accuracy drops slightly to 95.8%.
Accuracy on training set: 1.000
Accuracy on test set: 0.958