Cross-Validation (CV)

Train/Test Split

To evaluate our supervised models, we split our dataset into a training set and a test set, fit the model on the training set, and then score its predictions on the test set.

  • We are interested in how well our model can predict new data that it was not trained on.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

iris_dataset = load_iris()

# Keys
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

# Feature_names: A list of the descriptions of features
print("Feature names: \n{}".format(iris_dataset['feature_names'])) 

# Shape of the data array X(n: 150, p: 4)
print("Shape of data: {}".format(iris_dataset['data'].shape))

# Target y: 0 means setosa, 1 means versicolor, and 2 means virginica.
print("Target:\n{}".format(iris_dataset['target']))

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], test_size=0.2, random_state=12, shuffle=True)  # A fixed seed (random_state) makes the split reproducible
Keys of iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Shape of data: (150, 4)
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
  • Before splitting, the train_test_split function shuffles the dataset using a pseudorandom number generator.

  • By default, it extracts 75% of the rows in the data as the training set and the remaining 25% as the test set; since we passed test_size=0.2 above, our split is 80%/20% (see the snippet below).
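As a minimal sketch (reusing iris_dataset and train_test_split from above), the snippet below compares the default split proportions with the explicit test_size=0.2 used in this section; the variable names are just illustrative.

# Default split: roughly 75% training / 25% test
X_tr_def, X_te_def, y_tr_def, y_te_def = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=12)
print("Default split sizes:", X_tr_def.shape[0], X_te_def.shape[0])

# Explicit split: 80% training / 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.2, random_state=12)
print("test_size=0.2 sizes:", X_tr.shape[0], X_te.shape[0])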

# Instantiate a model and fit it to the training set
logreg = LogisticRegression().fit(X_train, y_train)
# evaluate the model on the test set
print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))
Test set score: 0.97

But a single train/test split does have its dangers:

  • what if the split we make isn’t random?
  • What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women, or only people at a certain age? (imagine a file ordered by one of these).
  • Such a skewed split gives a misleading estimate of generalization performance, even though that is exactly what we are trying to measure (the sketch below illustrates this on the class-ordered iris data). This is where cross-validation comes in.
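A minimal sketch of this danger, assuming the iris data loaded above: the samples are stored sorted by class, so splitting without shuffling puts only the last class into the test set. The exact scores will vary, so treat the comparison as illustrative.

import numpy as np

# Unshuffled split: the test set is the last 20% of rows, i.e. a single class
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.2, shuffle=False)
print("Classes in unshuffled test set:", np.unique(y_te))
score_unshuffled = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Shuffled split for comparison
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.2, shuffle=True, random_state=12)
score_shuffled = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print("Unshuffled split score: {:.2f}".format(score_unshuffled))
print("Shuffled split score:   {:.2f}".format(score_shuffled))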

For a better estimate of the generalization performance, instead of using a single split into a training and a validation set, we can use cross-validation to evaluate the performance of each parameter combination.

Cross-Validation

  • Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set.
  • When performing k-fold cross-validation (here, five-fold), we split our data into k subsets (folds), train on k-1 of them, and test on the remaining fold.

  • The first model is trained using the first fold as the test set, and the remaining folds (2–5) are used as the training set.

  • The model is built using the data in folds 2–5; the accuracy is evaluated on fold 1.

  • Benefits
    • When using cross-validation, each example will be in the test set exactly once: each example is in one of the folds, and each fold is the test set once. On the other hand, train_test_split performs a random split of the data.

    • We use our data more effectively. When using 10-fold cross-validation, we can use nine-tenths of the data (90%) to fit the model. More data will usually result in more accurate models.

  • Disadvantage
    • Increased computational cost: roughly k times slower than doing a single split of the data.

[Figure: five-fold cross-validation, where each of the five folds serves as the test set exactly once]
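To make the fold mechanics concrete before using the built-in helper, here is a minimal sketch with scikit-learn's KFold splitter; it trains one model per fold, so every sample lands in a test fold exactly once.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target

kf = KFold(n_splits=5, shuffle=True, random_state=12)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Fit on the k-1 training folds, score on the held-out fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    print("Fold {}: {} test samples, accuracy {:.3f}".format(fold, len(test_idx), fold_scores[-1]))
print("Mean of the fold accuracies: {:.3f}".format(np.mean(fold_scores)))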

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris_dataset = load_iris()
logreg = LogisticRegression()
scores = cross_val_score(logreg, iris_dataset.data, iris_dataset.target, cv=5)  # estimator, X, y, number of folds
print("Cross-validation scores: {}".format( np.round(scores, 3)))
# Compute the mean to summarize the cross-validation accuracy.
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Cross-validation scores: [0.967 1.    0.933 0.967 1.   ]
Average cross-validation score: 0.97
  • cross_val_score takes the dataset and applies cross-validation to split the data. For each split, it trains a model with the specified estimator (e.g., logistic regression, decision tree, ...) and measures the model's performance on the held-out fold, as controlled by the scoring parameter (see the sketch after this list).

  • Using the mean cross-validation, we can conclude that we expect the model to be around 97% accurate on average.

  • There is a relatively high variance in the accuracy between folds, ranging from 100% accuracy to 93% accuracy.

    • The model is very dependent on the particular folds used for training because of the small size of the dataset.
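As a sketch of the scoring machinery (reusing logreg and iris_dataset from the cell above), cross_validate reports several quantities per fold; the metric names 'accuracy' and 'f1_macro' are standard scikit-learn scorer names, chosen here just as examples.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(logreg, iris_dataset.data, iris_dataset.target,
                            cv=5, scoring=['accuracy', 'f1_macro'])
print("Per-fold accuracy:", np.round(cv_results['test_accuracy'], 3))
print("Per-fold macro F1:", np.round(cv_results['test_f1_macro'], 3))
print("Fit time per fold (s):", np.round(cv_results['fit_time'], 4))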

How many folds?

  • The more folds we have, the more we reduce the error due to bias, but the more we increase the error due to variance.
  • The computational price would go up too. The more folds you have, the longer it would take to compute it, and you would need more memory.
  • With a lower number of folds, we reduce the error due to variance, but the error due to bias will be bigger. It is also computationally cheaper.
  • For big datasets, k = 3 is usually advised. For smaller datasets, as discussed in the next section, it is common to use LOOCV (the sketch below compares a few values of k).
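As a rough sketch of the trade-off (reusing logreg, iris_dataset, and cross_val_score from above), we can compare the mean, spread, and runtime of the scores for a few values of k; the timings are machine-dependent and purely illustrative.

import time

for k in [3, 5, 10]:
    start = time.time()
    scores_k = cross_val_score(logreg, iris_dataset.data, iris_dataset.target, cv=k)
    elapsed = time.time() - start
    print("k={:>2}: mean={:.3f}, std={:.3f}, time={:.2f}s".format(
        k, scores_k.mean(), scores_k.std(), elapsed))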

Leave One Out Cross Validation (LOOCV)

  • In this type of cross-validation, the number of folds (subsets) equals the number of observations we have in the dataset.
  • We only use one data point for testing and the remaining data points for training.
  • We then average the scores across ALL of these folds; that average is our estimate of the model's generalization performance.
  • Because we build as many models as there are samples, this method is very computationally expensive and should only be used on small datasets.
from sklearn.model_selection import LeaveOneOut 
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
print("Number of splits:", loo.get_n_splits(X))


for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
Number of splits: 2
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
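LeaveOneOut can also be passed directly as the cv argument of cross_val_score; a minimal sketch on the iris data is shown below, where each of the 150 "fold scores" is simply 0 or 1 because every test set contains a single sample.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris_dataset = load_iris()
logreg = LogisticRegression(max_iter=1000)

# One fold per sample: 150 model fits on the iris data
loo_scores = cross_val_score(logreg, iris_dataset.data, iris_dataset.target, cv=LeaveOneOut())
print("Number of folds:", len(loo_scores))
print("Mean LOOCV accuracy: {:.3f}".format(loo_scores.mean()))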

Stratified k-fold CV

  • Splitting the dataset into k folds by simply taking consecutive 1/k-th chunks of the data might not always be a good idea.
    • In classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold.

    • For example, on the class-ordered iris data, one test set could contain only class 0 while the corresponding training set contains only classes 1 and 2. Because the classes in the training and test sets differ for all three splits, the three-fold cross-validation accuracy would be zero on this dataset.

    • If 90% of your samples belong to class A and 10% of your samples belong to class B, then stratified cross-validation ensures that in each fold, 90% of samples belong to class A and 10% of samples belong to class B.

  • For classification, scikit-learn therefore does not use plain k-fold by default but rather stratified k-fold cross-validation (a small comparison is sketched after this list).

  • For regression, Scikit-learn uses the standard k-fold cross-validation by default.
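A minimal sketch comparing the two splitters on the class-ordered iris data; it only prints the class counts in each test fold, which is enough to see the difference.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target

print("Plain KFold (no shuffling) on the class-ordered iris data:")
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("  class counts in test fold:", np.bincount(y[test_idx], minlength=3))

print("StratifiedKFold keeps the class proportions in every fold:")
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    print("  class counts in test fold:", np.bincount(y[test_idx], minlength=3))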

CV with groups

  • It is commonly used when there are groups in the data that are highly related.

  • GroupKFold is a variation of k-fold that ensures that the same group is not represented in both testing and training sets.
  • Say you want to build a system to recognize emotions from pictures of faces, and you collect a dataset of pictures of 100 people where each person is captured multiple times, showing various emotions.

  • It is likely that pictures of the same person will be in both the training and the test set.

  • To accurately evaluate the generalization to new faces, we must therefore ensure that the training and test sets contain images of different people.
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
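GroupKFold can also be plugged straight into cross_val_score via the groups argument. The sketch below reuses the toy X, y, and groups from above; a DummyClassifier stands in for a real model purely to show the plumbing, and the feature list is reshaped because scikit-learn expects a 2-D feature array.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X_2d = np.array(X).reshape(-1, 1)                      # 2-D feature array
clf = DummyClassifier(strategy="most_frequent")
group_scores = cross_val_score(clf, X_2d, y, groups=groups, cv=GroupKFold(n_splits=3))
print("Group-wise CV scores:", group_scores)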

Grid Search

import numpy as np
import pandas as pd
import mglearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target, test_size = 0.2, random_state= 12)

dtree = DecisionTreeClassifier()

#Specify the parameters you want to search over using a dictionary.
param_grid = {'max_depth':[1, 2, 3], 'min_samples_split':[2, 3, 6]}

#Instantiate the GridSearchCV class with the model, the parameter grid, and cv strategy.
grid_dtree = GridSearchCV(dtree, param_grid, cv=5, refit=True)  # refit=True (the default) refits the best estimator on the full training set

#Fit a new model on the training dataset with the parameters.
grid_dtree.fit(X_train, y_train)

results_df = pd.DataFrame(grid_dtree.cv_results_)
print(results_df[ ['params', 'mean_test_score', 'rank_test_score'] ] )

scores = np.array(results_df.mean_test_score).reshape(3, 3)

# Plot the mean cross-validation scores: the rows of the reshaped array vary over
# max_depth and the columns over min_samples_split
mglearn.tools.heatmap(scores, xlabel='min_samples_split', xticklabels=param_grid['min_samples_split'],
                      ylabel='max_depth', yticklabels=param_grid['max_depth'], cmap="viridis")
                                     params  mean_test_score  rank_test_score
0  {'max_depth': 1, 'min_samples_split': 2}         0.625000                7
1  {'max_depth': 1, 'min_samples_split': 3}         0.625000                7
2  {'max_depth': 1, 'min_samples_split': 6}         0.625000                7
3  {'max_depth': 2, 'min_samples_split': 2}         0.908333                4
4  {'max_depth': 2, 'min_samples_split': 3}         0.908333                4
5  {'max_depth': 2, 'min_samples_split': 6}         0.908333                4
6  {'max_depth': 3, 'min_samples_split': 2}         0.950000                1
7  {'max_depth': 3, 'min_samples_split': 3}         0.950000                1
8  {'max_depth': 3, 'min_samples_split': 6}         0.950000                1


  • The max_depth parameter is searching over interesting values (the mean score changes as it varies), but the min_samples_split parameter is not; alternatively, it could mean that min_samples_split is simply not important over this range.
#To evaluate how well the best found parameters generalize, we can call score on the test set.
print("Best parameters: {}".format(grid_dtree.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_dtree.best_score_))
print("Test set score: {:.2f}".format(grid_dtree.score(X_test, y_test)))
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best cross-validation score: 0.95
Test set score: 0.93
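Because the best max_depth (3) sits at the boundary of the searched range, a natural follow-up is to widen the grid and rerun the search. The sketch below uses a hypothetical extended grid (the exact values are arbitrary) and reuses X_train and y_train from above.

# Hypothetical refined grid: extend max_depth beyond the previous boundary
param_grid_refined = {'max_depth': [2, 3, 4, 5, 6], 'min_samples_split': [2, 6]}
grid_refined = GridSearchCV(DecisionTreeClassifier(), param_grid_refined, cv=5)
grid_refined.fit(X_train, y_train)

print("Refined best parameters: {}".format(grid_refined.best_params_))
print("Refined best cross-validation score: {:.2f}".format(grid_refined.best_score_))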
