K-Fold With Scikit-Learn
Let’s look at how to implement K-Fold cross-validation using Scikit-Learn. To do so, we import the KFold class from sklearn.model_selection. Below are the class’s signature, parameters, and methods.
sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
PARAMETERS:
- n_splits (int, default=5): Number of folds. Must be at least 2.
- shuffle (bool, default=False): Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.
- random_state (int, default=None): When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect.
METHODS:
- get_metadata_routing(): Get metadata routing of this object.
- get_n_splits(X=None, y=None, groups=None): Returns the number of splitting iterations in the cross-validator. All three parameters are ignored and exist only for compatibility with other cross-validators.
- split(X, y=None, groups=None): Generates indices to split the data into training and test sets. Here X is an array of shape (n_samples, n_features) holding the data to split, y is the target variable for supervised learning problems, and groups is ignored by KFold (it exists for compatibility with group-based cross-validators).
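The effect of the shuffle and random_state parameters can be sketched with a small example (the eight-sample array below is illustrative, not from the original text):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(-1, 1)  # 8 samples, 1 feature

# Without shuffling, folds are contiguous blocks of indices.
kf_plain = KFold(n_splits=4)
plain_test_folds = [test.tolist() for _, test in kf_plain.split(X)]

# With shuffle=True, indices are permuted before splitting;
# random_state makes the permutation reproducible across runs.
kf_shuffled = KFold(n_splits=4, shuffle=True, random_state=42)
shuffled_test_folds = [test.tolist() for _, test in kf_shuffled.split(X)]

print(plain_test_folds)     # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(shuffled_test_folds)  # permuted indices; fixed for a given random_state
```

Either way, every index still lands in exactly one test fold; shuffling only changes which indices are grouped together.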
Let’s create a synthetic regression dataset to analyse how the K-Fold split works. The code is as follows:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
# synthetic regression dataset
X, y = datasets.make_regression(
    n_samples=10, n_features=1, n_informative=1,
    noise=0, random_state=0)
# KFold split
kf = KFold(n_splits=4)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}:")
    print(f"  Training dataset index: {train_index}")
    print(f"  Test dataset index: {test_index}")
Output:
Fold 0:
Training dataset index: [3 4 5 6 7 8 9]
Test dataset index: [0 1 2]
Fold 1:
Training dataset index: [0 1 2 6 7 8 9]
Test dataset index: [3 4 5]
Fold 2:
Training dataset index: [0 1 2 3 4 5 8 9]
Test dataset index: [6 7]
Fold 3:
Training dataset index: [0 1 2 3 4 5 6 7]
Test dataset index: [8 9]
In the code above we created a synthetic regression dataset using the make_regression() method from sklearn; X is the input data and y is the target (label). The KFold class, via its split() method, divides the data into four folds, so the loop runs four iterations. Because 10 samples cannot be divided evenly into 4 folds, the first two folds contain three test samples each and the last two contain two. Notice that the training and test indices differ in every iteration, and that every sample appears in a test set exactly once, so the entire dataset is used for both training and testing. Let’s check the number of splits using the get_n_splits() method.
kf.get_n_splits(X)
Output:
4
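As a quick sanity check (this snippet is our addition, not part of the original example), we can confirm that across the four folds every sample index appears in a test set exactly once:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold

# Same synthetic dataset as in the example above.
X, y = datasets.make_regression(
    n_samples=10, n_features=1, n_informative=1,
    noise=0, random_state=0)

kf = KFold(n_splits=4)

# Collect the test indices from every fold into one array.
all_test_indices = np.concatenate([test for _, test in kf.split(X)])

print(sorted(all_test_indices.tolist()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This property is what makes K-Fold efficient: each sample is used for testing exactly once and for training in the remaining K-1 folds.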
Cross-Validation Using K-Fold With Scikit-Learn
Cross-validation involves repeatedly splitting data into training and testing sets to evaluate the performance of a machine-learning model. One of the most commonly used cross-validation techniques is K-Fold Cross-Validation. In this article, we will explore the implementation of K-Fold Cross-Validation using Scikit-Learn, a popular Python machine-learning library.
Table of Contents
- What is K-Fold Cross Validation?
- K-Fold With Scikit-Learn
- Visualizing K-Fold Cross-Validation Behavior
- Logistic Regression Model & K-Fold Cross Validating
- Cross-Validating Different Regression Models Using K-Fold (California Housing Dataset)
- Advantages & Disadvantages of K-Fold Cross Validation
- Additional Information
- Conclusions
- Frequently Asked Questions (FAQs)