Widely Used Cross Validation Techniques

A short and simple blog on various cross validation techniques.


Introduction

Cross validation is a step in building a machine learning model that helps us check that the model fits the data accurately and does not overfit.

When we train a model, we divide the data into two sets, train and test. The test part is called a hold-out set: data we keep outside the training. This way we prevent the model from learning the test dataset. However, this method is not useful everywhere. Choosing the right technique is very important, as different datasets may require different cross validation schemes.

The widely used techniques are:

  1. k-fold
  2. stratified
  3. hold-out 
  4. leave-one-out

K-fold Cross validation

Implementation:

import pandas as pd
from sklearn import model_selection

# read the data
df = pd.read_csv('path to file')

# add a column called kfold at the end and fill it with -1
df['kfold'] = -1

# randomize the rows; sample() returns a new frame, so assign it back
df = df.sample(frac=1).reset_index(drop=True)

# initiate the splitter
kf = model_selection.KFold(n_splits=5)

# fill the kfold column with the fold number of each validation row
for fold, (trn_, val_) in enumerate(kf.split(X=df)):
    df.loc[val_, 'kfold'] = fold

# save the data together with the new kfold column
df.to_csv('train_folds.csv', index=False)

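Here we divide the data into k smaller datasets called folds (the above code stores each row's fold number in the kfold column). In each round, one fold is used for validation and the other k-1 folds for training. It is a pretty simple technique.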
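To see how the kfold column is meant to be used, here is a minimal sketch on a made-up 10-row DataFrame (the column names feature and target are assumptions for illustration, not from a real dataset). Each fold takes one turn as the validation set while the rest is used for training.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Hypothetical toy data: 10 rows, one feature, one binary target
df = pd.DataFrame({"feature": np.arange(10), "target": [0, 1] * 5})
df["kfold"] = -1

# shuffle=True randomizes the rows inside the splitter itself
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trn_, val_) in enumerate(kf.split(X=df)):
    df.loc[val_, "kfold"] = fold

# Each fold takes one turn as the validation set
fold_sizes = []
for fold in range(5):
    train_df = df[df.kfold != fold]  # rows used for training in this round
    valid_df = df[df.kfold == fold]  # rows used for validation in this round
    fold_sizes.append((len(train_df), len(valid_df)))
    print(fold, len(train_df), len(valid_df))
```

In a real project you would fit and score a model inside the loop and then average the five validation scores.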

Stratified K-fold

Implementation:

import pandas as pd
from sklearn import model_selection

# read the data
df = pd.read_csv('path to file')

# create a new column called kfold and fill it with -1
df['kfold'] = -1

# randomize the rows; sample() returns a new frame, so assign it back
df = df.sample(frac=1).reset_index(drop=True)

# fetch targets
y = df.target.values

# initiate the splitter
kf = model_selection.StratifiedKFold(n_splits=5)

# fill the kfold column; y is used to keep the class ratio in each fold
for fold, (trn_, val_) in enumerate(kf.split(X=df, y=y)):
    df.loc[val_, 'kfold'] = fold

# save the data together with the new kfold column
df.to_csv('train_folds.csv', index=False)

Whenever we have skewed data, say 90% positive samples and 10% negative samples, plain K-fold is a poor choice because it can create folds with very few or even no negative samples.

So, in this situation, we can use the stratified K-fold technique. Here, we divide the dataset in such a way that each fold keeps the same 90:10 ratio. We do this using the target column (which has 0/1 values for binary classification), passing it as y to the splitter as in the above code.
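A small sketch of the difference, assuming a made-up target array with 90 positives followed by 10 negatives (the data and the helper negatives_per_fold are hypothetical, only for illustration). It counts how many negative samples land in each validation fold for both splitters.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical skewed targets: 90 positives followed by 10 negatives
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y matters here


def negatives_per_fold(splitter):
    # Count the negative samples that land in each validation fold
    return [int((y[val_] == 0).sum()) for _, val_ in splitter.split(X, y)]


plain = negatives_per_fold(model_selection.KFold(n_splits=5))
strat = negatives_per_fold(model_selection.StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

Without shuffling, plain K-fold puts all the negatives in the last fold here, while the stratified version spreads them evenly. Shuffling helps plain K-fold, but it still gives no guarantee about the ratio in each fold.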

A simple rule is, for standard classification problems, use stratified k-fold blindly.

Hold-out Based Cross validation

When we have a large number of samples, we can use a hold-out based technique, because training k separate models would be expensive. Hold-out based validation is as simple as k-fold: we divide the data into k folds, but then we hold out just one fold for validation and train the model once on the rest of the folds.
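As a rough sketch of the idea, the snippet below builds the folds but uses only the first split. The synthetic data from make_classification and the LogisticRegression model are assumptions chosen for illustration; any dataset and model would do.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for a large dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Build 5 folds but keep only the first split: one fold is held out
kf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
trn_, val_ = next(kf.split(X, y))

# Train once on the other four folds and score on the held-out fold
model = LogisticRegression(max_iter=1000)
model.fit(X[trn_], y[trn_])
holdout_accuracy = model.score(X[val_], y[val_])

print(len(trn_), len(val_), round(holdout_accuracy, 3))
```

The model is trained a single time, which is what makes this cheaper than running all five rounds.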

Leave-one-out

When we have very few samples, we can't keep a large fold of data just for testing, as this would leave us with very little data to train on. Here, we can use k-fold cross validation with k = N, where N is the number of samples. This means that in each round we train on all the samples except one and validate on that one sample.

This is a computationally heavy technique because it trains N models, but we have a very small dataset to begin with, so it doesn't matter much.
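scikit-learn also ships a LeaveOneOut splitter. The sketch below, on a made-up dataset of 6 samples, checks that it produces the same splits as KFold with n_splits equal to the number of samples.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical tiny dataset with N = 6 samples
X = np.arange(6).reshape(-1, 1)

# Leave-one-out: each sample is the validation set exactly once
loo = model_selection.LeaveOneOut()
loo_splits = [(trn_.tolist(), val_.tolist()) for trn_, val_ in loo.split(X)]

# K-fold with k = N (no shuffling) for comparison
kf = model_selection.KFold(n_splits=len(X))
kf_splits = [(trn_.tolist(), val_.tolist()) for trn_, val_ in kf.split(X)]

for trn_, val_ in loo_splits:
    print("train:", trn_, "validate:", val_)
```

Every round trains on 5 samples and validates on the remaining one, so 6 models are trained in total.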
