
Model Selection With K-fold Cross Validation — A Walkthrough with MachineHack’s Food Cost Prediction Hackathon


A Data Scientist works through a lot of data that needs to be cleaned, pre-processed, modelled and visualised. None of these processes is as simple as it may sound. Of all of them, at least some Data Science practitioners would agree that modelling is one of the easiest to get through, as long as you have built a model before or have a template to implement one easily.

There are plenty of models available today, easy to implement with extensive built-in library support, with everything from simple to complex mathematical calculations performed by just calling a function. Gradient Boosting, Logistic Regression, SVMs: they are all there, waiting to be called. On the bright side, we are presented with plenty of options and no need to write complex mathematical code ourselves. But with a big problem statement and a bigger dataset in hand, you will find that more choices also mean more headaches.

The choice of a model can be narrowed down to some extent by understanding the problem and the data you are presented with. Still, which model to choose and which will give the most optimal result is the question Data Scientists most often ask themselves, and high variance in predictions from the same model is a Data Scientist's worst nightmare.

This article dives directly into the implementation of K-Fold Cross-Validation, so readers are expected to have a basic idea of how K-Fold Cross-Validation works. Use this free guide to understand K-Fold Cross-Validation.

K-Fold Cross Validation With MachineHack’s Food Cost Prediction Hackathon

If you are beginning your Data Science journey and find yourself asking that question early on, this article will help you a great deal. In this article, we will use K-Fold Cross-Validation to find out which model best fits a given dataset and has a higher probability of giving better accuracy on your predictions.

Where to get the Data Sets?

Head to MachineHack’s Predicting Restaurant Food Cost Hackathon by clicking here. Sign up and start the course. You will find the dataset as PARTICIPANTS_DATA_FINAL in the Attachments section.

Having trouble finding the data set? Click here to go through the tutorial to help yourself.

Let us begin by stating a simple definition for K-Fold Cross Validation:

K-Fold Cross Validation involves training a specific model on (k - 1) different folds, or samples, of a limited dataset and then testing the results on the one remaining sample. For example, if k = 10, the first sample is reserved for validating the model after it has been fitted on the remaining (10 - 1) = 9 samples/folds; the process then repeats with each fold taking one turn as the validation set.
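To make the idea concrete, here is a minimal pure-Python sketch (not part of the hackathon code, and names are our own) that splits 10 sample indices into k = 5 folds, with each fold taking one turn as the validation set. In practice scikit-learn's KFold does this for you:

```python
# Illustrative sketch: how K-Fold splits a dataset into train/validation folds.
# (Pure Python for clarity; scikit-learn's KFold implements this properly.)

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        # The last fold absorbs any leftover samples when n_samples % k != 0.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

for train, validation in k_fold_indices(10, 5):
    print("train:", train, "| validate:", validation)
```

Each of the 5 iterations trains on 8 indices and validates on the remaining 2, so every sample is used for validation exactly once.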

Let’s Code!

Loading And Cleaning the Data:

Since there is already a tutorial on solving the hackathon, I will jump directly to implementing the K-Fold Cross-validation.

Click here to check out the tutorial for cleaning and preprocessing the data. Follow the tutorial up to and including the Data Preprocessing stage.

Choosing The Right Model With K-Fold Cross Validation

After the Data Preprocessing Stage, the data is now ready to be fitted to a model, but which one?

We will pick three algorithms and employ K-Fold Cross Validation to determine which one is the best.

1. XGBoost

We will use the xgboost library. Import the XGBRegressor and fit the training data – X_train and Y_train.

from xgboost import XGBRegressor
xgbr = XGBRegressor()
xgbr.fit(X_train, Y_train)

The model is now fitted with the data, all we need to do is perform cross-validation to determine the average accuracy we can expect from the xgbr model on different test sets.

The below block uses the cross_val_score method from scikit-learn’s model_selection package for K-Fold Cross-Validation.

from sklearn.model_selection import cross_val_score
XGB_accuracies = cross_val_score(estimator = xgbr, X = X_train, y = Y_train, cv = 10)
print("Mean_XGB_Acc : ", XGB_accuracies.mean())

The cross_val_score method takes the model to be validated (xgbr), X_train, Y_train and a parameter cv as arguments. cv = 10 specifies 10-fold cross-validation, meaning that 10 folds or samples are created and each is validated in turn. The method returns an array of 10 values, one score per fold. Note that for a regressor such as XGBRegressor, the default score reported by scikit-learn is R², not classification accuracy.
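Under the hood, cross_val_score is roughly equivalent to the following loop: hold one fold out, fit on the rest, score on the held-out fold, and repeat. The sketch below is a library-free illustration of that mechanism; it substitutes a trivial mean predictor for XGBoost and mean squared error for the score, so the numbers are only illustrative:

```python
# What cross_val_score does conceptually: fit on k-1 folds, score the held-out fold.
# A toy "model" (predict the training mean) keeps the sketch self-contained;
# in the article's code the estimator would be xgbr and the score R².

def cross_val_mse(y, k):
    """Return one held-out mean-squared-error per fold for a mean predictor."""
    n = len(y)
    fold_size = n // k
    scores = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        held_out = y[start:stop]
        train = y[:start] + y[stop:]
        prediction = sum(train) / len(train)  # "fitting" = computing the training mean
        mse = sum((v - prediction) ** 2 for v in held_out) / len(held_out)
        scores.append(mse)
    return scores

scores = cross_val_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
print("per-fold scores:", scores)
print("average held-out error:", sum(scores) / len(scores))
```

Averaging the per-fold scores, exactly as we do with XGB_accuracies.mean(), gives a single number summarising how the model generalises to unseen folds.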

Output

Mean_XGB_Acc :  0.6974719315431506
This implies that the XGBRegressor gives predictions with an average score of about 69% when tested against different folds of the data. You can also find the standard deviation of the fold scores by executing:

XGB_accuracies.std()

2. Random Forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=307, random_state=1)
rf.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
RF_accuracies = cross_val_score(estimator = rf, X = X_train, y = Y_train, cv = 10)
print("Mean_RF_Acc : ", RF_accuracies.mean())

Output

Mean_RF_Acc :  0.7129263668673727

3. Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(loss='huber', learning_rate=0.07, n_estimators=350, max_depth=6, subsample=1, verbose=False)
gbr.fit(X_train,Y_train)

from sklearn.model_selection import cross_val_score
GB_accuracies = cross_val_score(estimator = gbr, X = X_train, y = Y_train, cv = 10)
print("Mean_GB_Acc : ", GB_accuracies.mean())

Output

Mean_GB_Acc :  0.7256236594127805

The Better Model

Comparing the outputs of the three models, we can conclude that the GradientBoostingRegressor has a slightly higher probability of giving a better prediction: its mean cross-validation score of about 0.726 edges out Random Forest (0.713) and XGBoost (0.697).
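This comparison can be automated: collect each model's cross-validation scores, then pick the model with the highest mean (and, when means are close, prefer the lower standard deviation). A minimal sketch of that selection step; the per-fold numbers below are illustrative placeholders, not real fold scores, since in the notebook they would come from XGB_accuracies, RF_accuracies and GB_accuracies:

```python
from statistics import mean, stdev

# Placeholder per-fold CV scores for illustration only; in the notebook these
# arrays come from cross_val_score (XGB_accuracies, RF_accuracies, GB_accuracies).
cv_scores = {
    "XGBRegressor": [0.68, 0.71, 0.70, 0.69, 0.70],
    "RandomForestRegressor": [0.70, 0.72, 0.71, 0.71, 0.72],
    "GradientBoostingRegressor": [0.72, 0.73, 0.72, 0.73, 0.73],
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={mean(scores):.4f}, std={stdev(scores):.4f}")

# Select the model with the highest mean cross-validation score.
best = max(cv_scores, key=lambda name: mean(cv_scores[name]))
print("Best model by mean CV score:", best)
```

Reporting the standard deviation alongside the mean is worthwhile: a model whose fold scores swing widely may generalise less reliably than one with a slightly lower but steadier mean.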

Also, use the following links to our top tutorials to help you with MachineHack’s Hackathons:

  1. Flight Ticket Price Prediction Hackathon: Use These Resources To Crack Our MachineHack Data Science Challenge
  2. Hands-on Tutorial On Data Pre-processing In Python
  3. Data Preprocessing With R: Hands-On Tutorial
  4. Getting started with Linear regression Models in R
  5. How To Create Your first Artificial Neural Network In Python
  6. Getting started with Non Linear regression Models in R
  7. Beginners Guide To Creating Artificial Neural Networks In R

Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com