
Model Selection With K-fold Cross Validation — A Walkthrough with MachineHack’s Food Cost Prediction Hackathon


A Data Scientist works through a lot of data that needs to be cleaned, pre-processed, modelled and visualised. None of these processes is as simple as it may sound. Of all of them, at least some Data Science practitioners would agree that modelling is one of the easiest to get through, as long as you have built a model before or have a template to implement one easily.

There are plenty of models available today, easy to implement with extensive built-in library support, with everything from simple to complex mathematical calculations performed by just calling a function. Gradient Boosting, Logistic Regression, SVMs: they are all there, waiting to be called. On the bright side, we are presented with plenty of options and no need to write complex mathematical code ourselves. But with a big problem statement and a bigger dataset in hand, you will find that more choices also mean more headaches.

The choice of a model can be narrowed down to some extent by understanding the problem and the data you are presented with. Still, which model to choose and which will give the most optimal result is the question Data Scientists most often ask themselves, and high variance in predictions from the same model is a Data Scientist's worst nightmare.

This article dives directly into the implementation of K-Fold Cross-Validation, so readers are expected to have a basic idea of how K-Fold Cross-Validation works. Use this free guide to understand K-Fold Cross-Validation.

K-Fold Cross Validation With MachineHack’s Food Cost Prediction Hackathon

If you are beginning your Data Science journey and find yourself asking that question early on, this article will help you a great deal. In this article, we will use K-Fold Cross-Validation to find out which model best fits a given dataset and has a higher probability of giving better accuracy on your predictions.

Where to get the Data Sets?

Head to MachineHack’s Predicting Restaurant Food Cost Hackathon by clicking here. Sign up and start the course. You will find the dataset as PARTICIPANTS_DATA_FINAL in the Attachments section.

Having trouble finding the data set? Click here to go through the tutorial to help yourself.

Let us begin by stating a simple definition for K-Fold Cross Validation:

K-Fold Cross Validation involves training a specific model on (k - 1) different folds, or samples, of a limited dataset and then testing the results on the one remaining sample. For example, if k = 10, the first sample is reserved for validating the model after it has been fitted on the remaining (10 - 1) = 9 samples/folds; the process then repeats with each fold taking one turn as the validation set.
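To make the idea concrete, here is a minimal pure-Python sketch (not part of the hackathon code, and names are our own) that splits 10 sample indices into k = 5 folds, with each fold taking one turn as the validation set. In practice scikit-learn's KFold does this for you:

```python
# Illustrative sketch: how K-Fold splits a dataset into train/validation folds.
# (Pure Python for clarity; scikit-learn's KFold implements this properly.)

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        # The last fold absorbs any leftover samples when n_samples % k != 0.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

for train, validation in k_fold_indices(10, 5):
    print("train:", train, "| validate:", validation)
```

Each of the 5 iterations trains on 8 indices and validates on the remaining 2, so every sample is used for validation exactly once.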

Let’s Code!

Loading And Cleaning the Data:

Since there is already a tutorial on solving the hackathon, I will jump directly to implementing the K-Fold Cross-validation.

Click here to check out the tutorial for cleaning and preprocessing the data. Follow the tutorial up to and including the Data Preprocessing stage.

Choosing The Right Model With K-Fold Cross Validation

After the Data Preprocessing Stage, the data is now ready to be fitted to a model, but which one?

We will pick three algorithms and employ K-Fold Cross Validation to determine which one is the best.

1. XGBoost

We will use the xgboost library. Import the XGBRegressor and fit the training data – X_train and Y_train.

from xgboost import XGBRegressor
xgbr = XGBRegressor()
xgbr.fit(X_train, Y_train)

The model is now fitted with the data, all we need to do is perform cross-validation to determine the average accuracy we can expect from the xgbr model on different test sets.

The below block uses the cross_val_score method from scikit-learn’s model_selection package for K-Fold Cross-Validation.

from sklearn.model_selection import cross_val_score
XGB_accuracies = cross_val_score(estimator = xgbr, X = X_train, y = Y_train, cv = 10)
print("Mean_XGB_Acc : ", XGB_accuracies.mean())

The cross_val_score method takes the model to be validated (xgbr), X_train, Y_train and a parameter cv as arguments. cv = 10 specifies 10-fold cross-validation, meaning that 10 folds or samples are created and each is validated in turn. The method returns an array of 10 values, one score per fold. Note that for a regressor such as XGBRegressor, the default score reported by scikit-learn is R², not classification accuracy.
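Under the hood, cross_val_score is roughly equivalent to the following loop: hold one fold out, fit on the rest, score on the held-out fold, and repeat. The sketch below is a library-free illustration of that mechanism; it substitutes a trivial mean predictor for XGBoost and mean squared error for the score, so the numbers are only illustrative:

```python
# What cross_val_score does conceptually: fit on k-1 folds, score the held-out fold.
# A toy "model" (predict the training mean) keeps the sketch self-contained;
# in the article's code the estimator would be xgbr and the score R².

def cross_val_mse(y, k):
    """Return one held-out mean-squared-error per fold for a mean predictor."""
    n = len(y)
    fold_size = n // k
    scores = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        held_out = y[start:stop]
        train = y[:start] + y[stop:]
        prediction = sum(train) / len(train)  # "fitting" = computing the training mean
        mse = sum((v - prediction) ** 2 for v in held_out) / len(held_out)
        scores.append(mse)
    return scores

scores = cross_val_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
print("per-fold scores:", scores)
print("average held-out error:", sum(scores) / len(scores))
```

Averaging the per-fold scores, exactly as we do with XGB_accuracies.mean(), gives a single number summarising how the model generalises to unseen folds.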

Output

Mean_XGB_Acc :  0.6974719315431506
This implies that the XGBRegressor gives predictions with an average score of about 69% when tested against different folds of the data. You can also find the standard deviation of the fold scores by executing:

XGB_accuracies.std()

2. Random Forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=307, random_state=1)
rf.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
RF_accuracies = cross_val_score(estimator = rf, X = X_train, y = Y_train, cv = 10)
print("Mean_RF_Acc : ", RF_accuracies.mean())

Output

Mean_RF_Acc :  0.7129263668673727

3. Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(loss='huber', learning_rate=0.07, n_estimators=350, max_depth=6, subsample=1, verbose=False)
gbr.fit(X_train,Y_train)

from sklearn.model_selection import cross_val_score
GB_accuracies = cross_val_score(estimator = gbr, X = X_train, y = Y_train, cv = 10)
print("Mean_GB_Acc : ", GB_accuracies.mean())

Output

Mean_GB_Acc :  0.7256236594127805

The Better Model

Comparing the outputs of the three models, we can conclude that the GradientBoostingRegressor has a slightly higher probability of giving a better prediction: its mean cross-validation score of about 0.726 edges out Random Forest (0.713) and XGBoost (0.697).
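This comparison can be automated: collect each model's cross-validation scores, then pick the model with the highest mean (and, when means are close, prefer the lower standard deviation). A minimal sketch of that selection step; the per-fold numbers below are illustrative placeholders, not real fold scores, since in the notebook they would come from XGB_accuracies, RF_accuracies and GB_accuracies:

```python
from statistics import mean, stdev

# Placeholder per-fold CV scores for illustration only; in the notebook these
# arrays come from cross_val_score (XGB_accuracies, RF_accuracies, GB_accuracies).
cv_scores = {
    "XGBRegressor": [0.68, 0.71, 0.70, 0.69, 0.70],
    "RandomForestRegressor": [0.70, 0.72, 0.71, 0.71, 0.72],
    "GradientBoostingRegressor": [0.72, 0.73, 0.72, 0.73, 0.73],
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={mean(scores):.4f}, std={stdev(scores):.4f}")

# Select the model with the highest mean cross-validation score.
best = max(cv_scores, key=lambda name: mean(cv_scores[name]))
print("Best model by mean CV score:", best)
```

Reporting the standard deviation alongside the mean is worthwhile: a model whose fold scores swing widely may generalise less reliably than one with a slightly lower but steadier mean.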

Also, use the following links to our top tutorials to help you with MachineHack’s Hackathons:

  1. Flight Ticket Price Prediction Hackathon: Use These Resources To Crack Our MachineHack Data Science Challenge
  2. Hands-on Tutorial On Data Pre-processing In Python
  3. Data Preprocessing With R: Hands-On Tutorial
  4. Getting started with Linear regression Models in R
  5. How To Create Your first Artificial Neural Network In Python
  6. Getting started with Non Linear regression Models in R
  7. Beginners Guide To Creating Artificial Neural Networks In R

Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com