Model selection and parameter tuning can be a headache at times. With so many algorithms available, picking the best one is not easy, and trying every single one of them is simply impractical. It is good practice, arguably a necessary one, to try out at least two or three algorithms, and this is one area where a Data Scientist has to rely on experience and practical knowledge.
These algorithms also come with a number of parameters that directly affect the predictions, which makes the task even harder: for each parameter there is a different set of values to try, and the combinations multiply quickly. This is where a technique called Grid Search comes in handy.
In this article, we will talk about parameter tuning using Grid Search Cross Validation and will implement it in Python.
Parameter Tuning With Grid Search
Where to get the Dataset?
Head to MACHINEHACK’s Predicting Restaurant Food Cost Hackathon by clicking here. Sign up and start the course. You will find the data set as PARTICIPANTS_DATA_FINAL in the Attachments section.
Having trouble finding the data set? Click here to go through a tutorial that will help you. We will be using the same example as in the K-Fold Cross Validation tutorial.
Grid Search is a simple algorithm that allows us to test the effect of different parameters on the efficiency of a model by passing multiple parameters to cross-validation and testing each combination for a score.
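To make the idea concrete, here is a minimal sketch of what Grid Search does under the hood: enumerate every combination of candidate values, cross-validate each one, and keep the best. The data here is synthetic and the parameter values are illustrative, not the ones used later in this article.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# A small synthetic regression problem standing in for real data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Two candidate values per parameter -> 2 * 2 = 4 combinations in total
grid = {'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]}

best_score, best_params = float('-inf'), None
for values in product(*grid.values()):          # every combination
    params = dict(zip(grid, values))
    model = GradientBoostingRegressor(n_estimators=50, random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()   # cross-validate it
    if score > best_score:                       # keep the best so far
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

GridSearchCV automates exactly this loop, and adds parallelism and bookkeeping on top of it.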
Let’s Code!
Loading And Cleaning the Data
Since there is already a tutorial dedicated to solving the hackathon, we will jump directly to implementing Grid Search for parameter tuning.
Click here to check out the tutorial for cleaning and preprocessing the data. Follow that tutorial up to and including the Data Preprocessing stage.
Implementing Grid Search CV
After data preprocessing we will proceed to modelling. From K-Fold cross-validation we already have a best-fitting algorithm for our model: the GradientBoostingRegressor.
By performing K-Fold Cross Validation on three popular algorithms with the given data, we got the best score with Gradient Boosting Algorithm. The code snippet is given below:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(loss='huber', learning_rate=0.07, n_estimators=350,
                                max_depth=6, subsample=1, verbose=False)
gbr.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
GB_accuracies = cross_val_score(estimator=gbr, X=X_train, y=Y_train, cv=10)
print("Mean_GB_Acc : ", GB_accuracies.mean())
Output
Mean_GB_Acc : 0.7256236594127805
Now the task is to identify the right parameters for our model. Let us have a look at the Gradient Boosting Regressor. If you check out the library documentation here, you will see a long signature with a lot of parameters, as shown below:
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
Of course, we are not going to use all these parameters, only a few important ones that can really impact the predictions. For now we will focus on 'loss', 'learning_rate', 'n_estimators', 'max_depth' and 'subsample'.
Now, we have 5 parameters, and we have to pass a value as an argument for each of them when initializing the gradient boosting regressor. Trying out different values by hand is simply out of the question, as there are numerous combinations to try; in fact, systematically trying those combinations is exactly what Grid Search will carry out for you.
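To see how quickly the number of combinations grows, a bit of arithmetic helps. The counts below are hypothetical, just to illustrate the multiplication:

```python
# Suppose we wanted to try 2 losses, 4 learning rates, 3 n_estimators
# values, 2 depths and 1 subsample value (illustrative numbers only).
combinations = 2 * 4 * 3 * 2 * 1
fits = combinations * 10  # each combination is cross-validated 10-fold
print(combinations, fits)  # 48 combinations -> 480 model fits
```

Every extra candidate value multiplies, rather than adds to, the total work, which is why the grids below are kept deliberately small.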
Let’s do some tuning on GradientBoostingRegressor so that we get a better score.
Grid Search is available in scikit-learn's model_selection package.
Importing the library
from sklearn.model_selection import GridSearchCV
Initializing the parameters
params = [{'loss': ['ls', 'huber'], 'learning_rate': [0.05, 0.07, 0.15, 0.2],
           'n_estimators': [200], 'max_depth': [5], 'subsample': [1]},
          {'loss': ['ls', 'huber'], 'learning_rate': [0.05, 0.07, 0.2],
           'n_estimators': [350], 'max_depth': [6], 'subsample': [1]},
          {'loss': ['ls', 'huber'], 'n_estimators': [100], 'learning_rate': [0.1],
           'max_depth': [4], 'subsample': [1]}]
In the above code block, we initialize the different combinations of parameters we want to try, as a list of dictionaries with the parameter names as keys and lists of candidate values as values. Each dictionary defines its own sub-grid, and Grid Search will try every combination within each of them.
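If you want to see exactly which candidates a grid like this expands to, scikit-learn's ParameterGrid accepts the same list-of-dicts format. A quick check with the grid above:

```python
from sklearn.model_selection import ParameterGrid

params = [{'loss': ['ls', 'huber'], 'learning_rate': [0.05, 0.07, 0.15, 0.2],
           'n_estimators': [200], 'max_depth': [5], 'subsample': [1]},
          {'loss': ['ls', 'huber'], 'learning_rate': [0.05, 0.07, 0.2],
           'n_estimators': [350], 'max_depth': [6], 'subsample': [1]},
          {'loss': ['ls', 'huber'], 'n_estimators': [100], 'learning_rate': [0.1],
           'max_depth': [4], 'subsample': [1]}]

# The three sub-grids contribute 2*4=8, 2*3=6 and 2*1=2 candidates
candidates = list(ParameterGrid(params))
print(len(candidates))  # 16 candidate settings in total
```

With 10-fold cross-validation, those 16 candidates mean 160 model fits, which is worth knowing before you press run.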
Initializing the Grid Search Cross Validator
gs = GridSearchCV(estimator = gbr, param_grid = params, scoring = 'explained_variance', cv = 10, n_jobs = -1)
In the above code block, we initialize the Grid Search Cross Validator by specifying our model and the parameters that we initialized earlier along with a few other parameters as detailed below:
n_jobs: Number of jobs to run in parallel. -1 means using all processors.
scoring: Strategy to evaluate the predictions on the test set. Click here to check all available strategies.
cv: Determines the cross-validation splitting strategy. An integer specifies the number of folds in a K-Fold (here, 10).
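Since the hackathon data takes a while to prepare, here is a self-contained miniature of the same call pattern on synthetic data, so you can try the API in isolation. The estimator settings and grid here are illustrative, not the article's actual ones:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train
X, y = make_regression(n_samples=150, n_features=4, noise=5, random_state=42)

toy_params = {'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]}

toy_gs = GridSearchCV(estimator=GradientBoostingRegressor(n_estimators=50,
                                                          random_state=0),
                      param_grid=toy_params,
                      scoring='explained_variance', cv=3, n_jobs=-1)
toy_gs.fit(X, y)
print(toy_gs.best_params_, round(toy_gs.best_score_, 3))
```

The real run below is identical in shape, only with the full grid, 10 folds and the preprocessed hackathon data.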
Fitting the data
gs.fit(X_train,Y_train)
Printing the best score and best parameters
print("Best Score : ", gs.best_score_)
print("Best Parameters : ", gs.best_params_)
On executing all the above code blocks, I received the following output :
Best Score : 0.727682231131411
Best Parameters : {'learning_rate': 0.05, 'loss': 'ls', 'max_depth': 6, 'n_estimators': 350, 'subsample': 1}
As you can observe, the score returned by Grid Search is slightly better than the score we got with K-Fold Cross Validation alone, and the parameters for which we got the best score are learning_rate = 0.05, loss = 'ls', max_depth = 6, n_estimators = 350 and subsample = 1.
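One convenient detail worth knowing: with the default refit=True, GridSearchCV refits the winning combination on the whole training set and exposes it as best_estimator_, ready for prediction. A small self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=3, noise=5, random_state=1)

gs = GridSearchCV(GradientBoostingRegressor(n_estimators=40, random_state=0),
                  {'max_depth': [2, 3]}, cv=3)
gs.fit(X, y)

# best_estimator_ is already refitted on all of X, y,
# so it can be used for prediction directly
preds = gs.best_estimator_.predict(X[:5])
print(preds.shape)  # (5,)
```

So there is no need to re-initialize the regressor with the winning parameters by hand before making predictions on the test set.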
Note :
One important factor to note is that even though Grid Search is one of the most popular techniques for finding the best parameters, it comes at a cost. Since it trains and cross-validates the model for every combination of parameters, the overall training time increases, which is inevitable and becomes a big hurdle when handling huge datasets.
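One common way to tame that cost (not used in this article, mentioned only as a pointer) is scikit-learn's RandomizedSearchCV, which samples a fixed number of combinations instead of exhaustively trying all of them. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=150, n_features=4, noise=5, random_state=0)

param_dist = {'learning_rate': [0.05, 0.07, 0.1, 0.15, 0.2],
              'max_depth': [2, 3, 4, 5, 6]}

# n_iter=5 fits only 5 of the 25 possible combinations
rs = RandomizedSearchCV(GradientBoostingRegressor(n_estimators=50,
                                                  random_state=0),
                        param_distributions=param_dist,
                        n_iter=5, cv=3, random_state=0, n_jobs=-1)
rs.fit(X, y)
print(rs.best_params_)
```

You trade a guarantee of finding the grid's best combination for a search that scales to much larger parameter spaces.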