MITB Banner

Implementing PCA In R With MachineHack’s How To Choose The Perfect Beer Hackathon

Share

Dimensionality Reduction is an important and necessary step when we have big data in hand with so many features. When there are so many features or columns, it is hard to understand the correlation between them. Including weak links or correlations in training-data can also result in an inaccurate prediction by the model.  Dimensionality Reduction takes care of this situation by removing unnecessary features thus helping in fitting the model with the right and most relevant features.

Principal Component Analysis is one of the most sought after Dimensionality Reduction techniques in Machine Learning. In one of our previous articles, we saw how PCA can be implemented in Python.

In this article, we will learn to implement the PCA in R programming. Readers are expected to have some familiarity with the programming language.

Here are some good reads on Dimensionality Reduction :

For implementing PCA in R we will use the How To Choose The Perfect Beer dataset from MachineHack. To get the dataset head to MachneHack, sign up and download the datasets from the Attachment Section.

Having trouble finding the data set? Click here to go through the tutorial to help yourself.

Lets Code!

Importing the Dataset

We will be using a smaller subset of the original dataset and it has been preprocessed to some extent. To know more about processing please check out Data Preprocessing With R: Hands-On Tutorial.

dataset = read.csv('Beer_dataset.csv')

The Dataset is as shown below :

Creating Training set and Test set

library(caTools)
set.seed(123)
split = sample.split(dataset$Score, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

After executing the above code block, we will get two distinct dataframes, the training_set and the test_set.

Without further Preprocessing we will directly fit the model to a Linear Regressor

Initialising the Model and Fitting the Training Data:

# Multiple Linear Regressor
mlr = lm(formula = Score ~ . , data = training_set )
print(mlr)

Output:

Predicting for the Test Set

predy = predict(mlr, newdata = test_set)

This code block will result in an array consisting of the predictions.

Evaluating The Performance

actuals_and_preds <- data.frame(cbind(actuals=test_set$Score, predicteds = predy))

The above code block will result in a dataframe consisting of the predictions and the true or actual test data.

Output :

To evaluate we will consider a more basic MIN-MAX Accuracy.

min_max_accuracy <- mean(apply(actuals_and_preds, 1, min) / apply(actuals_and_preds, 1, max))print(min_max_accuracy)

Output:

0.7055808

Implementing PCA

Importing Necessary Libraries

install.packages("caret") # Execute Once
library(caret)
install.packages("e1071") # Execute Once
library(e1071)

Initializing PCA And Fitting Data

pca = preProcess(training_set[-9], method = 'pca', pcaComp = 2)

The above code block initializes a pca object and fits the training data. The parameter pcaComp refers to the number of principal components you want the model to return. Here the model will return 2 Principal Components.

Generating the Principal Components

training_set.pca = predict(pca, training_set)
test_set.pca = predict(pca, test_set)

This code block will create two new data sets with only the Principal Components and the Actual dependent factor (Score) as columns/features.

Reordering the Columns

training_set.pca = training_set.pca[c(2,3,1)]
test_set.pca = test_set.pca[c(2,3,1)]

Output:

training_set.pca:

test_set.pca:

Using Linear Regression to Predict From The Principal Components

mlr_pca = lm(formula = Score ~ . , data = training_set.pca)print(mlr_pca)

Output:

Predicting With The Principal Components

predy_pca = predict( mlr_pca, newdata = test_set.pca)

Measuring The Accuracy

actuals_and_preds_pca <- data.frame(cbind(actuals= test_set.pca$Score, predicteds = predy_pca))

min_max_accuracy_pca <- mean(apply(actuals_and_preds_pca, 1, min) / apply(actuals_and_preds_pca, 1, max))print(min_max_accuracy_pca)
Output

0.7301324

Conclusion

Comparing the accuracies from the Linear Regression Models, we can see a slight improvement in the accuracy in the model that was fitted with the principal components. The model actually improved with only two features predicting as compared to the original dataset’s 8 independent features. Thus Dimensionality Reduction with PCA holds a significant role in Data Science especially considering the vastness of the data sets that needs to be processed.

Picture of Amal Nair

Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com

Download our Mobile App

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
Recent Stories

Featured

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

AIM Conference Calendar

Immerse yourself in AI and business conferences tailored to your role, designed to elevate your performance and empower you to accomplish your organization’s vital objectives. Revel in intimate events that encapsulate the heart and soul of the AI Industry.

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed