Last updated February 15, 2021
In AI Mysteries

Implementing PCA In R With MachineHack’s How To Choose The Perfect Beer Hackathon

Published on May 14, 2019

by Amal Nair

Dimensionality Reduction is an important and necessary step when we have big data in hand with so many features. When there are so many features or columns, it is hard to understand the correlation between them. Including weak links or correlations in training-data can also result in an inaccurate prediction by the model. Dimensionality Reduction takes care of this situation by removing unnecessary features thus helping in fitting the model with the right and most relevant features.

Principal Component Analysis is one of the most sought after Dimensionality Reduction techniques in Machine Learning. In one of our previous articles, we saw how PCA can be implemented in Python.

In this article, we will learn to implement the PCA in R programming. Readers are expected to have some familiarity with the programming language.

Here are some good reads on Dimensionality Reduction :

For implementing PCA in R we will use the How To Choose The Perfect Beer dataset from MachineHack. To get the dataset head to MachneHack, sign up and download the datasets from the Attachment Section.

Having trouble finding the data set? Click here to go through the tutorial to help yourself.

Lets Code!

Importing the Dataset

We will be using a smaller subset of the original dataset and it has been preprocessed to some extent. To know more about processing please check out Data Preprocessing With R: Hands-On Tutorial.

dataset = read.csv('Beer_dataset.csv')

The Dataset is as shown below :

Creating Training set and Test set

library(caTools) set.seed(123) split = sample.split(dataset$Score, SplitRatio = 0.80) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE)
After executing the above code block, we will get two distinct dataframes, the training_set and the test_set.

Without further Preprocessing we will directly fit the model to a Linear Regressor

Initialising the Model and Fitting the Training Data:

# Multiple Linear Regressor mlr = lm(formula = Score ~ . , data = training_set ) print(mlr)
Output:

Predicting for the Test Set

predy = predict(mlr, newdata = test_set)

This code block will result in an array consisting of the predictions.

Evaluating The Performance

actuals_and_preds <- data.frame(cbind(actuals=test_set$Score, predicteds = predy))

The above code block will result in a dataframe consisting of the predictions and the true or actual test data.

Output :

To evaluate we will consider a more basic MIN-MAX Accuracy.

min_max_accuracy <- mean(apply(actuals_and_preds, 1, min) / apply(actuals_and_preds, 1, max))print(min_max_accuracy)

Output:

0.7055808

Implementing PCA

Importing Necessary Libraries

install.packages("caret") # Execute Once library(caret) install.packages("e1071") # Execute Once library(e1071)

Initializing PCA And Fitting Data

pca = preProcess(training_set[-9], method = 'pca', pcaComp = 2)

The above code block initializes a pca object and fits the training data. The parameter pcaComp refers to the number of principal components you want the model to return. Here the model will return 2 Principal Components.

Generating the Principal Components

training_set.pca = predict(pca, training_set) test_set.pca = predict(pca, test_set)

This code block will create two new data sets with only the Principal Components and the Actual dependent factor (Score) as columns/features.

Reordering the Columns

training_set.pca = training_set.pca[c(2,3,1)] test_set.pca = test_set.pca[c(2,3,1)]

Output:

training_set.pca:

test_set.pca:

Using Linear Regression to Predict From The Principal Components

mlr_pca = lm(formula = Score ~ . , data = training_set.pca)print(mlr_pca)

Output:

Predicting With The Principal Components

predy_pca = predict( mlr_pca, newdata = test_set.pca)

Measuring The Accuracy

actuals_and_preds_pca <- data.frame(cbind(actuals= test_set.pca$Score, predicteds = predy_pca))

min_max_accuracy_pca <- mean(apply(actuals_and_preds_pca, 1, min) / apply(actuals_and_preds_pca, 1, max))print(min_max_accuracy_pca)
Output

0.7301324

Conclusion

Comparing the accuracies from the Linear Regression Models, we can see a slight improvement in the accuracy in the model that was fitted with the principal components. The model actually improved with only two features predicting as compared to the original dataset’s 8 independent features. Thus Dimensionality Reduction with PCA holds a significant role in Data Science especially considering the vastness of the data sets that needs to be processed.

Access all our open Survey & Awards Nomination forms in one place >>

Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com

Download our Mobile App

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

3 Ways to Join our Community

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox

Recent Stories

Impact of Lok Sabha Election on India's AI Progress

India is Making its Own AI Servers

Pritam Bordoloi

PLI scheme marks the beginning of India ‘s manufacturing venture

GPT-5 Likely to be Released After the US Elections

Donna Eva

Generative AI Jobs in India can Fetch You up to Rs 1 Crore

Siddharth Jindal

Top Editorial Picks

Elon Musk Set to Meet Indian Spacetech Startups During Upcoming Visit

Shyam Nandan Upadhyay

Happiest Minds Technologies Acquires Macmillan Learning India, Expands Edutech Reach

Shritama Saha

Meta Releases Llama 3, Beats Claude 3 Sonnet and Gemini Pro 1.5

Mohit Pandey

Nothing Becomes the First Smartphone Company to Integrate OpenAI’s ChatGPT

Siddharth Jindal

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Featured

Enhancing AI Integration through Optimal Data Management in the Global Convenience Food and Beverage Sector

Through the implementation of advanced data management methodologies, resilient data observability solutions, and cutting-edge AI frameworks, Course5 is spearheading the