When it comes to Machine Learning and Artificial Intelligence, there are only a few top-performing programming languages to choose from. In the previous tutorial, we learned how to do Data Preprocessing in Python. Since R is among the top performers in Data Science, in this tutorial we will learn to perform Data Preprocessing tasks with R.
(Note: The following tutorial will require basic programming knowledge of R.)
In this tutorial, we will learn to perform the following operations on a raw dataset:
- Dealing with missing data
- Dealing with categorical data
- Splitting the dataset into training and testing sets
- Scaling the features
Data Preprocessing in R
The following steps are crucial:
Importing The Dataset
dataset = read.csv('dataset.csv')
As one can see, this is a simple dataset consisting of four columns: three features (nation, age and salary) and the dependent factor, the ‘purchased_item’ column. If the above dataset is to be used for machine learning, the idea will be to predict whether an item was purchased or not depending on the nation, age and salary of a person. Cells with the value ‘NA’ denote missing values in the dataset.
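A quick way to see where the missing values sit is to count the NA cells per column. The sketch below uses a small, hypothetical stand-in for dataset.csv with made-up values; with the real file, the same calls would be applied to the dataset object returned by read.csv.

```r
# Hypothetical stand-in for dataset.csv; the values are made up for illustration.
toy <- data.frame(
  nation = c("India", "Germany", "Russia", "India", "Germany"),
  age = c(25, NA, 30, 41, 36),
  salary = c(40000, 52000, NA, 61000, NA),
  purchased_item = c("No", "Yes", "No", "Yes", "Yes")
)

str(toy)                          # column types and a preview of the values
na_counts <- colSums(is.na(toy))  # number of missing cells per column
print(na_counts)
```

Columns with a non-zero count are the ones that need the treatment described in the next step.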
Dealing With Missing Values
dataset$age = ifelse(is.na(dataset$age), ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$age)
dataset$salary = ifelse(is.na(dataset$salary), ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$salary)
The above code blocks check for missing values in the age and salary columns and update the missing cells with the column-wise average.
- dataset$column_header: Selects the column in the dataset specified after $ (age and salary).
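- is.na(dataset$column_header): This function returns TRUE for every cell in the specified column that has no value (NA) and FALSE for all other cells.
- ave(dataset$column_header, FUN = function(x) mean(x, na.rm = TRUE)): This function returns a vector of the same length as the column in which every element is the column average. The argument na.rm = TRUE makes mean ignore the missing values.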
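The three pieces described above can be traced on a small vector. This is a minimal sketch with made-up ages, not the tutorial's dataset:

```r
# Made-up values; NA marks a missing cell.
age <- c(25, NA, 30, 41, NA)

missing <- is.na(age)   # FALSE TRUE FALSE FALSE TRUE
# Mean of the non-missing values (25, 30, 41), repeated for every position.
col_mean <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))
# Take the mean where the value is missing, the original value otherwise.
age_filled <- ifelse(missing, col_mean, age)

print(age_filled)       # 25 32 30 41 32
```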
dataset$age = round(dataset$age, 0)
Since we are not interested in having decimal places for age, we round it to the nearest whole number using the above code. The argument 0 in the round function means no decimal places.
After executing the above code blocks, the missing cells in the age and salary columns hold the column averages. Two points are worth noting:
- Unlike Python, where we use NumPy arrays to store the data before performing operations, in R we perform our operations directly on the dataset, which is a data frame.
- We do not need to separate the dependent and independent factors explicitly, since R's modeling functions take a formula that identifies the dependent and independent factors in a dataset.
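To illustrate the second point, a formula names the dependent factor on the left of ~ and the independent factors on the right. The sketch below uses base R only; the formula itself is an assumption about how a model for this dataset might be specified, not something taken from the tutorial.

```r
# Dependent factor on the left of ~, independent factors on the right.
f <- purchased_item ~ nation + age + salary

response <- all.vars(f)[1]     # "purchased_item"
predictors <- all.vars(f)[-1]  # "nation" "age" "salary"
print(response)
print(predictors)
```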
Dealing With Categorical Data
Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, educational level etc.
In our dataset, we have two categorical features: nation and purchased_item. In R, we can use the factor function to convert text into numerical codes.
dataset$nation = factor(dataset$nation, levels = c('India','Germany','Russia'), labels = c(1,2,3))
dataset$purchased_item = factor(dataset$purchased_item, levels = c('No','Yes'), labels = c(0,1))
- factor(dataset$column_header, levels = c(), labels = c()): The factor function converts the categorical values in the specified column to a factor whose labels are the given numerical codes.
- levels: The categories in the column, passed as a vector. Example: c('India','Germany','Russia')
- labels: The numerical codes for the specified categories, in the same order. Example: c(1,2,3)
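The mapping from levels to labels can be checked on a short vector. A minimal sketch with made-up values:

```r
nation <- c("India", "Russia", "Germany", "India")  # made-up values

encoded <- factor(nation,
                  levels = c("India", "Germany", "Russia"),
                  labels = c(1, 2, 3))

print(encoded)                # values 1 3 2 1, with levels 1 2 3
codes <- as.integer(encoded)  # underlying integer codes: 1 3 2 1
print(is.factor(encoded))     # TRUE: the result is a factor, not a numeric vector
```

Any value that does not appear in levels becomes NA, so the levels vector should list every category present in the column.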
Splitting The Dataset Into Training And Testing Sets
We will use the caTools library in R to split our dataset into training_set and test_set.
install.packages('caTools') #install once
library(caTools) # importing caTools library
set.seed(123) # any fixed integer works; 123 is an arbitrary choice for illustration
split = sample.split(dataset$purchased_item, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
- set.seed(): This function fixes the state of the random number generator so that the split is reproducible, i.e., the same seed value gives the same split on every run. It is similar to the random_state argument in Python.
- sample.split(dataset$dependent_factor, SplitRatio = 0.8): This function returns a logical vector with the same length as the original dataset, split in the specified SplitRatio. A value of 0.8 gives 80 percent TRUEs and 20 percent FALSEs. For example, the above code block will assign the variable split the values [TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE]
- subset(dataset, split == TRUE): This function returns the subset of the dataset passed as an argument where split is TRUE (80 percent of the original dataset with the given code).
- subset(dataset, split == FALSE): This function returns the subset of the dataset passed as an argument where split is FALSE (20 percent of the original dataset with the given code).
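The roles of set.seed() and subset() can be seen without caTools. The sketch below builds the logical vector with base R's sample() instead of sample.split(), so it does not preserve the ratio of classes the way sample.split() does. The data frame, the helper make_split and the seed value are all made up for illustration.

```r
toy <- data.frame(id = 1:10, purchased_item = rep(c("No", "Yes"), 5))  # made-up data

# Hypothetical helper: returns a logical vector, TRUE for training rows.
make_split <- function(n, ratio, seed) {
  set.seed(seed)                               # same seed -> same random draw
  picked <- sample(n, size = round(n * ratio)) # row numbers chosen for training
  seq_len(n) %in% picked
}

split_a <- make_split(nrow(toy), 0.8, seed = 123)
split_b <- make_split(nrow(toy), 0.8, seed = 123)
print(identical(split_a, split_b))             # TRUE: the split is reproducible

training_set <- subset(toy, split_a == TRUE)   # 8 of the 10 rows
test_set <- subset(toy, split_a == FALSE)      # the remaining 2 rows
```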
Scaling The Features
training_set[,3:4] = scale(training_set[,3:4])
test_set[,3:4] = scale(test_set[,3:4])
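The scale function in R can be used to scale the features in the dataset. By default it standardizes each column to a mean of 0 and a standard deviation of 1. Here we are only scaling the non-factor columns, age and salary, which are selected by position as columns 3 and 4.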
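The code above standardizes the test set with its own mean and standard deviation. A common alternative, not used in this tutorial, is to reuse the training set's statistics so that both sets are on the same scale. The sketch below shows one way to do that with the attributes that scale() attaches to its result; the data frames are made up.

```r
# Made-up numeric columns standing in for age and salary.
train <- data.frame(age = c(20, 30, 40), salary = c(30000, 50000, 70000))
test <- data.frame(age = c(30, 50), salary = c(50000, 90000))

train_scaled <- scale(train)                   # z-scores from the training mean and sd
center <- attr(train_scaled, "scaled:center")  # training means
spread <- attr(train_scaled, "scaled:scale")   # training standard deviations

# Reuse the training statistics when scaling the test set.
test_scaled <- scale(test, center = center, scale = spread)
print(train_scaled)
print(test_scaled)
```

With this approach, a test value equal to the training mean maps to 0, regardless of what the rest of the test set looks like.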