The popularity of the R language has grown rapidly over the past few years, and R is now widely used in data science and machine learning. In this article, we list the top 10 R packages for data science and machine learning.
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables. It is a powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data; it is sufficient for typical graphics needs and flexible enough to handle most nonstandard requirements.
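As a minimal sketch of what "conditioned on one or more other variables" means in practice, the example below uses the built-in mtcars data set (chosen here only for illustration) and draws one panel per cylinder count:

```r
library(lattice)

# Scatter plot of fuel economy against weight, conditioned on cylinder count:
# the formula y ~ x | g produces one panel per level of g, with shared axes.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

print(p)  # lattice plots are objects; print() draws them
```

The vertical bar in the formula is what introduces the conditioning variable; adding a second one (for example `| factor(cyl) * factor(am)`) would produce a grid of panels.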
Exploratory Data Analysis (EDA) is the initial and important phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most data handling and visualization, so that users can focus on studying the data and extracting insights.
The package can be installed directly from CRAN. To install it, run install.packages("DataExplorer") in an R session.
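The description above matches the DataExplorer package on CRAN, so the sketch below assumes that is the package meant. It uses the built-in airquality data set (an arbitrary choice for illustration) to show the kind of first look the package automates:

```r
# install.packages("DataExplorer")  # one-time install from CRAN
library(DataExplorer)

# Basic structure of the data: rows, columns, missing values, and so on.
info <- introduce(airquality)
print(info)

plot_missing(airquality)               # share of missing values per column
plot_histogram(airquality)             # histograms of all continuous columns
plot_correlation(na.omit(airquality))  # correlation heatmap on complete rows
```

The package also provides create_report(), which bundles plots like these into a single HTML report; it is left out here because it writes a file and needs rmarkdown.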
DALEX is an R library with tools that help explain how complex models work. The package contains various explainers that help to understand the link between input variables and model output. For example, the single_variable() explainer extracts the conditional response of a model as a function of a single selected variable.
To install from CRAN, run install.packages("DALEX") in an R session.
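The following is a hedged sketch of the explainer workflow, not a definitive recipe. It assumes a recent DALEX release (1.0 or later), where the single-variable response is obtained with model_profile() rather than the single_variable() function named above, and it uses the apartments data set shipped with the package. The choice of a linear model is arbitrary.

```r
# install.packages("DALEX")  # one-time install from CRAN
library(DALEX)

predictors <- c("construction.year", "surface", "floor", "no.rooms")

# Any fitted model can be explained; a linear model keeps the example small.
model <- lm(m2.price ~ construction.year + surface + floor + no.rooms,
            data = apartments)

# An explainer wraps the model, the data, and the observed response
# behind one uniform interface.
explainer <- explain(model,
                     data = apartments[, predictors],
                     y = apartments$m2.price,
                     label = "linear model",
                     verbose = FALSE)

# Model response as a function of one selected variable.
profile <- model_profile(explainer, variables = "surface")
plot(profile)
```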
dplyr is a powerful R package for transforming and summarising tabular data with rows and columns. The package contains a set of functions (or “verbs”) that perform common data manipulation operations such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarising data. In addition, dplyr supports another common pattern, “split-apply-combine”: split the data into groups, apply a function to each group, and combine the results. dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
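The verbs chain together naturally. The sketch below runs them in sequence on the built-in mtcars data set; the column choices and the conversion factor to kilometres per litre are illustrative only:

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                          # filter rows
  select(mpg, cyl, hp, wt) %>%                  # select specific columns
  mutate(kpl = mpg * 0.4251) %>%                # add a new column
  group_by(cyl) %>%                             # split into groups
  summarise(n = n(), mean_kpl = mean(kpl)) %>%  # apply, then combine
  arrange(desc(mean_kpl))                       # re-order rows

print(result)
```

Here group_by() followed by summarise() is the split-apply-combine step: one output row per cylinder count.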
The purpose of this R package is to let you explore your data quickly and extract the information it holds. It lets you explore your data interactively by visualizing it with the ggplot2 package: you can draw bar graphs, curves, scatter plots and histograms, then export the graph or retrieve the code that generates it.
To install from CRAN, run install.packages("esquisse") in an R session.
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models for complex regression and classification problems. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling and variable importance estimation, as well as other functionality. The package utilises a number of R packages but tries not to load them all at start-up; by removing formal package dependencies, the package start-up time can be greatly decreased.
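A minimal sketch of that workflow, assuming a plain linear regression on the built-in mtcars data (both picked only to keep the example dependency-free), might look like this:

```r
library(caret)

set.seed(42)  # make the split and the resampling reproducible

# Data splitting: roughly 80/20, balanced on the outcome distribution.
idx <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

# Pre-processing and resampling are declared alongside the model.
fit <- train(mpg ~ wt + hp, data = train_set,
             method = "lm",
             preProcess = c("center", "scale"),
             trControl = trainControl(method = "cv", number = 5))

pred <- predict(fit, newdata = test_set)
print(postResample(pred, test_set$mpg))  # RMSE, R-squared, MAE on held-out rows
print(varImp(fit))                       # variable importance estimate
```

Swapping `method = "lm"` for another supported method name changes the model while the rest of the code stays the same; caret loads the backing package only when that method is requested.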
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimised for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff. The main janitor functions perfectly format data.frame column names; create and format frequency tables of one, two, or three variables (think an improved table()); and isolate partially duplicate records.
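The three tasks map onto clean_names(), tabyl() and get_dupes(). The sketch below applies them to a small made-up data frame (the names and values are invented for illustration):

```r
library(janitor)

# A deliberately messy data frame; check.names = FALSE keeps the ugly headers.
raw <- data.frame(
  "First Name" = c("Ann", "Bob", "Ann", "Cy"),
  "Dept."      = c("ops", "dev", "ops", "dev"),
  "START YEAR" = c(2019, 2020, 2021, 2020),
  check.names = FALSE
)

clean <- clean_names(raw)   # column names become first_name, dept, start_year
print(names(clean))

print(tabyl(clean, dept))   # one-way frequency table with percentages

# Partial duplicates: rows that agree on first_name and dept
# even though start_year differs.
dupes <- get_dupes(clean, first_name, dept)
print(dupes)
```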
The rpart code builds classification or regression models of a very general structure using a two-stage procedure; the resulting models can be represented as binary trees. The package implements many of the ideas found in the CART (Classification and Regression Trees) book and programs of Breiman, Friedman, Olshen, and Stone. Because CART is the trademarked name of a particular software implementation of these ideas, and tree was used for the S-Plus routines of Clark and Pregibon, a different acronym was chosen: Recursive PARTitioning, or rpart.
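As a rough illustration, the sketch below grows a classification tree on the built-in iris data and then prunes it using the cross-validated error from the complexity table. Picking the complexity parameter with the lowest cross-validated error is one common rule, assumed here for simplicity rather than prescribed by the package:

```r
library(rpart)

set.seed(1)  # the cross-validation inside rpart() is random

# Stage one: grow the tree. method = "class" fits a classification tree;
# method = "anova" would fit a regression tree instead.
fit <- rpart(Species ~ ., data = iris, method = "class")

print(fit)    # text view of the binary splits
printcp(fit)  # complexity table with cross-validated error (xerror)

# Stage two: prune back to the complexity parameter with the lowest xerror.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

pred <- predict(pruned, iris, type = "class")
print(table(predicted = pred, actual = iris$Species))
```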
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.
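A minimal sketch of the R interface follows. The daily series is synthetic and made up purely for illustration (a linear trend plus weekly and yearly cycles plus noise); with real data, only the two-column layout matters:

```r
library(prophet)

# Prophet expects a data frame with columns `ds` (date) and `y` (value).
set.seed(1)
ds <- seq(as.Date("2020-01-01"), by = "day", length.out = 730)
t  <- seq_along(ds)
y  <- 10 + 0.01 * t +                # trend
  2 * sin(2 * pi * t / 7) +          # weekly cycle
  3 * sin(2 * pi * t / 365.25) +     # yearly cycle
  rnorm(length(t), sd = 0.5)         # noise
df <- data.frame(ds = ds, y = y)

m <- prophet(df)                                   # fit the additive model
future <- make_future_dataframe(m, periods = 30)   # history plus 30 future days
forecast <- predict(m, future)

print(tail(forecast[, c("ds", "yhat", "yhat_lower", "yhat_upper")]))
```

The forecast data frame carries the point forecast (yhat) together with an uncertainty interval; plot(m, forecast) draws both against the history.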