Why Data Analysis & Not Math Is A Prerequisite For Machine Learning

If you are an absolute beginner in machine learning and are wondering whether data analysis is a prerequisite, here is the hard fact: data analysis, meaning the work of gathering, cleaning, exploring and visualizing data, is an absolute must before one gets started on machine learning.

However, let’s also get one thing clear: machine learning is as much about linear algebra, probability theory and statistics (especially graphical models) and information theory as it is about data analysis. Data analysis nonetheless forms an important part of that understanding. ML algorithms are applied to real-world data, and since data rarely comes in a structured, labeled format, you won’t get far with algorithms without a knowledge of data processing. According to a section of ML practitioners, data science and machine learning are essentially two sides of the same field.

Let’s see how data analysis will help you level up on ML

  • First, you won’t be able to build a good enough model if you don’t have solid data analysis skills
  • Even if you use packaged tools like Python’s scikit-learn, which perform the hard math for you, you need a solid understanding of the data to make these tools work effectively. Without a firm grasp of exploratory data analysis and data visualization, you can’t get far in machine learning
  • Even to apply tools such as caret and scikit-learn, you’ll need to be able to gather, prepare, and explore your data. That requires a solid understanding of data analysis

Let’s enumerate how one can use data science as a platform to dive into the basics of machine learning

1) 80% of data science work involves data prep

By now, it is commonly said that about 80% of data science work involves data preparation, EDA, and visualization. For most data scientists, data organization and manipulation remain much-needed skills, even when the machine learning algorithms themselves are implemented with a library such as scikit-learn.

This means that when one is building machine learning models, roughly 80% of the time will be spent gathering data, exploring it, cleaning it, and analyzing results with data visualization.
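To make the gather, explore and clean steps concrete, here is a minimal sketch using pandas. The table, its column names and the choice to fill gaps with the median are all assumptions made for illustration; real projects will need their own cleaning rules.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems: a duplicate row,
# missing values, and inconsistently written text labels.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 47],
    "city": ["Pune", "delhi ", "Delhi", "delhi ", None],
    "income": [30000.0, 52000.0, 41000.0, 52000.0, np.nan],
})

# Explore: count missing values per column before changing anything.
missing_before = raw.isna().sum()

# Clean: drop exact duplicate rows, then normalize the text labels.
clean = raw.drop_duplicates().copy()
clean["city"] = clean["city"].str.strip().str.title()

# Fill numeric gaps with the column median (one simple choice among many).
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

# Drop rows that still lack a city, since that label cannot be guessed.
clean = clean.dropna(subset=["city"]).reset_index(drop=True)

print(missing_before)
print(clean)
```

Even in this toy case, most of the lines deal with the data rather than with any model, which is the point the 80% figure is making.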

2) Knowing how to manipulate data is critical

For beginning ML practitioners, manipulating data is more critical than understanding the math underlying the algorithm. While linear algebra is the building block of machine learning and the key to understanding the statistics applied in ML, most data science practitioners have only a working understanding of calculus or linear algebra.

However, they are excellent data analysts and usually lean towards the minimum requirement of math, filling in the gaps on the job. According to a data science practitioner from the financial sector, if you want to be able to write an algorithm from scratch, you need a very strong understanding of linear algebra. If you want to be a data science practitioner, on the other hand, you don’t need a high-level knowledge of calculus to understand how an algorithm behaves.

In the long run advanced math is an absolute must, but in the short term one should focus on the data visualization and data manipulation stack in R or Python.

These are the most widely recommended packages to get started with visualization, wrangling and analysis:

R: ggplot2, dplyr, tidyr, stringr

Python: numpy, pandas, matplotlib, seaborn

3) Before you dive into ML, you need to master visualization

The job description of an entry-level data scientist involves a lot of data aggregation and data visualization, both of which feed directly into exploratory data analysis. Professionals who prefer R can learn ggplot2 for data visualization, including basic visualizations like scatterplots, histograms and bar charts, and then learn how to use ggplot2 and dplyr together for exploratory data analysis. Python users can learn to use pandas together with a plotting library such as matplotlib or seaborn for exploratory data analysis.
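As a rough illustration of aggregation and visualization working together, the sketch below groups a small table with pandas and draws a bar chart and a histogram with matplotlib. The sales figures, column names and output file name are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "units": [10, 7, 14, 5, 9, 6],
})

# Aggregate: total units per region, largest first.
totals = sales.groupby("region")["units"].sum().sort_values(ascending=False)

# Visualize: a bar chart of the aggregate next to a histogram of raw values.
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
totals.plot(kind="bar", ax=ax_bar, title="Units by region")
sales["units"].plot(kind="hist", bins=5, ax=ax_hist, title="Distribution of units")
fig.tight_layout()
fig.savefig("eda_overview.png")  # hypothetical output file name

print(totals)
```

The same pattern (group, summarize, plot) carries over to ggplot2 and dplyr in R.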

4) Linear algebra is described as the workhorse of Machine Learning

That said, linear algebra is important if you want to understand the inner workings of machine learning and gradient descent. One can’t emphasize enough the importance of grasping the essential concepts of statistics and probability, given how often machine learning is dubbed statistical learning.

The field is so vast that it is difficult to follow a focused learning plan, and most entry-level data scientists struggle to cover all the essential concepts in a short span of time. A deeper understanding of the algorithms requires statistics and stochastic processes. This is the point where it becomes difficult, since these in turn require knowledge of calculus and linear algebra.

For an absolute beginner it can be difficult to understand all of these aspects at once, and that’s why a foundation in data analysis can help one build machine learning models that work. One must also remember that during a machine learning workflow, the experience gained from exploratory data analysis serves as an input to the “data transformation” step.
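To show where linear algebra enters, here is a small sketch of gradient descent for least-squares linear regression written with NumPy matrix operations. The synthetic data, the learning rate and the iteration count are assumptions chosen so the example converges; it is not a recipe for real problems.

```python
import numpy as np

# Synthetic, noiseless data: y = 2 - 3*x, with a leading column of ones
# so the intercept is learned as an ordinary weight.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)        # start from all-zero weights
learning_rate = 0.1    # assumed step size, small enough to converge here

for _ in range(2000):
    residual = X @ w - y                   # prediction error per sample
    grad = 2 * X.T @ residual / len(y)     # gradient of the mean squared error
    w -= learning_rate * grad              # step against the gradient

print(w)
```

Every step in the loop is a matrix-vector product or a transpose, which is why the notation and operations of linear algebra are worth learning before reading algorithm descriptions.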
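One way EDA can feed the data transformation step is sketched below: a skewness check on a feature suggests a log transform before modelling. The values and the feature name are hypothetical, and the log transform is only one of several possible responses to skewed data.

```python
import numpy as np
import pandas as pd

# Hypothetical feature whose values span several orders of magnitude.
spend = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], name="monthly_spend")

# EDA finding: the raw feature is strongly right-skewed.
skew_before = spend.skew()

# Transformation chosen because of that finding: log(1 + x) compresses
# the long right tail and also tolerates zeros.
log_spend = np.log1p(spend)
skew_after = log_spend.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```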

Outlook

Not everybody has a rigorously quantitative background to work their way through the math required for machine learning. Given the rising interest in the field and a lack of formal training, most beginners who follow the self-learning path find it challenging and frustrating to master the concepts completely. That’s why beginners can use data analysis as a platform to dive into machine learning without completely mastering linear algebra or calculus.

Meanwhile, here’s a guide to ML by Jason Brownlee where he talks about how to get a handle on linear algebra for ML. According to Brownlee, there are a minimum of three topics one must cover: a) notation, which allows one to piece things together; b) operations, which means learning how to perform simple operations such as multiplying and transposing matrices; and c) matrix factorization, which requires a deep dive into concepts like SVD and QR. This forms the bedrock of machine learning.

Besides, don’t forget to brush up on the basics with these books on ML: The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, and Information Theory, Inference, and Learning Algorithms by David MacKay. For linear algebra, check out Linear Algebra, Theory and Applications by Kuttler.
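The three topics can be tried out in a few lines of NumPy. This is a minimal sketch with made-up matrices, not material taken from Brownlee’s guide.

```python
import numpy as np

# Notation: A is a 3x2 matrix, B is a 2x2 matrix.
A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 1.0]])
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])

# Operations: transpose and matrix multiplication.
At = A.T      # shape (2, 3)
AB = A @ B    # shape (3, 2)

# Matrix factorization: QR and (reduced) SVD of A.
Q, R = np.linalg.qr(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Both factorizations multiply back to the original matrix.
print(np.allclose(Q @ R, A))
print(np.allclose(U @ np.diag(s) @ Vt, A))
```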

Richa Bhatia

Richa Bhatia is a seasoned journalist with six years of experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.