Tools and techniques are the most interesting aspect of data science today, and mastering them makes a data scientist efficient. From a learning perspective, however, they should not be the only area of focus for anyone who wants to become an effective data scientist. It is important to master additional areas such as problem formulation, design thinking, presentation and communication of insights, and production. All of these areas are imperative, and none can substitute for another.
Today, when we say there is a dearth of data science talent in the market, we are really referring to a shortage of people who combine the skill sets mentioned above. This shortage is also leading to higher failure rates in data science projects. This article talks about the importance of honing all the skill sets that are imperative in a data scientist’s learning path.
Data Science Learning Path Simplified
Let’s try to simplify the ever-growing space of tools and techniques currently pursued in the data science community. I have attempted to classify them on two parameters: robustness and fundamentality.
Robustness can be looked at in terms of application, mathematical rigour and agility under varying data conditions. Fundamentality is about the concept: how core and foundational a technique is to modern techniques. In my experience, some of the recent tools and techniques have shown greater flexibility and robustness and have therefore created new opportunities for application. At the same time, there are many fundamentally important techniques that form the foundation of newer, more advanced ones. They may no longer be used much because of modern data challenges, but they are very important to learn for the correct adoption of the tools and techniques that succeeded them.
In the figures below, the tools and techniques depicted are not exhaustive, and the axis scales are subjective and may change depending on the problem domain.
Each of these tools and techniques is used for a particular problem domain and specific data considerations, and may come with specific mathematical assumptions to be considered when using it. The intent of this post is not to explain the details of these tools and techniques; however, I will highlight the broader purpose and importance of each group. Let us start with machine learning algorithms and then look at specific tools to use.
Firstly, there are so many algorithms available today that it can be overwhelming even for seasoned practitioners. When learning any algorithm, my recommendation is to first look at its basic intuition and then understand how it helps in different problem domains. Once you have a grip on this, look at the mathematics of the algorithm, as it is somewhat important and helps you keep up with new advancements in that algorithm.
All techniques or algorithms can be grouped in many ways:
- based on their learning style from the data
- based on the similar purpose they serve
- based on the mathematical concepts they use
I chose to group the algorithms by the last two options rather than by learning style. However, I will use all three parameters to describe each group of algorithms shown in the figure below.
So let’s get started:
Regression Techniques:
Regression is one of the most fundamental statistical learning techniques in data science; I won’t hesitate to say it’s the father of all supervised learning. Broadly, all regression techniques are a class of supervised learning techniques (the training data has outcome labels to supervise the learning process of the algorithm). The core concept of all regression techniques is modelling the relationship between variables, refining the model at each step of the regression process in order to reduce its prediction error.
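To make the idea of modelling a relationship so as to reduce prediction error concrete, here is a minimal sketch of ordinary least squares for a single input variable, in plain Python. The function names and the toy data are assumptions made up for illustration; they are not from any particular library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the outcome labels and the predictions."""
    residuals = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(residuals) / len(xs)


# Toy data (hypothetical): roughly y = 2x + 1 with a little noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("error:", mean_squared_error(xs, ys, slope, intercept))
```

The outcome labels in `ys` are what make this supervised: the fit is chosen to minimise the squared error against them.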
There have been various advancements in the algorithms that belong to the regression family; a few of the main algorithms in this class are shown below, plotted on the parameters of importance.
Clustering Techniques:
Clustering techniques are primarily unsupervised learning techniques (the training data has no labels to supervise the model). The purpose is to group objects into similar groups based on the patterns and structure identified among a nominated set of variables. The methods revolve around similarity or dissimilarity measures, for example the distance between one object and other objects or cluster centroids. The process is typically either centroid-based or hierarchical (for instance, agglomerative) in nature.
The most popular algorithms, based on fundamentality and robustness, are shown below.
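As an illustration of the centroid-based idea, the following is a minimal k-means sketch in plain Python. It assumes squared Euclidean distance and, purely to keep the example deterministic, initialises the centroids with the first k points; practical implementations usually pick starting centroids more carefully. The function name and the data are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: the first k points serve as the initial centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (hypothetical): two visibly separate groups, with no labels given
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that no outcome labels are passed in; the grouping comes only from the distances between the points.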
Decision Tree Techniques:
These are used mainly for supervised learning models. The algorithms build models of decision rules based on the actual values of variables in the dataset. Decision models split the data in a tree structure until a prediction (classification or regression) is made. The process is iterative: at each step of tree creation, an impurity measure (such as entropy) is computed to decide the next split.
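The split decision can be sketched as follows. This is a minimal illustration in plain Python, assuming Shannon entropy as the measure and a single numeric threshold per split; the function names and data are hypothetical, and real tree learners evaluate many candidate splits and may use other measures.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index, threshold):
    """Reduction in entropy obtained by splitting on row[feature_index] <= threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[feature_index] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature_index] > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Toy data (hypothetical): one numeric feature and a yes/no outcome
rows = [(1.0,), (2.0,), (8.0,), (9.0,)]
labels = ["no", "no", "yes", "yes"]

for threshold in (1.5, 5.0):
    print(threshold, information_gain(rows, labels, 0, threshold))
```

A tree builder would pick the threshold with the highest gain, split the data there, and repeat the same computation on each branch.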
Dimension Reduction Techniques:
These are similar to clustering techniques, but unlike clustering they are usually applied to reduce the number of variables rather than to group records or cases. The purpose is to simplify the data by describing it with fewer dimensions. The concept is similar to clustering but more advanced, as it seeks and exploits the inherent structure in the data to produce a reduced set of dimensions while retaining as much information in those dimensions as possible.
These techniques are useful for visualising and summarising complex information, and are often used as an intermediate analysis step when building supervised models. Some of the most commonly used and recently developed techniques are shown below.
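One way to picture this is principal component analysis. The sketch below, in plain Python, finds only the first principal component by power iteration on the sample covariance matrix and projects each record onto it, turning two variables into one. It is a simplified illustration under stated assumptions (non-constant data, a fixed number of iterations, a hypothetical function name), not a full PCA implementation.

```python
def first_principal_component(rows, iterations=100):
    """Return the direction of greatest variance and each row's score along it."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the variables
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue (assumes the data is not constant)
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    # One score per record: the data expressed in a single dimension
    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores


# Toy data (hypothetical): two strongly correlated variables
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Because the two variables move together, a single score per record keeps most of the variation that the original pair of columns carried.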
Neural Network and Deep Learning Techniques:
This is a very popular family of techniques inspired by the structure of the biological (human) brain (its neural network). In my view, they are mainly pattern-matching techniques, useful for both regression (continuous outcome) and classification (discrete outcome) problems, and used in both supervised and unsupervised learning models. They are special because they can model complex relationships between variables (including latent, hidden variables) and often comprise several algorithms to learn variations in data structures. Some of the most popular and extremely useful cutting-edge technological applications are shown below.
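The simplest building block of this family is a single artificial neuron. The sketch below trains one perceptron, in plain Python, to reproduce the logical AND of two inputs. It is only meant to show the weighted-sum-and-adjust idea; deep learning models stack many such units and train them with different update rules. The function names, learning rate and epoch count are assumptions for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Labelled training data for logical AND (a linearly separable pattern)
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_samples)
print(weights, bias)
print([predict(weights, bias, inputs) for inputs, _ in and_samples])
```

A single neuron like this can only separate patterns with a straight line; it is the layering of many neurons that lets networks capture the complex relationships described above.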
Bayesian Techniques:
These are a set of techniques that apply Bayes’ theorem in the problem domain. They are gaining popularity in both supervised and unsupervised learning models. They can be applied to both classification and regression problems, with great agility and flexibility under varying data conditions. Some of the most popular techniques in this space are shown below.
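At the heart of all of these is Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B). The short Python sketch below applies it to a two-outcome case; the function name and the screening-test numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive result) via Bayes' theorem."""
    # Total probability of seeing a positive result at all
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


# Hypothetical numbers: 1% of cases have the condition, the test detects
# 90% of them, and it wrongly flags 5% of the cases without the condition
print(posterior(prior=0.01, true_positive_rate=0.90, false_positive_rate=0.05))
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is the kind of reasoning Bayesian models apply systematically when updating beliefs with data.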
Time Series Analysis Techniques:
These are a slightly different group of techniques, applied specifically when the underlying data has a temporal structure. Within the field of statistical learning, they are specially designed to understand time-based phenomena in a problem domain. Some of the fundamental and recently popular techniques are shown below.
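As one small example of a technique that relies on temporal order, here is a sketch of simple exponential smoothing in plain Python. The function name, the smoothing factor and the series are assumptions for illustration; it is one of many time series methods and not necessarily among those in the figure.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: recent observations get more weight."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        # Blend the new observation with the previous smoothed value
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly values, listed in time order
demand = [120.0, 132.0, 101.0, 134.0, 90.0, 130.0]
print(exponential_smoothing(demand, alpha=0.5))
```

Each smoothed value depends on everything that came before it, so shuffling the series would change the result. That dependence on order is what sets this group apart from the techniques above.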
Ensemble Techniques:
These are a special set of techniques built to overcome the weaknesses of models that are built independently on the same training data. The idea is to combine the predictions (classification or regression) of different algorithms into one, in efficient and robust ways, so that the weaknesses of individual models are compensated for. Some of the key techniques are shown below:
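The simplest way to combine predictions is a majority vote. The sketch below, in plain Python, uses three hypothetical threshold rules as stand-ins for independently built classifiers; real ensembles combine trained models, but the voting step looks much the same.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the prediction that most models agree on."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, x):
    """Ask every model for a prediction and combine them by voting."""
    return majority_vote([model(x) for model in models])


# Hypothetical weak models: each is a simple rule with a different cut-off
models = [
    lambda x: "high" if x > 3 else "low",
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 7 else "low",
]

for value in (2, 4, 6, 8):
    print(value, ensemble_predict(models, value))
```

An odd number of models avoids ties in a two-class vote; for regression problems the same idea is usually applied by averaging the predictions instead.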
Now, let’s spend some time on the various types of tools used in the field of data science. I have grouped these tools into languages, tools specialised for visualisation, machine learning frameworks, tools for processing, storing and analysing big data, and of course tools built for focused machine learning processes.
I have consciously excluded commercial tools, packages and platforms from the classification below. You need to learn and use them if you are part of the corporate community, as they are designed and built to make overall data science implementation easier. I recommend that data scientists adopt these tools in the later stages; they helped me shift my focus from programming towards the business application of the models I built.
Some of the platforms I have gained exposure to are worth listing here: SAS, SPSS, RapidMiner, IBM Modeller, Alteryx, BayesiaLab, Azure ML, Azure stream, Tableau, QlikView, MicroStrategy, Lumira, Power BI, etc. One ends up learning and using them anyway as part of the job.
As I mentioned at the beginning, there has to be equal focus on the other important aspects of model building in advanced analytics, such as problem formulation, designing the analytical process, insight generation, visualisation and presentation, and last but not least, model maintenance and governance in the enterprise set-up.
In this blog I have not covered various tools and techniques that are still commonly used, nor those designed for specific tasks. I would like to conclude by thanking you for taking the time to read through this. I wish you all the best in your aspiration to become an effective and efficient data scientist.