The last century has seen tremendous innovation in the field of mathematics. New theories have been postulated and traditional theorems have been made robust by persistent mathematicians. We are still reaping the benefits of their exhaustive endeavours as we build intelligent machines. The field of data science is built on some ingenious mathematical and logical hypotheses and tools.
Here we list a few concepts that form the foundation of data science, taken from a book co-authored by Ravi Kannan, Principal Researcher at Microsoft Research India:
Singular Value Decomposition
Modern data often consists of feature vectors with a large number of features. The conversion of data into vectors is domain specific.
High-dimensional geometry and Linear Algebra are two of the crucial areas which form the mathematical foundations of Data Science.
Length-squared sampling in matrices, singular value decomposition and low-rank approximation are a few of the techniques that are widely used in data processing.
For example, the singular value decomposition finds the best-fitting k-dimensional subspace, for k = 1, 2, 3, …, for the set of N data points. Here, “best” means minimizing the sum of the squares of the perpendicular distances of the points to the subspace, or equivalently, maximizing the sum of squares of the lengths of the projections of the points onto this subspace.
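The equivalence between the two notions of “best” can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the matrix `A`, the choice `k = 2` and the random seed are illustrative assumptions, not taken from the book. Note that the subspace found this way passes through the origin, since the data is not centered here.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, illustrative data only
A = rng.normal(size=(200, 5))            # N = 200 points (rows) in 5 dimensions
k = 2                                    # dimension of the subspace to fit

# Rows of Vt are the right singular vectors, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T                           # orthonormal basis of the best-fit k-dim subspace

proj = A @ V_k @ V_k.T                   # projection of every point onto the subspace
proj_sq = np.sum(proj ** 2)              # sum of squared projection lengths
dist_sq = np.sum((A - proj) ** 2)        # sum of squared perpendicular distances
total_sq = np.sum(A ** 2)                # sum of squared lengths of the points

# Pythagoras: total = projection + distance for every point, so maximizing
# the projected part is the same as minimizing the perpendicular part.
assert np.isclose(total_sq, proj_sq + dist_sq)

# The projected part equals the sum of the top-k squared singular values.
assert np.isclose(proj_sq, np.sum(s[:k] ** 2))
```

Because the total squared length of the points is fixed, any subspace that captures more of it in projections must leave less of it in perpendicular distances, which is why the two criteria pick the same subspace.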
SVD is traditionally used in principal component analysis (PCA). PCA is widely used for feature extraction and for understanding how significant the relationships among the features (properties) are to an outcome.
More often than not, data is unstructured, vast and vague. Making sense of it is the job of a data scientist. The simplest, most intuitive way of reducing the complexity in data is to divide it into groups and then deal with each group individually. Grouping data points is traditionally done using clustering methods like k-means. Lloyd’s algorithm is one such method, and it goes as follows:
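To make the link between SVD and PCA concrete, here is a small sketch that computes principal components by mean-centering the data and taking its SVD. The function name `pca_via_svd` and the synthetic dataset are hypothetical, chosen only for illustration.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Return (scores, components, explained_variance_ratio) for data matrix X."""
    X_centered = X - X.mean(axis=0)        # PCA works on mean-centered features
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]         # principal directions, one per row
    scores = X_centered @ components.T     # coordinates of each sample in the new basis
    variance = s ** 2 / (X.shape[0] - 1)   # variance along each principal direction
    ratio = variance[:n_components] / variance.sum()
    return scores, components, ratio

rng = np.random.default_rng(1)
# Synthetic data: the second feature is nearly a copy of the first,
# and the third is small independent noise.
x = rng.normal(size=500)
X = np.column_stack([x,
                     x + 0.05 * rng.normal(size=500),
                     0.1 * rng.normal(size=500)])

scores, components, ratio = pca_via_svd(X, n_components=2)
print("explained variance ratio:", ratio)
```

In this toy dataset the first two features are strongly related, so most of the variance should land on the first principal component. That is the sense in which PCA reveals relationships among features.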
- Start with k centers.
- Cluster each point with the center nearest to it.
- Find the centroid of each cluster and replace the set of old centers with these centroids.
- Repeat the above two steps until the centers converge according to some criterion, such as the k-means score no longer improving.
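The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the initialization (picking k random data points), the tolerance `tol`, the handling of empty clusters and the helper names are all assumptions made for this example.

```python
import numpy as np

def assign(points, centers):
    """Cluster each point with its nearest center; return labels and k-means score."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    score = sq_dists[np.arange(len(points)), labels].sum()
    return labels, score

def lloyd_kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k centers (assumption: k distinct data points chosen at random).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels, score = assign(points, centers)
    for _ in range(max_iters):
        # Replace each center with the centroid of its cluster;
        # an empty cluster keeps its old center.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        labels, new_score = assign(points, centers)
        improved = score - new_score > tol
        score = new_score
        if not improved:                   # k-means score no longer improving
            break
    return centers, labels, score

# Illustrative data: three blobs in the plane.
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
centers, labels, score = lloyd_kmeans(blobs, k=3)
print("centers:\n", centers)
print("k-means score:", score)
```

Running the function with different `seed` values may give different final scores, because each run can settle into a different local optimum.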
Lloyd’s algorithm does not necessarily find a globally optimal solution but will find a locally-optimal one. An important but unspecified step in the algorithm is its initialization: how the starting k centers are chosen.
Be it sentiment analysis for recommendation systems or identifying protein sequences in cancer cells, clustering is widely applicable.
A good machine learning model makes predictions from a database of random examples. The basic goal is to perform as well, or nearly as well, as the best predictor in a family of functions, such as neural networks or decision trees. For a given model and function family, if this goal can be achieved under some reasonable constraints, the family is said to be learnable in the model.
Machine-learning theorists are typically able to transform questions about the learnability of a particular function family into problems that involve analysing various notions of dimension that measure some aspect of the family’s complexity. For example, the appropriate notion for analysing PAC learning is known as the Vapnik–Chervonenkis (VC) dimension, and, in general, results relating learnability to complexity are sometimes referred to as Occam’s-razor theorems.
Occam’s razor is the notion, stated by William of Occam around AD 1320, that in general one should prefer simpler explanations over more complicated ones.
Why should one do this, and can we make a formal claim about why this is a good idea? What if each of us disagrees about precisely which explanations are simpler than others?
The formal version of Occam’s razor does not say that complicated rules are necessarily bad. What it does say is that Occam’s razor is a good policy: simple rules are unlikely to fool us, since there are just not that many simple rules.
As a machine learning model heads to production, all these statistical methods and techniques come down to one thing: a YES or NO decision.
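The counting argument can be made quantitative. There are fewer than 2^b rules that can be written in fewer than b bits, so a union bound over them gives the standard Occam-style sample-size bound m ≥ (b ln 2 + ln(1/δ)) / ε. With that many training examples, with probability at least 1 − δ every such rule that is consistent with the sample has true error at most ε. The sketch below evaluates this bound; the function name and the example values of ε and δ are assumptions chosen for illustration.

```python
import math

def occam_sample_size(bits, epsilon, delta):
    """Training-set size sufficient so that, with probability >= 1 - delta,
    every rule describable in fewer than `bits` bits that is consistent with
    the sample has true error at most `epsilon` (union bound over < 2**bits rules).
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / epsilon)

# The required sample size grows only linearly with description length.
for bits in (10, 100, 1000):
    print(bits, occam_sample_size(bits, epsilon=0.05, delta=0.01))
```

The bound depends only on how many bits a rule needs in some fixed description language, which is why it holds even if people disagree about which explanations count as simple.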
The book Foundations of Data Science, authored by Avrim Blum, John Hopcroft and Ravindran Kannan, covers other interesting fundamental topics such as:
- Law of large numbers
- Geometry of high dimensions
- Matrix operations
- Random walks in Euclidean space
- Gradient Descent methods
- Graph partitioning
- Bayesian or belief networks
The book also covers many other concepts, each supplemented with the intuition behind the math.
Download the free book here