The last century has seen tremendous innovation in mathematics. New theories have been proposed, and classical theorems have been made rigorous by persistent mathematicians. We are still reaping the benefits of that work as we build intelligent machines.

Here are five theorems that serve as cornerstones of standard machine learning models:

### The Gauss-Markov Theorem

This theorem traces back to work by Carl Friedrich Gauss in 1821 and by Andrey Markov in 1900. Its modern formulation was given by F. A. Graybill in 1976.

**Statement:** In a linear model whose error distribution is unknown, the least-squares estimator has the smallest variance among all linear unbiased estimators of the model parameters. The errors are assumed to be uncorrelated, each with expectation zero and the same (unknown) variance.

**Application:** Linear Regression models
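To make the statement concrete, here is a minimal sketch of ordinary least squares with NumPy. The data, the `true_beta` values and the helper name `ols_fit` are assumptions made for illustration. The errors are drawn from a uniform distribution on purpose, since the theorem does not require them to be Gaussian.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares estimate of beta in the linear model y = X @ beta + error."""
    # lstsq minimises ||X @ beta - y||^2 without forming an explicit inverse.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])   # intercept column plus one feature
true_beta = np.array([2.0, -3.0])      # hypothetical "true" parameters

# Zero-mean, equal-variance, independent errors that are deliberately not Gaussian.
errors = rng.uniform(-0.5, 0.5, size=n)
y = X @ true_beta + errors

beta_hat = ols_fit(X, y)
print("estimated parameters:", beta_hat)
```

The estimate should land close to `true_beta`. The theorem adds that no other linear unbiased estimator would have a smaller variance under these assumptions.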

### Universal Approximation Theorem

**Statement:** A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, under mild assumptions on the activation function.

**Application:** Artificial neural networks
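The theorem only asserts that a good approximation exists; it does not say how to find it. As a rough illustration, the sketch below builds a network with one hidden `tanh` layer and approximates `sin` on the compact interval [-π, π]. It makes a simplifying assumption: the hidden weights are fixed at random and only the output layer is fitted by least squares, which is not how networks are usually trained. The function name `fit_single_hidden_layer` and all numeric settings are hypothetical.

```python
import numpy as np

def fit_single_hidden_layer(x, y, width, seed=0):
    """Fit y ~ c_0 + sum_j c_j * tanh(w_j * x + b_j) with random, fixed w_j and b_j."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 2.0, size=width)      # hidden weights (random, not trained)
    b = rng.uniform(-3.0, 3.0, size=width)    # hidden biases (random, not trained)

    def hidden(x_in):
        # One column per hidden neuron, plus a constant column for the output bias.
        h = np.tanh(np.outer(x_in, w) + b)
        return np.column_stack([np.ones(len(x_in)), h])

    # Only the output layer is fitted, by linear least squares.
    c, *_ = np.linalg.lstsq(hidden(x), y, rcond=None)
    return lambda x_in: hidden(x_in) @ c

x = np.linspace(-np.pi, np.pi, 400)
y = np.sin(x)

narrow = fit_single_hidden_layer(x, y, width=2)
wide = fit_single_hidden_layer(x, y, width=100)

err_narrow = np.max(np.abs(narrow(x) - y))
err_wide = np.max(np.abs(wide(x) - y))
print(f"max error with 2 neurons: {err_narrow:.3g}, with 100 neurons: {err_wide:.3g}")
```

Widening the hidden layer should shrink the worst-case error on the interval, which is the behaviour the theorem leads one to expect.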

### Singular Value Decomposition

It generalises the eigendecomposition of a symmetric matrix with non-negative eigenvalues to any *m* × *n* matrix, via an extension of the polar decomposition.

**Statement:** Suppose **M** is an *m* × *n* matrix whose entries come from the field *K*, which is either the field of real numbers or the field of complex numbers. Then there exists a factorisation, called a ‘singular value decomposition’ of **M**, of the form

$$\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{*}$$

where **U** is an *m* × *m* unitary matrix over *K* (when *K* is the field of real numbers, unitary matrices are orthogonal matrices), **Σ** is a diagonal *m* × *n* matrix with non-negative real numbers on the diagonal, **V** is an *n* × *n* unitary matrix over *K*, and **V**∗ is the conjugate transpose of **V**.

**Application:** Principal Component Analysis
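The factorisation and its link to PCA can be checked numerically. The following sketch uses `numpy.linalg.svd` on a small random real matrix; the matrix size and the variable names are arbitrary choices for illustration. For PCA, the rows of the matrix are treated as samples and the columns as features.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))                 # a real 6 x 3 matrix, so m = 6 and n = 3

# full_matrices=True returns U as m x m and Vh (the conjugate transpose of V) as n x n.
U, s, Vh = np.linalg.svd(M, full_matrices=True)

# Rebuild the rectangular m x n diagonal matrix Sigma from the singular values.
Sigma = np.zeros(M.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
M_rebuilt = U @ Sigma @ Vh

# PCA sketch: centre the columns, then take the SVD of the centred data.
centred = M - M.mean(axis=0)
_, s_c, Vh_c = np.linalg.svd(centred, full_matrices=False)
components = Vh_c                                   # rows are principal directions
explained_variance = s_c**2 / (M.shape[0] - 1)      # variance along each direction
```

The squared singular values of the centred data, divided by the number of samples minus one, match the eigenvalues of the sample covariance matrix. That is why PCA can be computed from an SVD without forming the covariance matrix.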

### Mercer’s Theorem

Published by Mercer in 1909, this theorem represents a symmetric positive-definite function on a square as the sum of a convergent sequence of product functions.

**Statement:** Suppose $K$ is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of $K$ such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $[a, b]$, and $K$ has the representation

$$K(s, t) = \sum_{i=1}^{\infty} \lambda_i \, e_i(s) \, e_i(t),$$

where the convergence is absolute and uniform.

**Application:** Support Vector Machines.
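A finite-sample analogue of the theorem is easy to check: the Gram matrix of a valid kernel on any set of points is symmetric with non-negative eigenvalues, and it can be rebuilt from its eigenpairs. The sketch below assumes a Gaussian (RBF) kernel and 30 evenly spaced points on [0, 1]; both choices, and the helper name `rbf_kernel`, are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(s, t) = exp(-gamma * (s - t)^2) on 1-D inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 30)
K = rbf_kernel(x, x)                       # Gram matrix, K[i, j] = k(x_i, x_j)

# eigh is meant for symmetric matrices and returns real eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(K)

# Discrete counterpart of the expansion: K = sum_i lambda_i * v_i * v_i^T.
K_rebuilt = (eigvecs * eigvals) @ eigvecs.T

print("smallest eigenvalue:", eigvals.min())
```

Up to floating-point round-off, the smallest eigenvalue is not negative. This property is what lets a support vector machine treat kernel values as inner products in some feature space.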

### Representer Theorem

**Statement:** Among all functions that admit an infinite representation in terms of eigenfunctions by Mercer’s theorem, the one that minimises the regularised risk always has a finite representation in the basis formed by the kernel evaluated at the *n* training points:

$$f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i)$$

where $f^{*}$ is the minimiser over the Hilbert space $H$, $k$ is the reproducing kernel of $H$, $x_1, \dots, x_n$ are the training points, and the $\alpha_i$ are real coefficients.

**Application:** Kernel methods (a class of algorithms for pattern analysis, such as support vector machines)
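Kernel ridge regression is one of the simplest places to see the theorem at work. The sketch below assumes a squared loss with regulariser `lam * ||f||^2`, in which case the coefficients of the finite representation solve the linear system `(K + lam * I) alpha = y`. The kernel, the data and the value of `lam` are illustrative choices, not something prescribed by the theorem.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    """Gaussian (RBF) kernel k(s, t) = exp(-gamma * (s - t)^2) on 1-D inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2.0 * np.pi * x_train)
lam = 1e-3                                   # regularisation strength (assumed)

# For squared loss, the coefficients solve (K + lam * I) alpha = y.
K = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    # f(x) = sum_i alpha_i * k(x, x_i): one term per training point, nothing more.
    return rbf_kernel(x_new, x_train) @ alpha

x_test = np.array([0.1, 0.25, 0.6])
print(predict(x_test))
```

Although the search space is an infinite-dimensional function space, the fitted function is fully described by the 20 numbers in `alpha`, one per training point.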