Machine learning has mathematics and statistics at its core, so a solid understanding of both subjects is essential to analyse ML in depth; indeed, developing ML algorithms requires a working knowledge of statistics. There is a separate branch, Statistical Learning Theory, that draws its insights from statistics as well as from functional analysis. This article explores a result in statistical learning theory called the Representer Theorem, which finds application in areas such as pattern analysis and, specifically, Support Vector Machines (SVMs).

### Kernel Methods And RKHS

In the statistical context of ML, kernel methods are a family of algorithms for pattern analysis, i.e. for identifying patterns in data. Most other algorithms require the data to be converted into explicit feature vectors before any pattern-detection task can be performed. Kernel methods (or kernels, specifically) instead rely on similarity functions: a kernel operates in a feature space implicitly, computing for each pair of data points the value of their inner product in that space without ever computing the coordinates of the points in that space. This pairwise value is the ‘inner product’.
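A small sketch can make the inner-product idea concrete. The example below uses the polynomial kernel k(x, z) = (x·z)², a standard kernel (the feature map `phi` and variable names are illustrative choices, not from the article): the kernel value on the raw pair equals the inner product of explicit quadratic feature maps, so the feature space never has to be constructed.

```python
import numpy as np

# Illustrative sketch of the kernel trick with the polynomial kernel
# k(x, z) = (x . z)^2 for 2-D inputs.

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Same inner product, computed directly from the data pair."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # inner product in feature space
implicit = poly_kernel(x, z)        # kernel value on the raw pair
assert np.isclose(explicit, implicit)
```

Both routes give the same number; the kernel route simply skips the coordinates of the feature space.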

Kernels employ memory-based learning: instead of following the usual generalisation approach, they compare new, unseen instances against the training instances stored in memory.

Kernel methods rose to importance as a result of advances in pattern recognition, specifically handwriting recognition. As work on kernel methods grew, many of the related concepts were formalised in mathematics. One concept of particular importance to ML is the Reproducing Kernel Hilbert Space (RKHS), first developed by the Polish mathematician Stanisław Zaremba in his work on harmonic functions.

As the name suggests, an RKHS derives its mathematical structure from a Hilbert space, a vector space that generalises two-dimensional and three-dimensional Euclidean space. In mathematical terms, it is defined as:

“A Hilbert space is a vector space H with an inner product ⟨f, g⟩ such that the norm defined by

||f|| = √⟨f, f⟩

turns H into a complete metric space.”

Now, an RKHS is a Hilbert space of functions with an additional reproducing property: evaluation at any point x is a continuous linear functional, so the space contains a kernel function k(·, x) with f(x) = ⟨f, k(·, x)⟩ for every f in the space. In learning, regularisation then favours, among all functions that fit the data equally well, the one whose RKHS norm ||f|| is as small as possible. These two concepts form the basis of the representer theorem.
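For a function expanded in kernel terms, f = Σᵢ βᵢ k(·, zᵢ), the reproducing property gives its squared RKHS norm as ||f||² = Σᵢ Σⱼ βᵢ βⱼ k(zᵢ, zⱼ) = βᵀKβ, where K is the Gram matrix. A minimal sketch, assuming a Gaussian kernel and arbitrary illustrative centres and coefficients:

```python
import numpy as np

# Sketch: RKHS norm of f = sum_i beta_i k(., z_i) via the Gram matrix.
# The Gaussian kernel and the values of z and beta are assumed for
# illustration only.

def gaussian_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a - b) ** 2)

z = np.array([0.0, 1.0, 2.5])        # expansion centres z_i
beta = np.array([0.5, -1.0, 0.3])    # expansion coefficients beta_i

# Gram matrix K[i, j] = k(z_i, z_j), built by broadcasting
K = gaussian_kernel(z[:, None], z[None, :])

rkhs_norm_sq = beta @ K @ beta       # ||f||^2 = beta^T K beta
print(rkhs_norm_sq)
```

Because the kernel is positive definite, βᵀKβ is non-negative for any choice of β, which is what lets it serve as a squared norm.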

### Representer Theorem

With the help of RKHS, a result known as the representer theorem was formulated. It matters because popular kernels correspond to infinite-dimensional feature spaces, which are mathematically well defined but not practically viable, especially for training a learning machine, which generally amounts to solving an optimisation problem.

The representer theorem comes in two cases: one without parametric assumptions (nonparametric) and one with a partial parametric component (semiparametric). The definition and mathematical representation of both cases are given below:

**Nonparametric representer theorem (Theorem 1)**: Suppose we are given a nonempty set X, a positive-definite real-valued kernel k on X × X, a training sample (x_{1}, y_{1}), …, (x_{m}, y_{m}) ∈ X × R, a strictly monotonically increasing real-valued function g on [0, ∞), an arbitrary cost function c : (X × R²)^{m} → R ∪ {∞}, and a class of functions

F = { f ∈ R^{X} | f(·) = Σ^{∞}_{i=1} β_{i} k(·, z_{i}), β_{i} ∈ R, z_{i} ∈ X, ||f|| < ∞ }.

Here, ||·|| is the norm in the RKHS associated with k, i.e. for any z_{i} ∈ X, β_{i} ∈ R (i ∈ N),

|| Σ^{∞}_{i=1} β_{i} k(·, z_{i}) ||² = Σ^{∞}_{i=1} Σ^{∞}_{j=1} β_{i} β_{j} k(z_{i}, z_{j}).

Then any f ∈ F minimising the regularised risk functional

c((x_{1}, y_{1}, f(x_{1})), …, (x_{m}, y_{m}, f(x_{m}))) + g(||f||)

admits a representation of the form

f(·) = Σ^{m}_{i=1} α_{i} k(·, x_{i}).
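Kernel ridge regression is a standard instance of Theorem 1, with c the squared error and g(||f||) = λ||f||². The minimiser is then the finite expansion f(·) = Σᵢ αᵢ k(·, xᵢ) with α = (K + λI)⁻¹y. The sketch below assumes an RBF kernel and synthetic data; the names and constants are illustrative, not from the article.

```python
import numpy as np

# Sketch: the representer theorem in action via kernel ridge regression.
# Although the RBF kernel's RKHS is infinite-dimensional, the minimiser
# of the regularised risk is a finite expansion over the m training points.

rng = np.random.default_rng(0)

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

m = 20
x_train = rng.uniform(0, 3, size=m)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(m)

lam = 0.1
K = rbf(x_train, x_train)                            # m x m Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(m), y_train)  # alpha = (K + lam I)^-1 y

def f(x_new):
    """Predict via the finite expansion f(x) = sum_i alpha_i k(x, x_i)."""
    return rbf(x_new, x_train) @ alpha

print(f(np.array([1.5])))   # predict at a new point
```

Only m coefficients αᵢ are ever computed; the infinite-dimensional optimisation never appears explicitly.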

**Semiparametric representer theorem (Theorem 2)**: Suppose that, in addition to the assumptions of the previous theorem, we are given a set of M real-valued functions {ψ_{p}}^{M}_{p=1} on X with the property that the m × M matrix (ψ_{p}(x_{i}))_{ip} has rank M. Then any f’ := f + h, with f ∈ F and h ∈ span{ψ_{p}}, minimising the regularised risk

c((x_{1}, y_{1}, f’(x_{1})), …, (x_{m}, y_{m}, f’(x_{m}))) + g(||f||)

admits a representation of the form

f’(·) = Σ^{m}_{i=1} α_{i} k(x_{i}, ·) + Σ^{M}_{p=1} β_{p} ψ_{p}(·),

with unique coefficients β_{p} ∈ R for all p = 1, …, M.
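A minimal sketch of the semiparametric case, assuming squared-error cost, g(||f||) = λ||f||², an RBF kernel, and a single unpenalised basis function ψ₁ ≡ 1 (an intercept, so M = 1). Under those assumptions the optimality conditions reduce to the block linear system shown in the comments; the data and constants are illustrative only.

```python
import numpy as np

# Sketch of Theorem 2: kernel ridge regression with an unpenalised
# intercept psi_1(x) = 1. The minimiser f'(x) = sum_i alpha_i k(x_i, x)
# + beta_1 solves the block system
#   [K + lam*I   Psi] [alpha]   [y]
#   [Psi^T        0 ] [beta ] = [0]
# (the second row is the constraint Psi^T alpha = 0).

rng = np.random.default_rng(1)

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

m = 30
x = rng.uniform(0, 3, size=m)
y = 5.0 + np.sin(x) + 0.1 * rng.standard_normal(m)   # large offset for psi_1

lam = 0.1
K = rbf(x, x)
Psi = np.ones((m, 1))                                # psi_1(x) = 1, M = 1

# Assemble and solve the (m + M) x (m + M) block system.
top = np.hstack([K + lam * np.eye(m), Psi])
bottom = np.hstack([Psi.T, np.zeros((1, 1))])
sol = np.linalg.solve(np.vstack([top, bottom]), np.concatenate([y, [0.0]]))
alpha, beta = sol[:m], sol[m:]

def f_prime(x_new):
    """Kernel expansion over the training points plus the intercept."""
    return rbf(x_new, x) @ alpha + beta[0]
```

The kernel part is still a finite expansion over the m training points; only the M extra coefficients β_{p} for the unpenalised span are added on top.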

In both theorems, the cost function c measures how well the function fits the training data, while g(||f||) is a regulariser penalising its RKHS norm. Their significance for ML is that, even though the RKHS may be infinite-dimensional, the minimiser of the regularised risk always lies in the span of the kernel functions centred at the m training points, so training reduces to finding finitely many coefficients α_{i}.

### Conclusion

These mathematical representations may confuse beginners, who are advised to go through the basic concepts of kernel methods before working with the representer theorem. The theorem is very important when training kernel models: it guarantees that minimising the regularised risk functional reduces to a finite-dimensional optimisation over the training points.
