The life cycle of a machine learning model involves a training phase, in which a data scientist develops a model that predicts well on historical data and on the features extracted from that data. The model is then put into production in the hope that it will keep similar predictive performance throughout its deployment.
A typical machine learning pipeline consists of the following processes:
- Data collection
- Data cleaning
- Feature extraction (labelling and dimensionality reduction)
- Model validation
The captured data should be pulled together, and the benefits of collecting it should outweigh the costs of collection and analysis. Feature extraction then becomes a key aspect of any data-driven project.
The central idea behind any feature selection technique is to simplify models, reduce training times, and avoid the curse of dimensionality without losing much information. Modern data often consists of feature vectors with a large number of features, and the conversion of data into vectors is domain specific.
Length squared sampling in matrices, singular value decomposition (SVD), and low-rank approximation are a few of the techniques widely used in data processing. For example, the singular value decomposition finds the best-fitting k-dimensional subspace, for k = 1, 2, 3, …, for a set of N data points. Here, “best” means minimizing the sum of the squares of the perpendicular distances of the points to the subspace or, equivalently, maximizing the sum of the squares of the lengths of the projections of the points onto this subspace.
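The equivalence between the two definitions of “best” follows from Pythagoras, and it can be checked numerically. The snippet below is a minimal sketch using NumPy; the function name `best_fit_subspace` and the synthetic data are illustrative assumptions, not part of the original article.

```python
import numpy as np

def best_fit_subspace(points, k):
    """Return an orthonormal basis (k x d) of the best-fitting k-dimensional
    subspace through the origin for the rows of `points`."""
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt[:k]

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 5 dimensions lying close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
points = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

basis = best_fit_subspace(points, k=2)
projections = points @ basis.T @ basis   # projections onto the subspace
residuals = points - projections         # perpendicular components

total = np.sum(points ** 2)
projected = np.sum(projections ** 2)
perpendicular = np.sum(residuals ** 2)
# Pythagoras: total = projected + perpendicular, so minimizing the
# perpendicular part is the same as maximizing the projected part.
```

Because `total` is fixed by the data, any subspace that makes `perpendicular` smaller must make `projected` larger by the same amount.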
SVD is traditionally used in Principal Component Analysis (PCA). PCA is popular for dimensionality reduction, but its underlying assumptions depend on linearity. For nonlinear problems, which is what real-world scenarios usually are, models like autoencoders and genetic algorithms offer useful alternatives.
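To make the link between SVD and PCA concrete, here is a small sketch of PCA computed through the SVD of the centered data matrix. The helper name `pca_svd` and the random data are assumptions made for illustration. Note that the reduced representation is a single matrix product, which is exactly the linearity restriction mentioned above.

```python
import numpy as np

def pca_svd(data, k):
    """Project the rows of `data` onto their first k principal components."""
    centered = data - data.mean(axis=0)        # PCA works on centered data
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                        # principal directions (k x d)
    scores = centered @ components.T           # reduced representation (n x k)
    explained_variance = singular_values[:k] ** 2 / (len(data) - 1)
    return scores, components, explained_variance

rng = np.random.default_rng(1)
# Hypothetical data: 300 correlated samples in 4 dimensions.
data = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
scores, components, explained_variance = pca_svd(data, k=2)

# Cross-check: the variances should match the largest eigenvalues
# of the sample covariance matrix.
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
```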
Denoising To Solve Identity Function Problem
The encoder part of an autoencoder is equivalent to PCA if a linear encoder, a linear decoder, and a squared-error loss function are used with normalized inputs. In other words, PCA is restricted to linear maps, whereas autoencoders are not.
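In symbols, and with notation that is assumed here rather than taken from the original article, a linear autoencoder with a k-unit hidden layer, encoder matrix $W$, and decoder matrix $W'$ solves:

```latex
\min_{W \in \mathbb{R}^{k \times d},\; W' \in \mathbb{R}^{d \times k}}
\; \sum_{i=1}^{N} \left\lVert x_i - W' W x_i \right\rVert^2
```

For centered inputs, the minimum is reached when $W'W$ is the orthogonal projector onto the subspace spanned by the top k principal directions. The learned weights need not equal the principal components themselves; they only have to span the same subspace.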
Although these models were developed to handle nonlinearities in data, autoencoders with more hidden units than inputs run the risk of learning the identity function, where the output simply equals the input, thereby becoming useless.
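A small numerical sketch shows how easily this can happen when the hidden layer is wider than the input. The weights below are hypothetical and chosen by hand rather than learned: any full-rank linear encoder, paired with its pseudo-inverse as the decoder, copies the input exactly without capturing any structure in the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_inputs, n_hidden = 4, 6            # hidden layer wider than the input

# Hypothetical linear autoencoder weights (not trained).
encoder = rng.normal(size=(n_hidden, n_inputs))
decoder = np.linalg.pinv(encoder)    # shape (n_inputs, n_hidden)

x = rng.normal(size=(10, n_inputs))  # 10 arbitrary input vectors
hidden = x @ encoder.T               # encode
reconstruction = hidden @ decoder.T  # decode

# Largest absolute difference between input and output.
reconstruction_error = np.max(np.abs(reconstruction - x))
```

The reconstruction error is zero up to floating-point precision for any input, so a low reconstruction loss alone says nothing about whether useful features were learned.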
To overcome this, Denoising Autoencoders (DAE) were developed. In this technique, the input is randomly corrupted with noise, which forces the autoencoder to reconstruct the original input from the corrupted version, that is, to denoise it.
Denoising is recommended as a training criterion for learning to extract useful features that will constitute a better higher level representation.
The idea here is that whenever a network is being trained, it produces an output and measures the distance between that output and the target through a loss function. To minimize the loss, it repeatedly resamples the shuffled inputs and reconstructs the data, adjusting its parameters until its reconstructions come as close as possible to what it has been told is true.
When noise is added to the input, the DAE, which is trained to recover the clean input from the corrupted one, reconstructs it. In the course of this reconstruction, the DAE learns higher-level representations (features).
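The training loop described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the article's implementation: it uses one sigmoid hidden layer, a linear output, a squared-error loss, full-batch gradient descent, and masking noise. The function name `train_dae`, the hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(data, n_hidden=8, corruption_level=0.3, lr=0.05,
              epochs=200, seed=0):
    """Train a one-hidden-layer denoising autoencoder with full-batch
    gradient descent. Returns the parameters and the loss per epoch."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    w1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    history = []
    for _ in range(epochs):
        # Fresh masking noise each epoch: every entry is zeroed
        # independently with probability corruption_level.
        mask = rng.binomial(n=1, p=1.0 - corruption_level, size=data.shape)
        corrupted = data * mask
        hidden = sigmoid(corrupted @ w1 + b1)   # encode the corrupted input
        output = hidden @ w2 + b2               # decode
        error = output - data                   # compare with the CLEAN input
        history.append(np.mean(np.sum(error ** 2, axis=1)))
        # Backpropagation of the mean squared reconstruction error.
        grad_out = 2.0 * error / n
        grad_w2 = hidden.T @ grad_out
        grad_b2 = grad_out.sum(axis=0)
        grad_hidden = (grad_out @ w2.T) * hidden * (1.0 - hidden)
        grad_w1 = corrupted.T @ grad_hidden
        grad_b1 = grad_hidden.sum(axis=0)
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return (w1, b1, w2, b2), history

rng = np.random.default_rng(0)
# Hypothetical data: 256 samples in 16 dimensions driven by 3 latent factors.
data = sigmoid(rng.normal(size=(256, 3)) @ rng.normal(size=(3, 16)))
params, history = train_dae(data)
```

The key design point is that the network sees `corrupted` but is scored against `data`, so copying its input is no longer a good strategy.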
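Sample code snippet to add noise to the input (a method of the autoencoder class, where self.theano_rng is assumed to be a Theano random stream):
<code>
import theano

def get_corrupted_input(self, input, corruption_level):
    """Keep ``1 - corruption_level`` of the entries of the input unchanged
    and zero out a randomly selected subset of size ``corruption_level``."""
    # Binomial mask with n=1: each entry is 1 with probability
    # 1 - corruption_level and 0 otherwise.
    return self.theano_rng.binomial(size=input.shape, n=1,
                                    p=1 - corruption_level,
                                    dtype=theano.config.floatX) * input
</code>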
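For readers without Theano, the same masking noise can be sketched with NumPy alone. The function name `corrupt` and the test array are illustrative assumptions; the logic mirrors the snippet above, drawing a 0/1 mask from a binomial distribution with n=1 and multiplying it into the input.

```python
import numpy as np

def corrupt(inputs, corruption_level, rng):
    """Zero out each entry of `inputs` independently with probability
    `corruption_level`; the remaining entries are left unchanged."""
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=inputs.shape)
    return inputs * mask

rng = np.random.default_rng(42)
clean = np.ones((1000, 50))
noisy = corrupt(clean, corruption_level=0.3, rng=rng)

# Fraction of entries that were zeroed; close to corruption_level
# for a large array.
zero_fraction = np.mean(noisy == 0)
```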
Check the full implementation of DAE here
In short, a Denoising Autoencoder does two things:
- it tries to encode the input (preserving the information about the input)
- it tries to undo the effect of a corruption process stochastically applied to the input of the autoencoder.
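Both goals can be written as a single training objective. The notation is assumed here for illustration: $x$ is a clean input, $\tilde{x}$ is its corrupted version drawn from a corruption process $q(\tilde{x} \mid x)$, $f_{\theta}$ is the encoder, $g_{\theta'}$ is the decoder, and $L$ is a reconstruction loss such as squared error.

```latex
\min_{\theta,\, \theta'} \;
\mathbb{E}_{x} \, \mathbb{E}_{\tilde{x} \sim q(\tilde{x} \mid x)}
\left[ L\!\left( x,\; g_{\theta'}\!\left( f_{\theta}(\tilde{x}) \right) \right) \right]
```

The network only ever sees $\tilde{x}$, which covers the second goal, while the loss is measured against the clean $x$, which covers the first.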
The idea of training a multi-layer perceptron on denoising tasks is not new. The approach was first introduced by LeCun (1987) as an alternative method to learn an (auto-)associative memory similar to Hopfield Networks (Hopfield, 1982).
Past experiments show that, unlike ordinary autoencoders, Denoising Autoencoders (DAE) are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images.
These results support the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher-level representations.