Deep learning has made tremendous progress in computer vision, natural language processing and other areas of machine learning. But how these models work as well as they do at such scale has been an open question for quite a while. This mystery has led many researchers to focus specifically on understanding large deep learning models.
Computational learning theory gives a formal framework to study the predictive and computational power of machine learning models. It is mostly concerned with how efficiently a model uses data (sample complexity) and computation (time complexity).
Recent research led by Dr. Naftali Tishby has drawn exciting connections between computational learning theory and information theory. Most machine learning practitioners take a very narrow view of information theory, treating it as a field mostly related to communication and compression. But recent developments, coupled with decade-old research, suggest that information theory has massive implications for deep learning.
Generalisation Bounds For Deep Learning
Computational learning theory gives generalisation bounds to estimate the power of machine learning models. But as it turns out, these bounds don’t really work for deep learning. The primary reason is the very large number of parameters in neural networks. Since the inspiration behind neural networks is the human brain, it becomes harder to understand how today’s neural networks actually work.
There is an ongoing effort among researchers to understand how neural networks generalise so well. Recently Naftali Tishby, a computer scientist and neuroscientist from the Hebrew University of Jerusalem, created some excitement among artificial intelligence researchers when he offered a theory of how information theory can explain advances in deep learning.
According to Tishby, deep neural networks follow a procedure known as the information bottleneck, which he and two collaborators had worked on in 1999. Tishby has since returned with more experiments that validate his claims. His theory says that deep learning procedures compress information during training and throw away useless information, much like Sir Arthur Conan Doyle’s famous detective Sherlock Holmes.
The Connection Between Information Theory And Deep Learning
Over the years, deep learning has experienced incredible success. But at the same time, there is constant criticism of the lack of theoretical explanation for deep networks. In the standard statistical paradigm, the main tactic is to start with a large set of candidate solutions and then restrict or remove the complex ones. In deep learning, stochastic gradient descent already works as a powerful regulariser. But how does it actually work? It is still mathematically unclear.
Fig. 1: The structure of a deep neural network, which consists of the target label Y, input layer X, hidden layers h1,…,hm and the final prediction Ŷ. (Image source: Tishby and Zaslavsky, 2015)
Information theory helps us to peek into the black box of deep learning. The training data contains sampled observations from the joint distribution of X and Y. The input variable X and the weights of the hidden layers are all high-dimensional random variables. The ground truth target Y and the predicted value Ŷ are random variables of smaller dimensions in the classification setting. If we label the hidden layers of a DNN as h1, h2,…, hm as showcased in the figure above, we can view each layer as one state of a Markov chain: h_i → h_{i+1}.
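One consequence of the Markov chain view is the data processing inequality: a later layer cannot hold more information about X than an earlier one. The sketch below illustrates this on a toy discrete chain X → h1 → h2. The probabilities, the noisy channel and the helper names mutual_information and push_through are all made up for illustration and do not come from Tishby's experiments.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (p_a[a] * p_b[b]))
        for (a, b), p in joint.items()
        if p > 0
    )

def push_through(joint, channel):
    """Given p(x, h) and a channel p(h_next | h), return p(x, h_next)."""
    out = {}
    for (x, h), p in joint.items():
        for h_next, q in channel[h].items():
            out[(x, h_next)] = out.get((x, h_next), 0.0) + p * q
    return out

# Hypothetical toy numbers: X is a fair bit, h1 is a noisy copy of X,
# and h2 is a noisy copy of h1 (so h2 depends on X only through h1).
p_x_h1 = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
noisy = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
p_x_h2 = push_through(p_x_h1, noisy)

i_x_h1 = mutual_information(p_x_h1)
i_x_h2 = mutual_information(p_x_h2)
print(f"I(X;h1) = {i_x_h1:.3f} bits, I(X;h2) = {i_x_h2:.3f} bits")
```

In this toy chain the second value comes out smaller than the first, which is the sense in which each layer can only lose information about the input.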
A deep neural network is designed to learn how to describe X in order to predict Y and, eventually, to compress X so that it holds only the information related to Y. Tishby describes this process as “successive refinement of relevant information”.
The information plane theorem that Tishby put forward frames each layer by its encoder and decoder information, that is, by how much information the layer holds about the input X and about the target Y. The encoder is a representation of the input data X, while the decoder translates the information in the current layer to the target output Y.
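To make the two coordinates concrete, here is a minimal sketch that places a layer on the information plane by estimating I(X;T) and I(T;Y) from samples, where T stands for the layer's (discretised) activations. The plug-in estimator, the toy data and the names empirical_mi and information_plane_point are assumptions for illustration. Real activations are continuous, so they would first need to be binned, and the choice of binning affects the estimate.

```python
import math
from collections import Counter

def empirical_mi(pairs):
    """Plug-in estimate of mutual information (bits) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

def information_plane_point(xs, ts, ys):
    """Return (I(X;T), I(T;Y)) for one layer, given aligned discrete samples."""
    return empirical_mi(list(zip(xs, ts))), empirical_mi(list(zip(ts, ys)))

# Hypothetical toy data: inputs 0..7, label = parity of the input,
# and two made-up "layers" with already-discretised activations.
xs = list(range(8)) * 4
ys = [x % 2 for x in xs]
layer_1 = [x % 4 for x in xs]  # keeps two of the three input bits
layer_2 = [x % 2 for x in xs]  # keeps only the bit that determines the label

for name, ts in [("layer_1", layer_1), ("layer_2", layer_2)]:
    i_xt, i_ty = information_plane_point(xs, ts, ys)
    print(f"{name}: I(X;T) = {i_xt:.2f} bits, I(T;Y) = {i_ty:.2f} bits")
```

In this toy setup the second layer keeps less information about the input but just as much about the label, which is the kind of movement on the information plane that the theory describes.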
Deep Learning As An Information Bottleneck Procedure
In 2015, Tishby and his students presented a hypothesis that deep learning is an information bottleneck procedure. The procedure compresses input data while retaining as much information about the output as possible. Newer research now digs deeper into this hypothesis. As part of this, the researchers used a neural network to recognise dogs and investigated how the network behaves with 3,000 sample input data points.
They observed how much information each layer of the neural network retained and how much of it was related to the output label. The researchers found that the networks converged to the information bottleneck theoretical bound. This bound, described in Tishby’s original paper, is one of the few theoretical limits available for deep learning. It represents the best case, in which the neural network extracts all the relevant information.
The idea is that neural networks compress the input data as much as possible without losing the generalisation ability. This discovery also puts a spotlight on the peculiarity of the training phases of large neural networks. The researchers discovered that deep learning happens in two phases — a short fitting phase, where the network learns to predict the labels, and a longer compression phase where the network works to better its generalisation capabilities.
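The trade-off described above, compressing the input as far as possible while keeping what matters for the label, is commonly written as an optimisation problem. The notation here is introduced for illustration and does not appear in the article: T is the compressed representation (a hidden layer, in the deep learning reading), I(·;·) is mutual information, and β is a positive parameter that sets how much predictive information is worth relative to compression.

```latex
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y), \qquad \beta > 0
```

A small β favours aggressive compression of X, while a large β favours keeping everything that helps predict Y.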
Learning Is Forgetting
Deep learning was imagined by early AI pioneers as a way to replicate the workings of the human brain. But since those early days, research has moved away from the biological plausibility of the models. More and more emphasis has been placed on how the models perform on real-world tasks. It remains to be seen how much of the work in neuroscience can be translated into advances in deep learning models.
But Tishby strongly believes that his ideas will be useful in both the neuroscience and the machine learning communities. He proudly says, “The most important part of learning is actually forgetting.” This is quite appropriate, since his ideas suggest that leaving some details behind can help us build models which learn better.