To a large extent, deep learning is all about solving optimisation problems. According to computer science researchers, stochastic gradient descent, better known as SGD, has become the workhorse of deep learning, which, in turn, is responsible for the remarkable progress in computer vision.
SGD is a simple variant of classical gradient descent in which the stochasticity comes from using a random subset of the measurements (a mini-batch) to compute the gradient at each descent step. Despite its simplicity, it also has implicit regularisation effects, making it suited for highly non-convex loss functions, such as those entailed in training deep networks for classification.
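As a rough sketch, one common way to write a single mini-batch SGD step is shown below. The symbols are assumptions introduced here for illustration and do not come from the researchers quoted: w_t is the parameter vector at step t, \eta is the learning rate, B_t is the randomly drawn mini-batch, and \ell is the loss on one example.

```latex
% One mini-batch SGD step (notation is illustrative)
w_{t+1} = w_t - \eta \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \, \ell(w_t; x_i, y_i)
```

Classical gradient descent is the special case in which B_t contains every training example, so the only change SGD makes is to replace the exact average gradient with a cheaper, noisier estimate of it.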
SGD is so popular that it is now billed as the cornerstone of deep learning. According to Sanjeev Arora, a professor of Computer Science at Princeton, research in deep learning is taking place in four core areas:
- Non-convex optimisation
- Over-parameterisation and generalisation
- Role of depth
- Generative models
SGD falls under non-convex optimisation. Google researcher Ali Rahimi indicated that the study of non-convex optimisation for deep neural networks will largely address two questions:
- What does the loss function look like?
- Why does SGD converge?
Good optimisation is a core part of deep learning, and a significant performance boost often comes from better optimisation techniques. In fact, researchers believe the choice of optimisation algorithm matters, especially when one is dealing with large datasets. This is especially the case for stochastic algorithms: in stochastic settings, researchers only observe a subset of the data at any particular time, so improved optimisation techniques allow them to make the most efficient use of the data. One particular trick is maintaining a running mean of gradients over time and adding that to the current gradient.
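The running-mean trick described above reads like what is usually called momentum. The following is a minimal sketch under that assumption, using the classical "heavy-ball" form on a toy quadratic objective. The function name `momentum_descent`, the parameters `lr` and `beta`, and the toy `target` are all illustrative choices, not taken from the researchers quoted. In real SGD, `grad_fn` would return a mini-batch gradient rather than an exact one.

```python
import numpy as np

def momentum_descent(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Descent with an exponentially decayed running sum of past gradients."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        # Blend the accumulated history with the current gradient.
        velocity = beta * velocity + g
        # Step along the accumulated direction instead of the raw gradient.
        w -= lr * velocity
    return w

# Toy objective: f(w) = 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([3.0, -2.0])
w_final = momentum_descent(lambda w: w - target, w0=[0.0, 0.0])
print(w_final)
```

The accumulated history smooths out step-to-step noise in the gradient estimates, which is one reason this trick is often paired with stochastic updates.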
Advantages of Stochastic Gradient Descent for learning problems:
- According to a senior data scientist, one of the distinct advantages of Stochastic Gradient Descent is that each update is computed faster than in full (batch) gradient descent, because only part of the data is used. However, batch gradient descent, which follows the exact gradient, can still be the better approach when the dataset is small enough for full passes to be cheap.
- Computer scientists claim that performing one pass of SGD over a dataset is statistically (minimax) optimal. In other words, no other algorithm can achieve a better expected loss across all possible data distributions.
- Also, on massive datasets, stochastic gradient descent can converge faster because it performs updates more frequently. In addition, mini-batch training takes advantage of vectorised operations, processing the whole mini-batch at once instead of training on single data points.
- Facebook’s chief AI scientist emphasised that the reason behind the popularity of SGD is that it can process more examples within the available computation time.
- A lot of modern optimisation algorithms, such as RMSProp and Adam, are based on gradient descent, but the open question is whether they are superior to standard stochastic gradient descent.
- In particular, stochastic gradient descent delivers similar guarantees to empirical risk minimisation, which exactly minimises an empirical average of the loss on training data. So, for many learning problems, SGD is not really a “poor” optimisation procedure.
- In the context of large-scale learning, SGD has received considerable attention and is applied to text classification and natural language processing. Two key benefits of Stochastic Gradient Descent are efficiency and ease of implementation. When the data is sparse, SGD-based classifiers (for example, those in scikit-learn's SGD module) scale easily to problems with more than 10^5 training examples and more than 10^5 features.
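- Stochastic gradient descent is best suited for unconstrained optimisation problems. In contrast to batch gradient descent (BGD), SGD approximates the true gradient of the training error E(w, b), where w denotes the model weights and b the intercept, by considering a single training example at a time.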
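The contrast between single-example updates, vectorised mini-batches and the exact batch gradient can be sketched as follows. This is a minimal illustration and not a reference implementation: it assumes E(w, b) is half the mean squared error of a linear model, and the synthetic data, the function names `batch_gradient` and `sgd`, and the learning rates are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = X @ true_w + true_b + small noise (illustrative values).
n_samples, n_features = 1000, 3
X = rng.normal(size=(n_samples, n_features))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=n_samples)

def batch_gradient(w, b):
    """Exact gradient of E(w, b) = half the mean squared error over ALL examples."""
    residual = X @ w + b - y
    return X.T @ residual / n_samples, residual.mean()

def sgd(batch_size=1, epochs=5, lr=0.01):
    """Approximate the gradient from `batch_size` examples per update."""
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w + b - y[idx]  # vectorised over the mini-batch
            grad_w = X[idx].T @ residual / len(idx)
            grad_b = residual.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# batch_size=1 is the single-example case: 1000 updates per pass over the data.
w_single, b_single = sgd(batch_size=1, epochs=5, lr=0.01)
# Mini-batches of 32 make fewer, less noisy updates, each one vectorised.
w_mini, b_mini = sgd(batch_size=32, epochs=20, lr=0.1)

grad_w_at_end, grad_b_at_end = batch_gradient(w_mini, b_mini)
print(w_single, b_single)
print(w_mini, b_mini)
print(np.linalg.norm(grad_w_at_end))
```

Checking the exact batch gradient at the end is a simple way to confirm that the noisy updates settled near a point where the true gradient of E(w, b) is close to zero.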
The disadvantages of SGD include:
- SGD requires a number of hyperparameters, such as the regularisation parameter and the number of iterations
- It is also sensitive to feature scaling
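One way to see the sensitivity to feature scaling is to look at how badly conditioned the problem becomes when features live on very different scales. The sketch below uses the condition number of X^T X / n as a proxy, which is the curvature matrix of a least-squares loss. That choice of proxy, the synthetic features and the helper names `standardise` and `curvature_condition` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative features on very different scales.
X = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=500),
    rng.normal(loc=50.0, scale=1000.0, size=500),
])

def standardise(features):
    """Rescale each column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std

def curvature_condition(features):
    """Condition number of X^T X / n, the Hessian of a least-squares loss."""
    hessian = features.T @ features / len(features)
    return np.linalg.cond(hessian)

cond_raw = curvature_condition(X)
cond_scaled = curvature_condition(standardise(X))
print(cond_raw, cond_scaled)
```

A large condition number means no single learning rate suits every direction, so a step that is safe for the large-scale feature is far too small for the other. In practice the mean and standard deviation would be computed on the training data only and then reused, unchanged, for any test data.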
According to a paper by the University at Buffalo’s Department of Computer Science and Engineering, Stochastic Gradient Descent powers nearly all deep learning applications today. SGD is an extension of the gradient descent algorithm, and it is used to train models that must generalise beyond the training set. Furthermore, the paper states that outside of deep learning, SGD is the main way to train large linear models on very large datasets. With the exponential growth of interest in deep learning, which started in the academic world around 2006, SGD, thanks to its simplicity of implementation and its efficiency in dealing with large-scale datasets, has become by far the most common method for training deep neural networks and other large-scale models.