AI has already taken up considerable space across industries. Even if the general public does not think of the technology as prominent, it already powers many commercial applications. We often do not even notice that certain applications are driven by AI, yet they are part of everyday life.
As the technology grows, it has become essential for neural networks, the technology largely responsible for forecasting from data and one of the largest applications of AI, to be appropriately quantized.
Importance Of Quantization
Today, neural networks are deployed in a plethora of applications and need large amounts of data to work accurately. Neural networks have become the state-of-the-art approach for many large-scale computer vision and sequence modeling problems. Deep convolutional networks dominate the leaderboards for popular image classification and object detection datasets such as ImageNet and Microsoft COCO. They typically need hundreds of megabytes of memory to store their trainable floating-point parameters, and billions of floating-point operations to make a single inference. To achieve large memory savings, many efforts have been made to quantize neural networks while maintaining their performance.
Quantizing neural networks dates back to the 1990s. The initial motivation for quantizing neural networks was to make digital hardware implementation easier. The recent research interest in quantization, however, has emerged from the success of neural network applications. Deep learning has proven powerful on tasks including image classification, object detection, natural language processing, and many other areas.
As networks grow deeper to make more accurate predictions, their memory size becomes a problem. It is a problem for smartphones in particular, as more of them have begun to include these networks. These phones are generally equipped with 4GB of memory and are expected to support multiple applications at one time. Large neural network models leave phones short of memory, since running three or more models can occupy at least 1GB. Model size is not only a memory usage problem but also a memory bandwidth problem: the weights of the model are read for every prediction, and image-related applications usually need to process data in real time, which means at least 30 FPS. Even a simple model therefore requires large memory bandwidth. Running the network also taxes the device's memory, CPU and battery. All of these challenges make the quantization of neural networks a necessity.
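The memory and bandwidth argument above can be made concrete with a little arithmetic. The sketch below is only illustrative: the parameter count is a hypothetical figure, not a measurement of any particular model, and the bandwidth estimate assumes every weight is read from memory once per frame with no reuse from cache.

```python
def model_size_bytes(num_params, bits_per_param):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param // 8


def weight_traffic_bytes_per_second(size_bytes, frames_per_second):
    """Weight bytes read per second, assuming one full pass over the weights per frame."""
    return size_bytes * frames_per_second


# Hypothetical model with 60 million parameters (an assumption for illustration).
NUM_PARAMS = 60_000_000
FPS = 30  # the real-time target mentioned in the text

for bits in (32, 8):
    size = model_size_bytes(NUM_PARAMS, bits)
    traffic = weight_traffic_bytes_per_second(size, FPS)
    print(f"{bits}-bit weights: {size / 1e6:.0f} MB stored, "
          f"{traffic / 1e9:.1f} GB/s of weight reads at {FPS} FPS")
```

Under these assumptions, going from 32-bit to 8-bit weights cuts both the stored size and the per-second weight traffic by a factor of four.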
How Can Neural Networks Be Quantized?
The goal of quantization is to compress models with little or no effect on their performance. Doing this well draws on machine learning, computer architecture, and suitable hardware design. There are three components that can be quantized in a neural network: weights, activations, and gradients. The motivation and methods to quantize these components are different from each other.
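As a rough illustration of what compressing weights can look like, here is a minimal sketch of symmetric 8-bit quantization: each float weight is stored as an integer code in [-127, 127], together with one shared floating-point scale. The function names and the sample weights are made up for this example, and this is only one of many possible schemes, not a method prescribed by this article.

```python
def quantize_int8(weights):
    """Map float weights to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero weight list, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.41]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.3f}  code={c:+4d}  restored={r:+.3f}")
```

Each restored weight differs from the original by at most half of the scale, which is the price paid for storing one byte per weight instead of four.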
There are various methods for quantizing neural networks, but they can broadly be classified into two categories: deterministic quantization and stochastic quantization. Selecting a proper quantization technique is important.
1. Deterministic quantization: In deterministic quantization, each real value maps to exactly one quantized value. Because the quantization levels can be specified in advance, this method is well suited to running on dedicated hardware. It should therefore generally be preferred when quantizing for hardware acceleration, where it gives greater hardware performance.
2. Stochastic quantization: In stochastic quantization, the weights, activations or gradients are discretely distributed, and the quantized value is sampled from these discrete distributions. In this approach, the weights are assumed to be discretely distributed and a learning algorithm is used to infer the parameters of the distributions. This makes the quantized weights more interpretable than in deterministic quantization: the distributions of the weights can be examined to gain more insight into how the network works.
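The difference between the two categories can be sketched on a single value. In the example below, the deterministic function always rounds to the nearest level on a fixed grid, while the stochastic one samples between the two neighbouring levels. Note the assumption: the stochastic function uses simple random rounding, which is only one basic form of stochastic quantization and is not the distribution-learning approach described above. The function names and the grid spacing are hypothetical.

```python
import math
import random


def quantize_deterministic(x, step):
    """Always map x to the nearest multiple of step (halves round up)."""
    return math.floor(x / step + 0.5) * step


def quantize_stochastic(x, step, rng):
    """Sample one of the two grid levels around x.

    The upper level is chosen with probability equal to how far x sits
    above the lower level, so the result equals x on average.
    """
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower
    level = lower + (1 if rng.random() < prob_up else 0)
    return level * step


rng = random.Random(0)  # seeded so the run is repeatable
x, step = 0.3, 0.25     # hypothetical value and grid spacing

print("deterministic:", quantize_deterministic(x, step))
samples = [quantize_stochastic(x, step, rng) for _ in range(10_000)]
print("stochastic levels seen:", sorted(set(samples)))
print("stochastic mean:", sum(samples) / len(samples))
```

The deterministic result is identical on every call, which is what makes it easy to fix in hardware. The stochastic result varies from call to call, but its average stays close to the original value.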
Neural networks are gaining huge popularity today, and their applications are very powerful in the sphere of machine learning. Over the years they have grown to solve complex real-world problems. A small portion of the benefits achieved by using lower precision can be forfeited to increase the network size and therefore the accuracy.
Given the limited power budgets dedicated to these networks, low-power and low-memory solutions are important. A lot of recent research has gone into overcoming this challenge.