As machine learning (ML) and artificial intelligence (AI) pervade computing, the demand for better hardware resources is growing significantly. General-purpose hardware may be highly optimised, but it is not necessarily the best fit for ML applications.
With a plethora of options available in the market, from multi-core processors to large cloud-based databases, it is often tough to choose the ones that serve a given ML purpose. One hardware component that has gained popularity in recent times is the accelerator. Accelerators are a class of microprocessors designed specifically for AI and ML tasks.
In this article, we discuss a particular accelerator, developed by researchers at the Institute of Computing Technology (ICT), China, which is embedded on a powerful processor and has proven to be energy-efficient.
Accelerator And Its Design
Certain ML algorithms, such as convolutional neural networks and deep neural networks, are gradually being deployed across most self-learning applications. These algorithms require powerful computing resources to perform efficiently. Currently, accelerators such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) run ML algorithms built on complex neural networks.
However, these hardware components focus on implementing the algorithms rather than on the effect the algorithms have on memory and processing speed. Neural networks may grow in size and complexity as they are modified over time, which presents a computational challenge. This calls for a flexible accelerator design that can accommodate such changes in both scalability and efficiency, especially for algorithms that involve large neural networks.
Researchers at ICT kept all these factors in mind when designing a novel accelerator. Most importantly, the design delivers high performance in a small area (a microprocessor chip) while consuming less power and leaving a small energy footprint. Hence, the design focuses more on memory than on computation.
Using Processors For Design
Large neural networks (NNs) and similar ML algorithms typically generate heavy memory traffic as they run. To get the most performance out of these networks, accelerators need to be designed layer by layer. In their design study, Tianshi Chen and colleagues from ICT, China, consider processor-based implementations and apply locality analysis to every layer in the network. They benchmark four layers, CLASS1, CONV3, CONV5 and POOL5, and assess their impact on memory bandwidth. In the researchers’ words:
“We use a cache simulator plugged to a virtual computational structure on which we make no assumption except that it is capable of processing Tn neurons with Ti synapses each every cycle. The cache hierarchy is inspired by Intel Core i7: L1 is 32KB, 64-byte line, 8-way; the optional L2 is 2MB, 64-byte, 8-way. Unlike the Core i7, we assume the caches have enough banks/ports to serve Tn × 4 bytes for input neurons, and Tn×Ti ×4 bytes for synapses. For large Tn, Ti, the cost of such caches can be prohibitive, but it is only used for our limit study of locality and bandwidth.”
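The bandwidth figures in the quote can be worked through with a little arithmetic. The sketch below simply evaluates the two quoted expressions; the values Tn = Ti = 16 and the 1 GHz clock are illustrative assumptions chosen for this example, not settings taken from the passage above.

```python
def bytes_per_cycle(tn: int, ti: int, word_bytes: int = 4) -> dict:
    """Cache traffic per cycle implied by the quoted assumptions."""
    input_bytes = tn * word_bytes          # Tn x 4 bytes for input neurons (as quoted)
    synapse_bytes = tn * ti * word_bytes   # Tn x Ti x 4 bytes for synapses
    return {
        "inputs": input_bytes,
        "synapses": synapse_bytes,
        "total": input_bytes + synapse_bytes,
    }


def bandwidth_gb_per_s(bytes_each_cycle: int, clock_hz: float) -> float:
    """Convert per-cycle traffic into GB/s for a given clock rate."""
    return bytes_each_cycle * clock_hz / 1e9


# Hypothetical tile sizes and clock rate, for illustration only.
traffic = bytes_per_cycle(tn=16, ti=16)
bandwidth = bandwidth_gb_per_s(traffic["total"], clock_hz=1e9)
print(traffic)    # {'inputs': 64, 'synapses': 1024, 'total': 1088}
print(bandwidth)  # 1088.0
```

Under these assumed values, synapses account for almost all of the traffic, which is why the quote warns that caches able to serve it would be prohibitively expensive for large Tn and Ti.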
The analysis is repeated across three categories of NN layers: classifiers, convolutional layers and pooling layers. Convolutional layers fare best in terms of the balance between synapses and neurons relative to performance. Where a layer's synapses are unique and are not reused by other neurons, each one has to be fetched from memory, so such layers place a greater demand on memory bandwidth than layers that reuse their data.
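To make the idea of reuse concrete, here is a minimal sketch that counts how often each synapse and each input neuron is touched in a fully connected (classifier-style) layer. The layer sizes are hypothetical, and the loop is a plain software model rather than the study's cache simulator.

```python
from collections import Counter


def count_uses(n_inputs: int, n_outputs: int):
    """Walk a fully connected layer and count accesses per synapse and per input."""
    synapse_uses = Counter()
    input_uses = Counter()
    for n in range(n_outputs):          # every output neuron...
        for i in range(n_inputs):       # ...reads every input neuron
            synapse_uses[(n, i)] += 1   # weight connecting input i to output n
            input_uses[i] += 1
    return synapse_uses, input_uses


# Hypothetical layer: 4 input neurons, 3 output neurons.
syn, inp = count_uses(n_inputs=4, n_outputs=3)
print(len(syn))            # 12 distinct synapses
print(max(syn.values()))   # 1: each synapse is used exactly once
print(set(inp.values()))   # {3}: each input is reused once per output neuron
```

In this model every synapse is read once and never again, so caching it brings no benefit, while each input neuron is reused by every output neuron and is worth keeping close to the compute units.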
Accelerator In NNs
The NNs are implemented in hardware, mirroring the conceptual representation of these networks described earlier. The neurons map to logic circuits and the synapses to RAM. Such components are now integrated into embedded system applications for faster performance with lower power consumption. For larger and more complex NNs, buffers sit between the neurons to handle data control and temporary storage. These are in turn connected to a computational sub-system that computes neurons and synapses (the study refers to this as the Neural Functional Unit, together with the control logic).
The accelerator therefore consists of neurons, synapses, an input buffer (NBin) for input neurons, an output buffer (NBout) for output neurons, a buffer for synaptic weights (SB) and a computational sub-system. The typical accelerator architecture is given below:
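In software terms, the roles of these components can be sketched roughly as follows. This is an illustrative model only, not the hardware design from the study: the function name `run_layer`, the tile sizes `tn` and `ti`, and the choice of a sigmoid activation are all assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Activation function (an assumption; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


def run_layer(inputs, weights, tn=2, ti=2):
    """Compute one fully connected layer in tiles.

    weights[n][i] is the synapse from input neuron i to output neuron n.
    tn and ti are hypothetical tile sizes: output neurons and inputs per step.
    """
    n_out, n_in = len(weights), len(inputs)
    outputs = [0.0] * n_out
    for n0 in range(0, n_out, tn):
        # NBout: holds partial sums for the current tile of output neurons.
        nbout = [0.0] * min(tn, n_out - n0)
        for i0 in range(0, n_in, ti):
            nbin = inputs[i0:i0 + ti]                                # NBin: tile of input neurons
            sb = [row[i0:i0 + ti] for row in weights[n0:n0 + tn]]    # SB: matching synaptic weights
            # Computational sub-system: multiply inputs by weights and accumulate.
            for k, w_row in enumerate(sb):
                nbout[k] += sum(w * x for w, x in zip(w_row, nbin))
        # Apply the activation once all partial sums for this tile are complete.
        for k, acc in enumerate(nbout):
            outputs[n0 + k] = sigmoid(acc)
    return outputs


# Hypothetical 3-input, 3-output layer.
layer_out = run_layer([1.0, 2.0, 3.0],
                      [[0.0, 0.0, 0.0], [1.0, 1.0, -1.0], [0.5, 0.5, 0.5]])
print(layer_out)
```

The point of the buffers in this sketch is that only a small tile of inputs, weights and partial sums is live at any moment, which is what lets a large layer run through a small, fixed amount of on-chip storage.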
Finally, the design is evaluated with three tools: an accelerator simulator, CAD tools and a single instruction, multiple data (SIMD) computer. The first two are used to explore and simulate the accelerator architecture. The last is used to assess the accelerator's energy and memory behaviour by comparison. Compared with a 128-bit 2GHz SIMD core, the accelerator was reported to be over 100 times faster, with energy reduced by about 21 times.
Conclusion
The accelerator described here can be applied to a broader set of ML algorithms. All it needs is due diligence with respect to NN layers, storage structures and ML parameters. One point worth noting is that this accelerator achieved high throughput in a very small processor area. This suggests that, even as ML implementations grow bigger, innovative design can keep hardware complexity down.