Small-scale accelerators in machine learning: a brief overview

In this article, we will discuss a particular type of accelerator, developed by researchers at the Institute of Computing Technology (ICT), China, which fits on a small processor chip and has proven to be energy-efficient.

Accelerator And Its Design

A set of ML algorithms such as convolutional neural networks (CNNs) and deep neural networks (DNNs) is gradually being deployed across most self-learning applications. These algorithms require powerful computing resources to perform efficiently. Currently, accelerators such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) are used to compute ML algorithms with complex neural networks.
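To make the compute demand concrete, here is a minimal, illustrative sketch (our own, not taken from the study) of a single 2D convolution: every output neuron performs one multiply-accumulate per kernel weight, which is why large CNNs call for hardware acceleration.

```python
# Illustrative only: a naive "valid" 2D convolution. Each output value
# costs K*K multiply-accumulates, so compute grows quickly with kernel
# size and the number of feature maps.

def conv2d(image, kernel):
    """image: H x W list of lists; kernel: K x K list of lists."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            acc = 0.0
            for ki in range(K):            # K*K multiply-accumulates
                for kj in range(K):
                    acc += image[i + ki][j + kj] * kernel[ki][kj]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
edge = [[1, 0], [0, -1]]      # toy 2x2 difference kernel
print(conv2d(image, edge))    # -> [[-4.0, -4.0], [-4.0, -4.0]]
```

A real convolutional layer repeats this over many input and output feature maps, multiplying the cost further.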

However, these hardware components focus on the implementation of the algorithms rather than on the effect the algorithms have on memory traffic and processing speed.

ML algorithms such as neural networks tend to grow in size and become more complex as they are modified over time, which presents a computation challenge. This creates demand for a flexible accelerator design that can accommodate such changes, in terms of both scalability and efficiency, especially for algorithms that involve large neural networks.

Researchers at ICT kept all these factors in mind when designing a novel accelerator. Most importantly, the design delivers high performance in a small area (a fraction of a microprocessor chip) while consuming little power and leaving a small energy footprint. Hence, the focus of the design is on memory rather than computation.

Using Processors For Design

Large neural networks (NNs) and similar ML algorithms typically generate heavy memory traffic during operation. To get the most out of the hardware, accelerators must be designed with each network layer in mind. In the design study by Tianshi Chen and others from ICT, China, the researchers consider processor-based implementations and apply locality analysis to every layer in the network. They benchmark performance on four neural network layers, CLASS1, CONV3, CONV5 and POOL5, and assess the impact each has on memory bandwidth. In the researchers’ words:

“We use a cache simulator plugged to a virtual computational structure on which we make no assumption except that it is capable of processing Tn neurons with Ti synapses each every cycle. The cache hierarchy is inspired by Intel Core i7: L1 is 32KB, 64-byte line, 8-way; the optional L2 is 2MB, 64-byte, 8-way. Unlike the Core i7, we assume the caches have enough banks/ports to serve Tn × 4 bytes for input neurons, and Tn × Ti × 4 bytes for synapses. For large Tn, Ti, the cost of such caches can be prohibitive, but it is only used for our limit study of locality and bandwidth.”
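A rough back-of-envelope sketch, our own rather than the paper's, shows why such a limit study centres on synapse traffic: in a fully connected (classifier) layer every weight belongs to exactly one connection, so at 4 bytes per value the weight traffic dwarfs the neuron traffic.

```python
# Illustrative estimate (not from the paper): bytes moved for one
# fully connected layer, assuming 4-byte values and no reuse of weights.

def layer_traffic_bytes(n_in, n_out, word=4):
    neurons = (n_in + n_out) * word   # input + output neuron values
    synapses = n_in * n_out * word    # one unique weight per connection
    return neurons, synapses

# Hypothetical layer sizes, chosen only for illustration.
neurons, synapses = layer_traffic_bytes(n_in=2048, n_out=4096)
print(f"neuron traffic : {neurons / 1024:.0f} KB")    # 24 KB
print(f"synapse traffic: {synapses / 2**20:.0f} MB")  # 32 MB
```

Three orders of magnitude separate the two, which is why keeping synapses close to the compute units matters so much.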

The experiment is repeated across three categories of NN layers: classifier layers, convolutional layers and pooling layers. Convolutional layers fare optimally in terms of the balance between synapses and neurons relative to performance. Their synapses are unique and are not reused across neurons, so convolutional layers demand more memory bandwidth than the other layer types.

Accelerator In NNs

The NNs are implemented in hardware, matching the conceptual representation of these networks mentioned earlier: neurons map to logic circuits, and synapses map to RAM (memory). These components can be integrated into embedded applications for faster performance with lower power consumption. For larger and more complex NNs, buffers are placed between the neurons to handle data control and temporary storage. These are in turn connected to a computational sub-system that computes neurons and synapses (referred to in the study as the Neural Functional Unit, or NFU, and the control logic).

Therefore, the accelerator consists of storage for neurons and synapses: input (NBin) and output (NBout) buffers for input and output neurons respectively, a synaptic weight buffer (SB), and the computational sub-system. The typical accelerator architecture is shown below:

Figure: Accelerator architecture with direct memory access (DMA) – Image by Tianshi Chen
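The dataflow through these buffers can be sketched in software. This is a hypothetical model of our own: it borrows the paper's buffer names (NBin, SB, NBout) and tile sizes Tn, Ti, but the real design is fixed-function hardware, not Python.

```python
# Hypothetical sketch of the buffered dataflow: stage Tn input neurons
# into NBin and a Tn x Ti block of weights into SB, let the NFU perform
# the multiply-accumulates, then write results to NBout.

TN, TI = 2, 2   # tile sizes: output neurons per step, inputs per step

def nfu_step(nb_in, sb, partial):
    """One NFU step: Tn x Ti multiply-adds on the staged tile."""
    return [partial[n] + sum(sb[n][i] * nb_in[i] for i in range(TI))
            for n in range(TN)]

def classifier_layer(inputs, weights):
    """Fully connected layer computed tile by tile through the buffers."""
    n_out = len(weights)
    nb_out = [0.0] * n_out
    for n0 in range(0, n_out, TN):               # tile over output neurons
        partial = [0.0] * TN
        for i0 in range(0, len(inputs), TI):     # tile over input neurons
            nb_in = inputs[i0:i0 + TI]                             # load NBin
            sb = [weights[n0 + n][i0:i0 + TI] for n in range(TN)]  # load SB
            partial = nfu_step(nb_in, sb, partial)
        nb_out[n0:n0 + TN] = partial             # store NBout
    return nb_out

x = [1.0, 2.0, 3.0, 4.0]
w = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity
print(classifier_layer(x, w))    # -> [1.0, 2.0, 3.0, 4.0]
```

The tiling is the key design choice: only a small working set (one NBin slice and one SB block) must be resident at a time, which is what lets the hardware replace large caches with small dedicated buffers.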

The design was then evaluated with three tools: an accelerator simulator, CAD tools, and a single instruction, multiple data (SIMD) baseline. The first two are used to explore and simulate the accelerator architecture, while the SIMD system serves as the reference point for performance and energy. The accelerator was observed to be roughly 100 times faster than a 128-bit 2GHz SIMD core, while reducing energy by 21 times compared to a standard multi-core processor.

Conclusion

The accelerator described here can be applied to a broader set of ML algorithms; all it needs is due diligence with respect to NN layers, storage structures and ML parameters. One point worth noting is that this accelerator achieved high throughput in a very small processor area. This suggests that as ML implementations grow larger, innovations like this can keep hardware complexity in check.