Figure 5: A Conventional CPU Architecture
Figure 6: Temporal Architecture
Figure 7: Spatial Architecture
Deep neural networks incur significant computational overhead because they require large amounts of data for training. In addition to the forward pass, which produces the outputs from which new weights are derived during training, two further computation steps are required: error backpropagation and gradient computation.
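As a minimal sketch of these three steps (plain NumPy, with a hypothetical single-layer network and illustrative shapes rather than any specific framework):

```python
import numpy as np

# Hypothetical single-layer network: y = softmax(x @ W); all names are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))            # a mini-batch of 32 inputs
t = np.eye(10)[rng.integers(0, 10, 32)]       # one-hot targets
W = rng.standard_normal((128, 10)) * 0.01     # trainable weights

# 1) Forward pass: compute predictions (and the loss) from the current weights.
logits = x @ W
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
loss = -np.mean(np.sum(t * np.log(p + 1e-12), axis=1))

# 2) Error backpropagation: propagate the output error back through the layer.
d_logits = (p - t) / len(x)

# 3) Gradient computation and weight update (here, plain SGD).
dW = x.T @ d_logits
W -= 0.1 * dW
```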
AI applications may require many servers during training, and such setups tend to grow over time with development and deployment needs. Model development, for instance, may begin with only a few servers training an initial model on real data, while subsequent training runs demand far more capacity as the data grows (autonomous-driving models, for example, may require a large number of servers to reach acceptable accuracy in detecting obstacles). As training datasets and neural networks grow, a single accelerator may no longer be able to support the training. Because training requires the weights to be resynchronized across workers whenever the model parameters change, programmable switches can route data in different directions to resynchronize the weights almost instantly and thereby increase training speed; a simplified view of this synchronization step is sketched below.
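The sketch below simulates the synchronization step in plain NumPy: several workers hold gradients for the same parameter tensor, and an all-reduce-style averaging produces the common update each replica applies. It is illustrative only and does not model programmable switches or any particular interconnect.

```python
import numpy as np

# Illustrative only: three "workers", each holding a local gradient for the
# same parameter tensor after processing its own shard of the data.
rng = np.random.default_rng(1)
local_grads = [rng.standard_normal((128, 10)) for _ in range(3)]

def all_reduce_mean(grads):
    """Sum the per-worker gradients and average them, so every worker
    applies the same update and the replicas stay synchronized."""
    total = np.sum(grads, axis=0)
    return total / len(grads)

synced = all_reduce_mean(local_grads)
# Each worker would now apply the same update: W -= lr * synced
```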
Remote Direct Memory Access (RDMA)
RDMA is particularly useful for distributed computing and for training complex models: it allows one computer to access another computer's memory directly, without involving the operating system of either machine. A prominent example is Habana Labs, an Israel-based startup acquired by Intel, whose Goya and Gaudi accelerators support this mechanism.
AI models are compute-intensive and need the right AI hardware architecture and cores to perform thousands of matrix multiplications. The multiply-and-accumulate (MAC) operations that these multiplications break down into are a fundamental component of both convolutional and fully connected layers and can easily be parallelized using parallel computing paradigms. Because of the inherent limitations of general-purpose processing chips for AI workloads, discussed in the previous section, it is often necessary to design specialized chips whose novel architectures deliver improved performance. A few key aspects of compute are discussed below.
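To illustrate why MACs parallelize so readily, the following sketch (plain NumPy with illustrative shapes, not tied to any particular chip) computes a fully connected layer first as explicit MAC loops and then as a single matrix multiply:

```python
import numpy as np

# Illustrative shapes for a small fully connected layer.
rng = np.random.default_rng(2)
x = rng.standard_normal((4, 64))     # batch of 4 input activations
W = rng.standard_normal((64, 32))    # weights
y_loop = np.zeros((4, 32))

# A fully connected layer is nothing but independent MAC operations:
# every output element accumulates input * weight products.
for b in range(4):
    for o in range(32):
        acc = 0.0
        for i in range(64):
            acc += x[b, i] * W[i, o]   # one multiply-and-accumulate (MAC)
        y_loop[b, o] = acc

# Because the (b, o) accumulations are independent, they parallelize
# trivially; on CPUs, GPUs, and accelerators this becomes one matrix multiply.
y_vec = x @ W
assert np.allclose(y_loop, y_vec)
```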
AI applications have high memory-bandwidth requirements because the compute layers of deep neural networks demand that data be passed between cores as quickly as possible; these data-movement transactions can consume up to 95% of the energy spent on machine learning and AI. Memory is needed to store input data and weights, and to support various other functions, during both training and inference. Each MAC requires multiple memory reads (for the filter weight, the feature-map activation, and the partial sum) and at least one memory write. Consequently, every time a piece of data is moved from an energy-expensive memory level to a lower-cost one, it should be reused as much as possible to minimize subsequent accesses to the expensive levels.
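A back-of-the-envelope count (with illustrative layer dimensions chosen here, not taken from the text) shows how quickly memory accesses pile up when no data is reused:

```python
# Illustrative example: a 3x3 convolution with 64 input and 64 output
# channels applied to a 56x56 feature map.
C_in, C_out, K, H, W_ = 64, 64, 3, 56, 56

macs = C_out * H * W_ * C_in * K * K          # total multiply-accumulates
# Without any reuse, each MAC needs ~3 reads (weight, activation, partial sum)
# and 1 write of the partial sum.
naive_accesses = macs * 4

print(f"MACs: {macs:,}")                       # ~115 million
print(f"Naive memory accesses: {naive_accesses:,}")
# On-chip buffering and data reuse exist precisely to cut this number down,
# so most accesses hit cheap local memory instead of costly off-chip DRAM.
```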
On-Chip Memory
Accessing memory outside the chip takes more time than accessing memory on the same chip, so on-chip memory is seen as a viable alternative. There are already a few prominent uses of this approach in the market: Bristol-based Graphcore's Intelligence Processing Unit (IPU) architecture can hold an entire ML model inside the processor, while researchers at Stanford University have designed a DNN inference system, termed Illusion, that consists of eight networked computing chips, each containing a minimal amount of local on-chip memory along with mechanisms for quick wakeup and shutdown. The system makes these chips behave as though they were a single chip, hence the name Illusion.
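As a loose, hypothetical sketch of the general idea of spreading a model across several chips with small local memories (not the actual Illusion design), the following splits one layer's weights into eight shards, computes partial products per shard, and recombines them as if a single chip held the whole matrix:

```python
import numpy as np

# Hypothetical sketch: split a layer's weights across N chips, each with only
# a small local memory, so the ensemble behaves like one large on-chip store.
N_CHIPS = 8
rng = np.random.default_rng(3)
W = rng.standard_normal((1024, 256))          # full weight matrix
x = rng.standard_normal((1, 1024))            # one input activation vector

# Each "chip" keeps a contiguous slice of the rows of W in its local memory.
shards = np.array_split(W, N_CHIPS, axis=0)
x_shards = np.array_split(x, N_CHIPS, axis=1)

# Each chip computes a partial product against its resident shard; the
# partial results are then summed, as if one chip held the whole matrix.
partials = [xs @ ws for xs, ws in zip(x_shards, shards)]
y = np.sum(partials, axis=0)

assert np.allclose(y, x @ W)                  # same result as a single chip
```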