Quantization
Quantization covers various techniques that convert input values from a larger set to a reduced set of output values, thereby reducing the number of bits needed to represent the information. It typically involves converting floating-point numbers to integers – e.g., a 32-bit to an 8-bit representation – which helps reduce power consumption, memory bandwidth, and storage, and improves performance.
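To make the idea concrete, here is a minimal sketch of affine (asymmetric) quantization that maps 32-bit floats to unsigned 8-bit integers. The function names and the specific scheme are illustrative, not taken from the article:

```python
def quantize(values, num_bits=8):
    """Affine quantization: map floats onto the integer range [0, 2^num_bits - 1].

    Illustrative sketch, not a specific library's API.
    Returns the quantized integers plus the (scale, zero_point) needed
    to map them back to real values.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a constant input
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate real values from the quantized integers."""
    return [(q - zero_point) * scale for q in quantized]
```

Because 8 bits can only represent 256 distinct levels, the round trip is lossy: each recovered value differs from the original by at most about one quantization step (`scale`).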
DSD: Dense-Sparse-Dense Training for Deep Neural Networks
Based on further research conducted at Tsinghua University, the authors propose a dynamic-precision data quantization flow and compare it with static-precision quantization strategies. The proposed quantization comprises a two-step process:
- Weight Quantization: Analyzes the dynamic range of the weights in each layer before narrowing them down to an optimal value
- Data Quantization: Uses a greedy algorithm to compare the intermediate data of the fixed-point CNN model and the floating-point CNN model, layer by layer, with the goal of minimizing the accuracy loss
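A simplified sketch of the first step helps show what "analyzing the dynamic range" means in practice: for a layer's weights, try each candidate fractional bit-width of a fixed-point format and keep the one that minimizes quantization error. All names here are hypothetical; the actual flow in the paper operates on full CNN layers and also searches per-layer formats for the intermediate data:

```python
def to_fixed(values, frac_bits, word_len=8):
    """Round values to a fixed-point format with `frac_bits` fractional bits,
    saturating at the limits of a signed `word_len`-bit word, then map back
    to real values so the error can be measured."""
    scale = 2 ** frac_bits
    qmax = 2 ** (word_len - 1) - 1
    qmin = -(2 ** (word_len - 1))
    return [max(qmin, min(qmax, round(v * scale))) / scale for v in values]

def best_frac_bits(values, word_len=8):
    """Per-layer search: pick the fractional length that minimizes the
    squared quantization error over the layer's values (hypothetical
    stand-in for the paper's dynamic-range analysis)."""
    def err(fb):
        return sum((v - q) ** 2 for v, q in zip(values, to_fixed(values, fb, word_len)))
    return min(range(word_len), key=err)
```

Because the format is chosen per layer, a layer with small weights can spend more bits on the fraction, while a layer with a wide dynamic range keeps more integer bits; this is the "dynamic precision" contrasted with a single static format for the whole network.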