Recently, Elon Musk demonstrated Tesla’s Full Self-Driving (FSD) Computer, which has been designed to process 2100 frames per second. The computer runs video and audio through a neural net and outputs the commands the vehicle should follow. The processor then validates these instructions against real-time feedback to make sure the vehicle is sticking to them.

The case with Google is quite similar: it launched the Tensor Processing Unit (TPU) two years ago. TPUs are designed to make neural net training and inference orders of magnitude faster.

The question is: why not simply use multiple GPUs rather than design a new processor from scratch, which involves expensive R&D?

There are two primary reasons –

  1. Efficiency – Maximise the number of operations per second the processor can perform while reducing power consumption. This lowers the overall cost of training and deploying neural networks in production systems. As per Google, TPUs can cut these costs to 1/5th of today’s, which amounts to millions of dollars in savings.
  2. Speed – Execute operations as fast as possible. In self-driving vehicles, for example, safety is of utmost importance, so you can’t afford dropped data or delayed results. The current architecture lets Tesla process 2100 frames per second, which is huge.

Let’s see what has been done to achieve these two goals.

Fast Operations

In a neural net, roughly 99% of the operations are multiplications or additions, because all the transformations in a neural net are done on matrices. The core idea behind these processors is to make that 99% of operations extremely fast while sacrificing general-purpose features such as branch prediction, multithreading and so on.
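
To see why, here is a minimal sketch of a single dense layer in NumPy (the layer sizes are arbitrary choices for illustration). Virtually all of the arithmetic happens inside one matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256))   # weight matrix
x = rng.standard_normal(256)          # input vector
b = rng.standard_normal(128)          # bias

# One dense layer: a matrix product, a bias add, then an activation.
# W @ x alone performs 128 * 256 = 32,768 multiplications and nearly
# as many additions; this is where almost all the work is.
y = np.maximum(W @ x + b, 0.0)        # ReLU activation

print(y.shape)  # (128,)
```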

A typical CPU core can perform only a handful of multiplications per clock cycle, even when using vector instructions such as AVX2 or SSE.

A GPU can perform thousands of multiplications per cycle thanks to its far larger number of cores, but each individual core is still inefficient: it has to store and retrieve data from registers, and it runs at a lower clock speed than a CPU core.

Neural net processors overcome this challenge by replacing the CPU’s ALU with a Matrix Multiplication Unit (MMU). The MMU typically contains tens of thousands of integer multipliers and adders, each connected directly to its neighbours in a grid known as a systolic array. The output of one unit flows straight into the next without any intermediate access to memory.
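
To build intuition for this data flow, here is a toy software simulation of an output-stationary systolic array. This is a simplified model of my own for illustration, not Google’s or Tesla’s actual design: values enter at the array’s edges, hop one cell per cycle, and every cell performs one multiply-add per cycle without touching memory.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: A's rows stream in from
    the left, B's columns from the top, and each cell (i, j)
    accumulates its own entry of the product as data hops past it."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))      # one accumulator per cell
    a_reg = np.zeros((n, m))    # values currently held, flowing right
    b_reg = np.zeros((n, m))    # values currently held, flowing down
    for t in range(n + m + k - 2):      # fill and drain the pipeline
        a_reg[:, 1:] = a_reg[:, :-1]    # shift right to the neighbour
        b_reg[1:, :] = b_reg[:-1, :]    # shift down to the neighbour
        for i in range(n):              # feed skewed inputs at the left edge
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):              # feed skewed inputs at the top edge
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
        acc += a_reg * b_reg            # every cell multiplies and adds
    return acc

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((6, 5))
assert np.allclose(systolic_matmul(A, B), A @ B)
```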

The obvious drawback of this design is that it cannot perform other kinds of operations effectively, which is why these processors are built to run only neural nets.

No Floating-Point Calculations

Floating-point calculations require more complex hardware, which consumes more power and chip area. In neural nets we are generally fine with approximate results that need not be accurate to many decimal places. For example, knowing that the expected time to destination in Google Maps is 14 minutes is fine; you don’t need to compute 14.146 minutes.

Both Google and Tesla achieve this by using 8-bit integer multipliers and adders (with wider integer accumulators) instead of floating-point units. Because integer units need far less chip space, a basic Google TPU contains 65,536 multipliers and Tesla’s processor contains 9,216 multipliers/adders. This allows them to perform tens of thousands of multiplications in a single clock cycle.
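
Here is a rough sketch of 8-bit quantized arithmetic in NumPy. The symmetric scaling scheme below is a common simplification, not necessarily the exact scheme either chip uses: multiply in 8-bit integers, accumulate in 32 bits, and convert back to float only once at the end.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Map floats onto signed 8-bit integers with one scale factor
    (a simplified symmetric quantization scheme)."""
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)   # e.g. / 127
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

W_q, w_scale = quantize(W)
x_q, x_scale = quantize(x)

# The values fit in 8 bits; the matmul is carried out in int32 here so
# NumPy doesn't overflow, mirroring hardware that pairs 8-bit
# multipliers with 32-bit accumulators.
acc = W_q.astype(np.int32) @ x_q.astype(np.int32)
y_approx = acc * (w_scale * x_scale)        # rescale to float at the end

print(np.max(np.abs(W @ x - y_approx)))     # small quantization error
```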

Hardwired Functions

Another key aspect of a neural network is its activation functions. Without activation functions, a neural net is just a stack of linear transformations and will fail to model anything beyond linear relationships.

Some activation functions, such as Sigmoid and Tanh, are expensive to compute because they involve exponentials. Both of these processors hardwire popular activation functions, including ReLU, Sigmoid and Tanh, so that they are computed blazingly fast.
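
One common way hardware makes the expensive ones cheap is a small precomputed lookup table in place of the exponential. Here is a software analogue of that idea, where the table size and input range are my own illustrative choices:

```python
import numpy as np

# Precompute sigmoid once over a fixed input range; at run time the
# "activation" is just an index into this table, no exponentials.
TABLE_SIZE = 256
X_MIN, X_MAX = -8.0, 8.0
_grid = np.linspace(X_MIN, X_MAX, TABLE_SIZE)
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def sigmoid_lut(x):
    """Approximate sigmoid by indexing into the precomputed table."""
    pos = (x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1)
    idx = np.clip(pos, 0, TABLE_SIZE - 1).astype(int)
    return SIGMOID_TABLE[idx]

x = np.linspace(-10, 10, 5)
print(sigmoid_lut(x))               # table approximation
print(1.0 / (1.0 + np.exp(-x)))     # exact values for comparison
```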

Complex Instructions

These processors use CISC-style instructions, like Intel/AMD CPUs. However, the instructions encode much more complex workloads, such as convolving a matrix or applying an activation function, rather than simple tasks such as add, load and store. This lets complicated tasks, such as adjusting the weights of a neural net layer during training, be expressed in only a few instructions.
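
As a conceptual sketch (this is not either chip’s real instruction set), imagine a tiny interpreter where every instruction operates on whole tensors; an entire dense layer then takes just three instructions, where a general-purpose CPU would loop over millions of scalar loads, multiplies, adds and stores:

```python
import numpy as np

def run(program, regs):
    """Execute CISC-style instructions that each do whole-tensor work."""
    for op, dst, *args in program:
        if op == "MATMUL":          # one instruction = a full matrix product
            regs[dst] = regs[args[0]] @ regs[args[1]]
        elif op == "ADD":
            regs[dst] = regs[args[0]] + regs[args[1]]
        elif op == "RELU":          # hardwired activation, as above
            regs[dst] = np.maximum(regs[args[0]], 0.0)
    return regs

rng = np.random.default_rng(0)
regs = {"W": rng.standard_normal((32, 64)),
        "x": rng.standard_normal(64),
        "b": rng.standard_normal(32)}

# An entire dense layer in three instructions.
program = [("MATMUL", "h", "W", "x"),
           ("ADD",    "h", "h", "b"),
           ("RELU",   "y", "h")]
y = run(program, regs)["y"]
print(y.shape)  # (32,)
```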

To read more on this topic, you can refer to the following sources: