
What Is The Secret Sauce Behind Google TPUs' High Performance?

Google raked in over $8 billion in revenue last quarter, and now it wants to scale up its services in India by tapping into the public sector.

Back in April, CEO Thomas Kurian described Google Cloud’s new services targeting different industries, including media and entertainment, healthcare, retail, financial services, and the public sector. According to NASSCOM, the cloud market in India is likely to soar to $7.1 billion by 2022, driven by developmental leaps in Big Data analytics, AI, ML and IoT.

As enterprises break monoliths apart and start modernising services, they need solutions for consistent service and traffic management at scale. Organisations want to invest time and resources in building applications and innovating, not in the infrastructure and networking required to deploy and manage these services. With machine learning currently the hottest choice among enterprises, companies like Google, which offer their technology as cloud services, are leaving no stone unturned to notch up their infrastructure to meet the demands of the future. Their Cloud TPUs stand as testimony to these never-ending efforts.

Overview Of TPUs

via Google Cloud blog

Cloud TPU is designed to run cutting-edge machine learning models with AI services on Google Cloud. Its custom high-speed network offers over 100 petaflops of performance in a single pod — enough computational power to transform a business or create the next research breakthrough.

The second- and third-generation TPU chips are available to Google Cloud customers as Cloud TPUs. They deliver up to 420 teraflops per Cloud TPU device and more than 100 petaflops in a full Cloud TPU v3 Pod. Cloud TPUs achieve this high performance by uniting a well-established hardware architecture — the “systolic array” — with an innovative floating-point format.

FLOPS (floating-point operations per second) is a unit of measure of the performance of a computational device. A processor with a higher FLOPS rating is considered more powerful.

How this compute budget is allocated across the operations of a neural network largely determines the time taken for training and other such fundamental operations.
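To put the throughput figures above in perspective, here is a rough back-of-the-envelope sketch (ours, with an illustrative matrix size) of the ideal time for a single large matrix multiplication at the peak rate quoted for a Cloud TPU v3 device:

```python
# Back-of-the-envelope: ideal time for one dense matrix multiplication
# at a given peak throughput. The matrix size is an illustrative assumption.
n = 16_384                       # multiply two n x n matrices
flops_needed = 2 * n ** 3        # a dense matmul costs roughly 2*n^3 floating-point operations
peak_flops_per_sec = 420e12      # ~420 teraflops, the Cloud TPU v3 device figure quoted above

print(f"ideal time: {flops_needed / peak_flops_per_sec * 1e3:.1f} ms")  # ~20.9 ms
```

Real workloads never hit the theoretical peak, but the calculation shows why raw FLOPS matter so much for matrix-heavy neural network training.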

In the case of Google TPUs, the custom floating-point format is called “Brain Floating Point Format,” or “bfloat16” for short. The name comes from “Google Brain”, the artificial intelligence research group at Google where the idea for this format was conceived. Bfloat16 is carefully used within systolic arrays to accelerate matrix multiplication operations on Cloud TPUs.
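From the framework side this is largely invisible; the sketch below (ours, using TensorFlow as an assumed front end) simply casts the operands of a matrix multiplication to bfloat16. On a Cloud TPU this is the kind of operation that gets mapped onto the systolic matrix units, while on a CPU or GPU the snippet only demonstrates the dtype plumbing:

```python
import tensorflow as tf

# Cast matmul operands to bfloat16. On a Cloud TPU an operation like this
# is executed on the systolic matrix units; elsewhere it simply shows the
# reduced-precision dtype in action.
a = tf.cast(tf.random.normal([128, 256]), tf.bfloat16)
b = tf.cast(tf.random.normal([256, 64]), tf.bfloat16)

c = tf.matmul(a, b)
print(c.shape, c.dtype)   # (128, 64) <dtype: 'bfloat16'>
```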

What Edge Does Bfloat16 Give To TPUs?

Bfloat16 is a custom 16-bit floating-point format for machine learning, composed of one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-standard IEEE 16-bit floating-point format (FP16), which was not designed with deep learning applications in mind.
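Because bfloat16 keeps the same eight exponent bits as float32 and drops the lower 16 mantissa bits, a float32 value can be converted simply by truncating its bit pattern. Here is a minimal pure-Python sketch (ours) of that bit layout:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of a float32: 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # raw float32 bit pattern
    return bits >> 16                                     # truncate the low 16 mantissa bits

def bfloat16_bits_to_float32(b: int) -> float:
    """Re-expand a bfloat16 bit pattern to float32 by zero-filling the dropped bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

b = float32_to_bfloat16_bits(3.14159265)
print(f"{b:016b}")                    # 0 10000000 1001001  (sign | exponent | mantissa)
print(bfloat16_bits_to_float32(b))    # 3.140625 -- close, but with only 7 mantissa bits of precision
```

The wide eight-bit exponent gives bfloat16 the same dynamic range as float32, which is exactly what makes it friendlier to deep learning than IEEE FP16, whose five-bit exponent overflows and underflows more easily.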

Here are a few noticeable improvements achieved with Bfloat16:

  • Storing values in bfloat16 format saves on-chip memory, making 8 GB of memory per core feel more like 16 GB, and 16 GB feel more like 32 GB (see the sketch after this list). 
  • More extensive use of bfloat16 enables Cloud TPUs to train models that are deeper, wider or have larger inputs. And since larger models often lead to higher accuracy, this improves the ultimate quality of the products that depend on them. 
  • Better compiler trade-offs between compute and memory saving can be achieved, resulting in performance improvements for large models. 
  • Storing operands and outputs of those ops in the bfloat16 format reduces the amount of data that must be transferred, improving speed.
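The memory point is easy to verify from a framework. A minimal sketch (ours, assuming TensorFlow as the front end) compares the storage of the same tensor in float32 and bfloat16:

```python
import tensorflow as tf

# The same activation tensor stored in float32 (4 bytes/element) vs bfloat16 (2 bytes/element).
x32 = tf.random.normal([1024, 1024], dtype=tf.float32)
x16 = tf.cast(x32, tf.bfloat16)

bytes32 = int(tf.size(x32)) * x32.dtype.size
bytes16 = int(tf.size(x16)) * x16.dtype.size
print(f"float32: {bytes32 / 2**20:.1f} MiB, bfloat16: {bytes16 / 2**20:.1f} MiB")  # 4.0 MiB vs 2.0 MiB

# Halving the bytes per value is what makes 8 GB of on-chip memory
# behave more like 16 GB for values kept in bfloat16.
```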

How ML Can Benefit

via Google Cloud blog

Growing the size of a neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. Techniques have been developed in the past to train deep neural networks using half-precision floating-point numbers, where the weights, activations and gradients are stored in IEEE half-precision format.

With bfloat16 too, there is a choice of how each class of values is represented: weights (parameters), activations, and gradients.

The team at Google Cloud claims that some models are even more permissive, and in these cases representing both activations and weights in bfloat16 still leads to peak accuracy. As a general recipe, though, the developers recommend keeping weights and gradients in FP32 but converting activations to bfloat16, and advise ML practitioners to run an occasional baseline using FP32 for weights, gradients, and activations to ensure that the model behaviour is comparable.

It is believed that support for mixed-precision training throughout the TPU software stack allows for seamless conversion between the formats, and can make these conversions transparent to the ML practitioner.
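As a concrete illustration of that recipe, here is a minimal sketch (ours, assuming TensorFlow 2.x and the Keras mixed-precision API) in which computation runs in bfloat16 while the trainable weights stay in float32, matching the recommendation above:

```python
import tensorflow as tf

# Compute in bfloat16, keep trainable variables (master weights) in float32.
# On a Cloud TPU the casts are handled by the stack and are transparent to the user.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

print(model.layers[0].compute_dtype)   # bfloat16 -- activations/compute
print(model.layers[0].kernel.dtype)    # float32  -- weights kept in full precision

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would run the forward/backward passes in bfloat16 while the
# optimizer updates the float32 weights. Rerunning with the default float32
# policy gives the occasional full-precision baseline mentioned above.
```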



Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
