
What Is The Secret Sauce Behind Google TPUs' High Performance?

Google raked in over $8 billion in revenue last quarter, and now it wants to scale up its services in India by tapping into the public sector.

Back in April, CEO Thomas Kurian described Google Cloud’s new services targeting different industries, including media and entertainment, healthcare, retail, financial services, and the public sector. According to NASSCOM, the cloud market in India is likely to soar to $7.1 billion by 2022, driven by developmental leaps in Big Data analytics, AI, ML and IoT.

As enterprises break monoliths apart and start modernising services, they need solutions for consistent service and traffic management at scale. Organisations want to invest time and resources in building applications and innovating, not in the infrastructure and networking required to deploy and manage these services. With machine learning currently the hottest choice among enterprises, companies like Google, which offer their technology as cloud services, are leaving no stone unturned to notch up their infrastructure to meet the demands of the future. Their Cloud TPUs stand as testimony to these never-ending efforts.

Overview Of TPUs

via Google Cloud blog

Cloud TPU is designed to run cutting-edge machine learning models with AI services on Google Cloud. Its custom high-speed network offers over 100 petaflops of performance in a single pod — enough computational power to transform a business or create the next research breakthrough.

The second- and third-generation TPU chips are available to Google Cloud customers as Cloud TPUs. They deliver up to 420 teraflops per Cloud TPU device and more than 100 petaflops in a full Cloud TPU v3 Pod. Cloud TPUs achieve this high performance by uniting a well-established hardware architecture — the “systolic array” — with an innovative floating-point format.

FLOPS (floating-point operations per second) is a unit of measure of the performance of a computational device. A processor with a higher FLOPS rating is considered more powerful.

How this compute budget is allocated across the operations of a neural network largely determines the time taken for training and other such fundamental operations.
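To put the throughput figures above in perspective, here is a rough back-of-the-envelope sketch (ours, with an illustrative matrix size) of the ideal time for a single large matrix multiplication at the peak rate quoted for a Cloud TPU v3 device:

```python
# Back-of-the-envelope: ideal time for one dense matrix multiplication
# at a given peak throughput. The matrix size is an illustrative assumption.
n = 16_384                       # multiply two n x n matrices
flops_needed = 2 * n ** 3        # a dense matmul costs roughly 2*n^3 floating-point operations
peak_flops_per_sec = 420e12      # ~420 teraflops, the Cloud TPU v3 device figure quoted above

print(f"ideal time: {flops_needed / peak_flops_per_sec * 1e3:.1f} ms")  # ~20.9 ms
```

Real workloads never hit the theoretical peak, but the calculation shows why raw FLOPS matter so much for matrix-heavy neural network training.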

In the case of Google TPUs, the custom floating-point format is called “Brain Floating Point Format,” or “bfloat16” for short. The name comes from “Google Brain”, the artificial intelligence research group at Google where the idea for this format was conceived. Bfloat16 is carefully used within systolic arrays to accelerate matrix multiplication operations on Cloud TPUs.
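From the framework side this is largely invisible; the sketch below (ours, using TensorFlow as an assumed front end) simply casts the operands of a matrix multiplication to bfloat16. On a Cloud TPU this is the kind of operation that gets mapped onto the systolic matrix units, while on a CPU or GPU the snippet only demonstrates the dtype plumbing:

```python
import tensorflow as tf

# Cast matmul operands to bfloat16. On a Cloud TPU an operation like this
# is executed on the systolic matrix units; elsewhere it simply shows the
# reduced-precision dtype in action.
a = tf.cast(tf.random.normal([128, 256]), tf.bfloat16)
b = tf.cast(tf.random.normal([256, 64]), tf.bfloat16)

c = tf.matmul(a, b)
print(c.shape, c.dtype)   # (128, 64) <dtype: 'bfloat16'>
```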

What Edge Does Bfloat16 Give To TPUs?

Bfloat16 is a custom 16-bit floating-point format for machine learning, composed of one sign bit, eight exponent bits, and seven mantissa bits. This is different from the industry-standard IEEE 16-bit floating-point format (FP16), which was not designed with deep learning applications in mind.
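Because bfloat16 keeps the same eight exponent bits as float32 and drops the lower 16 mantissa bits, a float32 value can be converted simply by truncating its bit pattern. Here is a minimal pure-Python sketch (ours) of that bit layout:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of a float32: 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # raw float32 bit pattern
    return bits >> 16                                     # truncate the low 16 mantissa bits

def bfloat16_bits_to_float32(b: int) -> float:
    """Re-expand a bfloat16 bit pattern to float32 by zero-filling the dropped bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

b = float32_to_bfloat16_bits(3.14159265)
print(f"{b:016b}")                    # 0 10000000 1001001  (sign | exponent | mantissa)
print(bfloat16_bits_to_float32(b))    # 3.140625 -- close, but with only 7 mantissa bits of precision
```

The wide eight-bit exponent gives bfloat16 the same dynamic range as float32, which is exactly what makes it friendlier to deep learning than IEEE FP16, whose five-bit exponent overflows and underflows more easily.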

Here are a few noticeable improvements achieved with Bfloat16:

  • Storing values in bfloat16 format saves on-chip memory, making 8 GB of memory per core feel more like 16 GB, and 16 GB feel more like 32 GB (see the sketch after this list). 
  • More extensive use of bfloat16 enables Cloud TPUs to train models that are deeper, wider or have larger inputs. And since larger models often lead to higher accuracy, this improves the ultimate quality of the products that depend on them. 
  • Better compiler trade-offs between compute and memory saving can be achieved, resulting in performance improvements for large models. 
  • Storing operands and outputs of those ops in the bfloat16 format reduces the amount of data that must be transferred, improving speed.
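The memory point is easy to verify from a framework. A minimal sketch (ours, assuming TensorFlow as the front end) compares the storage of the same tensor in float32 and bfloat16:

```python
import tensorflow as tf

# The same activation tensor stored in float32 (4 bytes/element) vs bfloat16 (2 bytes/element).
x32 = tf.random.normal([1024, 1024], dtype=tf.float32)
x16 = tf.cast(x32, tf.bfloat16)

bytes32 = int(tf.size(x32)) * x32.dtype.size
bytes16 = int(tf.size(x16)) * x16.dtype.size
print(f"float32: {bytes32 / 2**20:.1f} MiB, bfloat16: {bytes16 / 2**20:.1f} MiB")  # 4.0 MiB vs 2.0 MiB

# Halving the bytes per value is what makes 8 GB of on-chip memory
# behave more like 16 GB for values kept in bfloat16.
```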

How ML Can Benefit

via Google Cloud blog

Growing the size of a neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. Techniques have been developed in the past to train deep neural networks using half-precision floating-point numbers, where the weights, activations and gradients are stored in IEEE half-precision format.

With bfloat16 too, there is a choice of how each class of values is represented: weights (parameters), activations, and gradients.

The team at Google Cloud claims that some models are even more permissive, and in these cases representing both activations and weights in bfloat16 still leads to peak accuracy. As a general recipe, though, the developers recommend keeping weights and gradients in FP32 but converting activations to bfloat16, and advise ML practitioners to run an occasional baseline using FP32 for weights, gradients, and activations to ensure that the model behaviour is comparable.

It is believed that support for mixed-precision training throughout the TPU software stack allows for seamless conversion between the formats, and can make these conversions transparent to the ML practitioner.
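As a concrete illustration of that recipe, here is a minimal sketch (ours, assuming TensorFlow 2.x and the Keras mixed-precision API) in which computation runs in bfloat16 while the trainable weights stay in float32, matching the recommendation above:

```python
import tensorflow as tf

# Compute in bfloat16, keep trainable variables (master weights) in float32.
# On a Cloud TPU the casts are handled by the stack and are transparent to the user.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

print(model.layers[0].compute_dtype)   # bfloat16 -- activations/compute
print(model.layers[0].kernel.dtype)    # float32  -- weights kept in full precision

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would run the forward/backward passes in bfloat16 while the
# optimizer updates the float32 weights. Rerunning with the default float32
# policy gives the occasional full-precision baseline mentioned above.
```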



Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
