Alphabet Inc.’s Google has released new details about the supercomputers it uses to train its AI models, saying they are both faster and more power-efficient than comparable systems from Nvidia Corp. Google relies on its custom-designed Tensor Processing Units (TPUs) for more than 90% of its artificial intelligence training work: the process of feeding data through models so they can handle tasks such as responding to queries with human-like text or generating images. The TPU is now in its fourth generation, and in a scientific paper published on Tuesday Google detailed how it has strung more than 4,000 of the chips together into a supercomputer, using its own custom-developed optical switches to connect the individual machines.
How machines are connected has become a key point of competition among companies that build AI supercomputers, as the language models that power technologies like Google’s Bard or OpenAI’s ChatGPT have exploded in size. Such models are far too large to fit on a single chip; they must instead be split across thousands of chips, which then have to work together for weeks or more to train the model. PaLM, Google’s largest publicly disclosed language model to date, was trained over 50 days by splitting it across two of the 4,000-chip supercomputers.
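What splitting a model across many chips looks like can be sketched in a few lines of JAX, the framework Google itself uses to program TPUs. The snippet below is a minimal illustration only, not Google’s training code: the mesh layout, array sizes, and axis names are invented for the example, and on a real TPU pod the same program would simply span far more devices.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever accelerators are available into a logical 2D mesh.
# On a TPU pod this grid would span thousands of chips; here the shape
# just has to multiply out to the number of attached devices.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# A weight matrix too large for one chip is split column-wise along the
# "model" axis; the activations are split batch-wise along the "data" axis.
w = jax.device_put(jnp.zeros((8192, 8192)),
                   NamedSharding(mesh, P(None, "model")))
x = jax.device_put(jnp.ones((64, 8192)),
                   NamedSharding(mesh, P("data", None)))

# jit compiles the matrix multiply; the XLA compiler inserts whatever
# cross-chip communication is needed to combine the per-device shards.
y = jax.jit(jnp.dot)(x, w)
print(y.sharding)
```

Keeping the shards busy requires constant communication between chips, which is why the interconnect, rather than any single chip, so often decides overall training speed.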
Google said its supercomputers make it easy to reconfigure the connections between chips on the fly, which helps it work around failures and tune the system for performance gains: the circuit switching makes it easy to route around failed components, and the flexibility even allows Google to change the topology of the supercomputer’s interconnect to accelerate a given machine learning (ML) model.
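At the software level, that kind of reshaping can be pictured with JAX device meshes. The sketch below is only an analogy to the hardware reconfiguration Google describes: the optical switches rewire physical links, whereas this code merely regroups the same devices into differently shaped logical grids. The eight-device assumption and the mesh shapes are invented for illustration.

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Assume eight attached accelerators (hypothetical; adjust the shapes to
# the actual device count). The same chips can be grouped into differently
# shaped logical meshes, loosely mirroring how Google's optical switches
# rearrange the physical interconnect topology.
devices = np.array(jax.devices())

flat_grid = Mesh(devices.reshape(4, 2), axis_names=("x", "y"))
cube = Mesh(devices.reshape(2, 2, 2), axis_names=("x", "y", "z"))

# A model dominated by all-to-all communication might run better on the
# cube-like layout, which shortens the longest path between chips, while
# mostly neighbour-to-neighbour traffic may favour the flatter grid.
print(dict(flat_grid.shape), dict(cube.shape))
```

In hardware, of course, the gain comes from physically rewiring the links between machines rather than relabelling them in software.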
Although Google only recently released details about its supercomputer, the system has been online inside the company since 2020, in a data centre in Mayes County, Oklahoma. Google said the startup Midjourney used the system to train its model, which generates fresh images from a few words of text. In the scientific paper, Google said that for comparably sized systems its supercomputer is up to 1.7 times faster and up to 1.9 times more power-efficient than a system based on Nvidia’s A100 chip, which was on the market at the same time as the fourth-generation TPU. However, Google said it did not compare its fourth generation against Nvidia’s current flagship H100 chip, because the H100 came to market after Google’s chip and is built with newer technology.