Meta Shows off AI Infrastructure but Nvidia GPU Delay Threatens Progress

By James Morales
Published August 6, 2024 2:15 PM
Verified by Insha Zia

Key Takeaways

  • Meta has revealed the novel data center network design it uses to optimize AI training.
  • The new approach makes GPU clusters more efficient and scalable.
  • Nvidia has reportedly delayed the launch of its Blackwell range of AI chips.

One consequence of the Big Tech AI boom of the 2020s has been the accelerated development of AI infrastructure. 

With Silicon Valley giants spending billions to build ever-faster, more powerful computers for training their models, the latest GPU superclusters consist of tens of thousands of processors. Meanwhile, new ways of connecting them continue to emerge. 

Meta’s New AI Network Design

In a research blog post published on Monday, Aug. 5, Meta engineers described a novel network topology they have developed to optimize AI training.

Pointing out that traditional data center networking infrastructure is ill-suited to large AI workloads, which require coordination between tens of thousands of GPUs, the researchers designed an alternative approach that uses RoCEv2 (RDMA over Converged Ethernet version 2) as the inter-node data transport standard.
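Meta's post doesn't include code, but to make the transport concrete: in a typical GPU training job, collectives such as all-reduce are issued through a communication library like NCCL, which can carry traffic over RDMA fabrics such as RoCEv2. The sketch below is a minimal, hypothetical illustration; the environment variable values and tensor sizes are assumptions, not Meta's configuration.

    # Minimal sketch: an all-reduce across GPUs via PyTorch's NCCL backend,
    # which can run over RDMA transports such as RoCEv2 on suitably
    # configured NICs. Interface name and GID index are placeholders.
    import os
    import torch
    import torch.distributed as dist

    def main():
        # Steer NCCL toward a RoCE-capable NIC (illustrative values).
        os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
        os.environ.setdefault("NCCL_IB_GID_INDEX", "3")  # RoCEv2 GID entry

        dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE from env
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Each rank contributes a gradient-like tensor; all-reduce sums it
        # across every GPU, the core collective in data-parallel training.
        grad = torch.ones(1024, device="cuda") * dist.get_rank()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with a tool like torchrun, which sets the rank and rendezvous environment variables, the same script scales from one machine to thousands; the network fabric underneath determines how quickly the all-reduce completes.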

In line with recent advances in AI supercluster design, Meta has also moved to a fabric-based architecture that offers greater scalability than the fixed topology it used in the past.

The specialized fabric provides high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This means Meta can dynamically ramp capacity up or down as needed.
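The post doesn't spell out the topology, but leaf-spine (Clos) designs are the standard way to achieve this any-to-any property. The toy sketch below, with made-up sizes and switch names, shows why: every leaf connects to every spine, so any two endpoints are two switch hops apart, and capacity scales by adding switches rather than rewiring.

    # Toy model of a two-tier leaf-spine (Clos) fabric. Sizes and names are
    # illustrative assumptions, not Meta's actual parameters.
    from itertools import combinations

    N_LEAVES, N_SPINES = 8, 4

    # Every leaf switch links to every spine switch.
    links = {(f"leaf{l}", f"spine{s}")
             for l in range(N_LEAVES) for s in range(N_SPINES)}

    def two_hops(a: str, b: str) -> bool:
        """True if leaves a and b share a spine: one hop up, one hop down."""
        return any((a, f"spine{s}") in links and (b, f"spine{s}") in links
                   for s in range(N_SPINES))

    # Any two leaves (and the GPUs behind them) are two hops apart,
    # regardless of physical placement.
    assert all(two_hops(f"leaf{i}", f"leaf{j}")
               for i, j in combinations(range(N_LEAVES), 2))
    print("any-to-any: every leaf pair reachable in 2 hops")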

However, while new network designs use GPU compute more efficiently and can tap additional processors as needed, the supply of high-end AI chips remains limited by production constraints, capping how much capacity AI developers can actually deploy.

Nvidia Chip Delays

Deliveries of Nvidia’s H100 chips currently carry a two- to three-month lead time. At points in 2023, clients waited up to eight months for the highly sought-after GPUs.

While Nvidia has managed to ramp up H100 production, the release of its next generation of processors has reportedly been pushed back by three months amid ongoing design challenges.

First unveiled in March, the Blackwell range is set to offer a substantial performance boost for AI tasks. For instance, the flagship GB200 model will boast 20 petaflops per chip, a fivefold improvement over the H100’s four petaflops.
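Taking those cited figures at face value, the gap is easy to quantify. The quick calculation below uses only the per-chip numbers in this article and ignores real-world factors like memory bandwidth, interconnect, and numeric precision.

    # Back-of-envelope comparison using the per-chip figures cited above.
    H100_PFLOPS = 4    # petaflops per chip, as cited in this article
    GB200_PFLOPS = 20

    print(f"Per-chip speedup: {GB200_PFLOPS / H100_PFLOPS:.0f}x")  # -> 5x

    # Raw peak math only: the same compute budget needs 1/5 the chips.
    h100_count = 10_000  # hypothetical cluster size
    gb200_equiv = h100_count * H100_PFLOPS / GB200_PFLOPS
    print(f"{h100_count} H100s ~ {gb200_equiv:.0f} GB200s in peak FLOPs")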

Inference Demand Increases

In addition to investing in training networks, Big Tech firms like Meta are increasingly focused on the challenge of AI inference, i.e., running data through trained models.
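In code terms, inference is just a forward pass with gradient tracking switched off. The minimal PyTorch sketch below uses a stand-in model, not any of Meta’s production systems.

    # Minimal sketch of inference: a forward pass through a trained model
    # with gradient bookkeeping disabled. The tiny model is a stand-in.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    model.eval()  # put layers like dropout/batch-norm in inference mode

    features = torch.randn(32, 128)  # a batch of 32 illustrative inputs
    with torch.no_grad():            # no gradients: less memory, lower latency
        scores = model(features)
    print(scores.shape)  # torch.Size([32, 2])

Because inference skips backpropagation entirely, it has a different hardware profile from training, which is why dedicated accelerators like those discussed below make sense.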

At the silicon level, Meta has developed its own inference accelerator to power its increasingly sophisticated ranking and recommendation models and new solutions offered under the banner of Meta AI.

Apple, Microsoft, and Google have all unveiled similar dedicated AI chips in the past year. Meanwhile, Google’s Trillium TPUs (Tensor Processing Units) are posited as an alternative form of AI hardware that is equally capable of training and inference tasks.

The shift in emphasis from training to inference could see demand for AI-optimized CPUs increase as the industry moves beyond research and development to real-world applications and business models.
