Key Takeaways
Training the largest AI models requires a lot of processing power—so much so that the largest AI developers build their own custom supercomputers housed in vast data centers to carry out the work.
Accordingly, Elon Musk’s plans to grow X into a major AI player have seen his startup xAI build the “Colossus Supercluster” – a massive array of Nvidia GPUs that the company brought online over the weekend.
Superclusters are essentially large networks of interconnected Graphics Processing Units (GPUs) designed to handle the immense workloads required for advanced AI tasks, such as deep learning and large-scale data analysis.
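To make that concrete, the sketch below shows the basic data-parallel pattern such clusters scale up: each GPU trains on its own slice of the data while gradients are synchronized over the cluster network. This is a minimal, hypothetical example using PyTorch’s DistributedDataParallel; the model, data, and hyperparameters are placeholders, not anything xAI has disclosed.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK; NCCL drives
    # the GPU-to-GPU communication that cluster networks are built for.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    model = DDP(model, device_ids=[local_rank])  # sync gradients across GPUs
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device="cuda")  # each rank sees its own data shard
        loss = model(x).pow(2).mean()              # dummy loss for illustration
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced over the network here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=8 this_script.py
```

The same pattern, combined with model and pipeline parallelism, is what lets training scale from the eight GPUs in a single server to the tens of thousands in a supercluster.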
Musk’s new supercomputer consists of 100,000 Nvidia H100 GPUs assembled using a low-latency network architecture optimized for AI training. The cluster will be further enhanced with an additional 50,000 H200s (which offer up to 45% better performance than H100s) “in a few months,” he stated.
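For a rough sense of scale, the back-of-envelope arithmetic below estimates the cluster’s peak throughput, assuming roughly 1 petaFLOP/s of dense FP16 compute per H100 and taking the quoted “up to 45%” figure at face value as a throughput multiplier. Both assumptions are illustrative, not official specifications.

```python
# Illustrative peak-throughput estimate for Colossus.
H100_PFLOPS = 1.0                  # ~1 PFLOP/s dense FP16 per H100 (assumption)
H200_PFLOPS = H100_PFLOPS * 1.45   # the "up to 45%" claim, taken at face value

current = 100_000 * H100_PFLOPS            # 100,000 H100s today
expanded = current + 50_000 * H200_PFLOPS  # plus 50,000 H200s later

print(f"Current:  {current / 1000:.0f} exaFLOPS peak")    # ~100 exaFLOPS
print(f"Expanded: {expanded / 1000:.1f} exaFLOPS peak")   # ~172.5 exaFLOPS
```

Real-world training throughput is far lower than peak figures like these, since networking, memory bandwidth, and software efficiency all take their cut.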
GPUs are particularly well-suited for AI applications due to their ability to perform many parallel operations simultaneously. Unlike Central Processing Units (CPUs), which are better suited to sequential tasks, GPUs excel in handling the complex matrix and vector operations that are fundamental to AI algorithms.
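A quick, machine-dependent experiment illustrates the point: the same large matrix multiplication, the workhorse operation of neural networks, can be timed on a CPU and a GPU. This is an illustrative snippet rather than a rigorous benchmark.

```python
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
_ = a @ b  # CPU: limited parallelism across a handful of cores
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu  # GPU: the same work spread across thousands of cores
    torch.cuda.synchronize()  # wait for the GPU to finish before timing
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
```

On typical hardware the GPU finishes the multiplication many times faster, which is exactly the advantage that compounds across the billions of such operations in a training run.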
Amid surging demand for GPU compute, data centers have moved to ramp up their capacity. However, the field is evolving rapidly, and even AI-training computers that were considered the best in the world a few years ago pale in comparison to the new Colossus Supercluster.
Of course, at $40,000 apiece, H200s aren’t cheap. But xAI’s latest supercomputer could soon be outgunned by one currently being built by Microsoft.
Earlier this year, reports surfaced that the Big Tech firm is investing $100 billion into a new AI training platform to replace the 10,000-GPU cluster it built for OpenAI in 2020.
Dubbed “Stargate,” the new data center is expected to be operational by 2028, when it will be used to drive the next stage of Microsoft and OpenAI’s model training.
Meanwhile, Meta is also investing heavily in its AI hardware stack. In January, CEO Mark Zuckerberg said the company is building “an absolutely massive amount of infrastructure” to support AI training. “By the end of this year, we’re going to have around 350,000 Nvidia H100s. Or around 600,000 H100 equivalents of compute if you include other GPUs,” he added.
With tech firms competing for who can build the largest, most powerful AI cluster, Musk’s claim that “Colossus is the most powerful AI training system in the world” may not hold up for long.
Nor is the competition as simple as who can build the biggest single computer. Operating dozens of data centers each, companies like Meta, Google, and Microsoft have access to GPU resources equivalent to multiple Colossus-scale clusters.
In 2023, Nvidia reportedly shipped Meta and Microsoft 150,000 H100s each, while Google received a further 50,000. As the chipmaker began phasing out the H100 in 2024, a new generation of H200-based supercomputers emerged, raising the bar even higher. Meanwhile, innovations in data center network design continue to optimize AI infrastructure, letting developers do more with less.
Since acquiring X (then Twitter) in 2022, Elon Musk has embarked on a major business overhaul that places a significant emphasis on AI.
Musk incorporated xAI last year to help carry out his plans for the platform. Positioning the startup as an OpenAI rival from the get-go, xAI moved quickly to release the ChatGPT alternative Grok in November.
Looking ahead, Musk aims to leverage AI to create a platform that goes beyond traditional social media, transforming X into a multifaceted digital hub that spans financial services, communication tools and more.
While GPU superclusters have certainly fueled the generative AI revolution up until now, there is no guarantee that ever-more-powerful computers will continue to deliver results. A dwindling supply of training data may also stall progress. Meanwhile, increasingly capable small models could make the current Big Tech size contest less commercially relevant.
For now, xAI and OpenAI own some of the world’s most powerful computers. But the history of technology teaches us that the biggest spender doesn’t always win, and a single disruptive innovation can reset the entire race.