AI factories are factories: Overcoming industry challenges to commoditize AI


This article is part of VentureBeat's special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here.

This article is part of VentureBeat's special issue, “AI at Scale: From Vision to Viability.” Read more from the issue here.

Travel 60 years back to Stevenson, Alabama, and you'll find the Widows Creek Fossil Plant, a 1.6-gigawatt coal-fired generating station with one of the tallest chimneys in the world. Today, a Google data center stands where the Widows Creek plant once did. Instead of carrying coal-fired power, the old facility's transmission lines now deliver renewable energy to run the company's online services.

That metamorphosis, from a carbon-burning facility to a digital factory, is symbolic of a global transition to digital infrastructure. And we will soon see the production of intelligence kick into high gear thanks to AI factories.

These data centers are decision-making machines that gobble up compute, networking and storage resources while turning information into insights. Densely populated data centers are sprouting in record time to meet the insatiable demand for artificial intelligence.

The infrastructure to support AI inherits many of the same challenges that define industrial factories, from power to scalability and reliability, requiring modern solutions to centuries-old problems.

The new workforce: Compute power

In the age of steam and iron, labor meant thousands of workers operating machinery around the clock. In today's AI factories, output is determined by compute power. Training large AI models requires enormous processing resources. According to Aparna Ramani, VP of engineering at Meta, training compute for these models is growing by roughly a factor of four per year across the industry.
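A fourfold annual growth rate compounds very quickly. The sketch below (illustrative numbers only, not Meta's actual figures) shows how training compute requirements would scale under that trend:

```python
# Projects relative training compute under a fixed annual growth multiplier.
# The 4x-per-year figure comes from the article; the projection horizon is
# illustrative.
def projected_compute(years: int, growth_per_year: float = 4.0) -> float:
    """Relative training compute after `years` at a fixed annual multiplier."""
    return growth_per_year ** years

for y in range(1, 6):
    print(f"Year {y}: {projected_compute(y):,.0f}x today's compute")
```

At that rate, five years of growth implies more than a thousandfold increase in compute, which is why supply chains and power budgets strain so quickly.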

That level of scaling is on track to create some of the same bottlenecks that existed in the industrial world. There are supply chain constraints, to begin with. GPUs, the engines of the AI revolution, come from a small number of manufacturers. They are incredibly complex. They are in high demand. So it should come as no surprise that they are subject to cost volatility.

In an effort to circumvent some of those supply limitations, big names like AWS, Google, IBM, Intel and Meta are designing their own custom silicon. These chips are optimized for power, performance and cost, making them specialists with unique features for their respective workloads.

This change isn't just about the hardware, though. There is also concern about how AI technologies will affect the job market. Research published by Columbia Business School studied the investment management industry and found that the adoption of AI leads to a 5% decrease in the share of labor income, mirroring the changes seen during the Industrial Revolution.

“AI is likely to be transformative for many, perhaps all, sectors of the economy,” said Professor Laura Veldkamp, one of the authors of the paper. “I am quite hopeful that we will find gainful employment for many people. But there will be moving costs.”

Where will the energy to scale come from?

Cost and availability aside, the GPUs that make up the AI factory workforce are notoriously power hungry. When the xAI team brought its Colossus supercomputer cluster online in September 2024, it reportedly had access to somewhere between seven and eight megawatts from the Tennessee Valley Authority. But the cluster's 100,000 H100 GPUs require far more than that. So xAI brought in VoltaGrid mobile generators to temporarily make up the difference. In early November, Memphis Light, Gas & Water reached a more permanent agreement with TVA to deliver xAI an additional 150 megawatts of capacity. Critics counter that the site's consumption strains the city's grid and contributes to its poor air quality. And Elon Musk already has plans for another 100,000 H100/H200 GPUs under the same roof.
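A back-of-the-envelope calculation shows why seven or eight megawatts was never going to be enough. The 700 W figure below is the published TDP of an H100 SXM part; the power usage effectiveness (PUE) overhead factor is an assumption for illustration, not a measured value for Colossus:

```python
# Rough power demand for a 100,000-GPU cluster.
GPU_COUNT = 100_000
GPU_TDP_WATTS = 700   # per-GPU thermal design power (H100 SXM spec)
PUE = 1.3             # assumed overhead for cooling, networking, power delivery

it_load_mw = GPU_COUNT * GPU_TDP_WATTS / 1e6
facility_mw = it_load_mw * PUE

print(f"GPU load alone: {it_load_mw:.0f} MW")
print(f"With facility overhead: {facility_mw:.0f} MW")
```

Even before counting CPUs, storage and cooling, the GPUs alone draw roughly 70 MW, an order of magnitude more than the grid connection initially available.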

According to McKinsey, the power needs of data centers are expected to grow to approximately three times current capacity by the end of the decade. At the same time, the rate at which processors double their performance efficiency is slowing. Performance per watt is still improving, but at a decelerating pace, and certainly not fast enough to keep up with the demand for computational horsepower.

So, what will it take to match the feverish adoption of AI technologies? A report from Goldman Sachs suggests that US utilities will need to invest about $50 billion in new generation capacity just to support data centers. Analysts also expect data center electricity consumption to drive about 3.3 billion cubic feet per day of new natural gas demand by 2030.

Scaling becomes more difficult as AI factories grow

Training the models that make AI factories accurate and efficient can take tens of thousands of GPUs, all working in parallel, for months at a time. If a GPU fails during training, the run must be stopped, restored to a recent checkpoint and resumed. However, as the complexity of AI factories increases, so does the probability of failure. Ramani addressed this concern during an AI Infra @ Scale presentation.
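The stop-restore-resume pattern described above can be sketched in a few lines. This is a minimal illustration using `pickle` as a stand-in for a real framework's checkpoint format; the file name, interval and placeholder training step are all hypothetical, not Meta's actual stack:

```python
# Minimal checkpoint/restore loop: periodically persist training state so a
# failed run can resume from the last checkpoint instead of from scratch.
import os
import pickle

CKPT_PATH = "train_state.pkl"
CHECKPOINT_EVERY = 100  # steps between checkpoints (illustrative)

def save_checkpoint(step: int, model_state: dict) -> None:
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"step": step, "model_state": model_state}, f)

def load_checkpoint() -> tuple[int, dict]:
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["model_state"]
    return 0, {}

start_step, state = load_checkpoint()
for step in range(start_step, start_step + 300):
    state["loss"] = 1.0 / (step + 1)  # placeholder for a real training step
    if (step + 1) % CHECKPOINT_EVERY == 0:
        save_checkpoint(step + 1, state)
```

In production systems the checkpoint itself can be terabytes of optimizer and model state, so how often to save becomes a trade-off between checkpoint overhead and how much work is lost on each failure.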

“Stopping and restarting is a bit of a pain. But it's exacerbated by the fact that, as the number of GPUs increases, so does the probability of failure. And at some point, the number of failures can become overwhelming: you lose too much time to these failures and can hardly finish a training run.”
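Why failures compound with cluster size follows from basic probability: if each GPU independently has a small chance p of failing during some window, the chance that a run of N GPUs sees at least one failure is 1 − (1 − p)^N. The numbers below are illustrative, not measured failure rates:

```python
# Probability that at least one of n_gpus fails, assuming independent
# per-GPU failure probability p_per_gpu within some time window.
def p_any_failure(n_gpus: int, p_per_gpu: float) -> float:
    return 1.0 - (1.0 - p_per_gpu) ** n_gpus

p = 1e-5  # assumed per-GPU failure probability in the window (illustrative)
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} GPUs: {p_any_failure(n, p):.1%} chance of at least one failure")
```

With these assumed numbers, a failure that is vanishingly rare on one GPU becomes a near-certainty at the 100,000-GPU scale, which is exactly the dynamic Ramani describes.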

According to Ramani, Meta is taking proactive steps to detect failures earlier and to get back up and running faster. Further on the horizon, research into asynchronous training could improve fault tolerance while simultaneously improving GPU utilization and allowing training runs to be distributed across multiple data centers.

Always-on AI will change the way we do business

Just as factories once relied on new technologies and organizational models to scale up production of goods, AI factories feed on compute power, networking infrastructure and storage to produce tokens, the smallest pieces of information used by an AI model.

“This AI factory is developing, creating, producing something of great value, a new commodity,” Nvidia CEO Jensen Huang said in his Computex 2024 keynote. “It's completely usable in almost every industry. And that's why it's a new Industrial Revolution.”

McKinsey says generative AI has the potential to add the equivalent of $2.6 trillion to $4.4 trillion in annual economic benefits across 63 use cases. In every application, whether the AI factory is hosted in the cloud, deployed at the edge or self-managed, the same infrastructure challenges must be overcome, just as in an industrial factory. According to the same McKinsey report, capturing even a quarter of that value by the end of the decade will require another 50 to 60 gigawatts of data center capacity, for starters.

But the outcome of this growth is poised to change the IT industry irrevocably. Huang explained that AI factories will make it possible for the IT industry to generate intelligence for $100 trillion worth of industry. “It will be a manufacturing industry. Not a computer-manufacturing industry, but the use of computers in manufacturing. This has never happened before. A rare thing.”
