Thinking about Training Your Own AI Model?
Read this before you make any decisions. It explains the two paths you can take depending on the size of your AI model.
One of my clients approached me last month wanting to train a proprietary foundation model entirely from scratch.
They wanted to own their intelligence and stop paying external API fees.
They had a massive budget but absolutely no clue how to actually build the system.
They assumed they could just buy a thousand graphics cards and string them together with standard network cables.
I had to stop them from wasting millions of dollars. While researching the exact architecture their cluster would require, I found the reality absolutely fascinating, and I was struck by how few people in our industry truly understand the brutal math and physics behind distributed training.
Based on my research, here is what I know so far.
To understand this physical reality we need to divide the training landscape into two distinct parts.
The SLM (Local)
If you want to train a Small Language Model (SLM) or fine-tune an existing model for a highly scoped internal routing task, you can rely on a single local system.
You do not need a massive data center. You just need to understand the physics of a single motherboard.
The Software
To make raw silicon do math you need a translator. In the local training environment you have two distinct software paths with completely different hardware requirements.
The first path is CUDA.
Nvidia built CUDA to be their proprietary programming interface. It converts high level Python code into low level parallel math instructions that the hardware can physically execute.
If you use CUDA you must buy Nvidia graphics cards.
The second path is MLX.
Apple built the MLX framework to compete directly with CUDA for local execution.
MLX is designed exclusively for Apple Silicon. It allows developers to run complex machine learning math on standard Mac Studio desktops.
MLX is still young and far less mature than CUDA, and its library support is limited, which makes CUDA the safer default for most teams.
The Shared Memory Advantage
When training locally the physical layout of the memory dictates your performance.
In a standard Nvidia PC build the system relies on separated memory.
You must copy your training data from the system RAM across the PCIe bus into the dedicated VRAM of the graphics card.
Apple avoids this entirely with Unified Memory.
The central processor and the graphics processor share the exact same physical memory pool.
When you load your training data into an Apple system there is zero data copying required.
The processors simply pass a pointer to the shared memory block. This shared memory architecture gives Apple Silicon a huge advantage when training small models on a single local machine.
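The pointer-passing idea can be illustrated in plain Python with a memoryview, which exposes a buffer without copying it. This is only an analogy for unified memory, not actual MLX code:

```python
# Analogy only: a memoryview shares the underlying buffer
# the way CPU and GPU share one pool under unified memory.
data = bytearray(b"training batch")  # pretend this is a tensor buffer
view = memoryview(data)              # no bytes are copied here

view[0:8] = b"TRAINING"              # one processor writes through the view...
print(data.decode())                 # ...and the other sees the change: TRAINING batch
```

Contrast this with the separated-memory model, where the write would land in a second buffer that must be copied back across the PCIe bus before the other processor can see it.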
The Frontier (Cluster)
When you leave the local machine to train a billion or trillion parameter model you enter a realm governed by strict physical capacity.
You are no longer building a computer. You are building a cluster of computers that acts as a single supercomputer.
The Software
Apple and MLX completely disappear at this scale.
Nvidia dominates the frontier because CUDA scales reliably across thousands of machines.
Nvidia spent a decade ensuring frameworks like PyTorch default to CUDA for distributed workloads.
The software moat forces you to buy into their enterprise hardware ecosystem.
The Hardware
We hit a hardware capacity wall very quickly when building foundation models.
A 70 billion parameter model consumes terabytes of memory. You have to store the massive weight matrices, the optimizer states, and the continuous training batches.
No single GPU on earth holds that much data.
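A rough back-of-the-envelope calculation shows why. Assuming a common mixed-precision setup (fp16 weights and gradients plus fp32 Adam optimizer states), the per-parameter cost adds up fast:

```python
params = 70e9  # 70 billion parameters

bytes_per_param = (
    2 +   # fp16 weights
    2 +   # fp16 gradients
    12    # Adam states: fp32 master weights + two fp32 moment vectors
)

total_tb = params * bytes_per_param / 1e12
print(f"{total_tb:.2f} TB of training state")  # ~1.12 TB, before activations and batches
```

A flagship accelerator today carries on the order of 80 to 192 GB of memory, so the training state alone spans more than a dozen GPUs.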
The only practical solution is to distribute the work, and the most common starting point is Data Parallelism. You must build a distributed cluster.
You copy the exact same model across thousands of distinct GPUs. You slice the massive training dataset into smaller manageable chunks and feed them to the separate GPUs simultaneously.
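The slicing step itself is simple. A toy sketch in pure Python (real frameworks use samplers such as PyTorch's DistributedSampler, but the idea is the same):

```python
def shard(dataset, num_gpus):
    """Split a dataset into one round-robin chunk per GPU replica."""
    return [dataset[rank::num_gpus] for rank in range(num_gpus)]

batches = list(range(8))    # 8 training batches
shards = shard(batches, 4)  # 4 GPUs, each holding a full copy of the model
print(shards)               # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every GPU sees different data but the identical model, which is exactly why their results must be reconciled after every step.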
Gradient Synchronization
Having thousands of processors working at once sounds incredibly efficient until you realize they have to talk to each other constantly.
As GPUs process their data chunks they calculate gradients.
Gradients are massive mathematical vectors that tell the model exactly how to adjust its weights to decrease errors.
This is the literal process of machine learning.
This requirement introduces a massive architectural bottleneck. Before moving to the next training step every single one of those thousands of GPUs must share its gradients. They must average them together and update their local copies identically so the cluster learns as one single brain.
This Gradient Synchronization happens millions of times per run.
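Stripped of the networking, the synchronization step is just an element-wise average of every replica's gradient vector. A toy sketch (not real NCCL code):

```python
def all_reduce_mean(grads_per_gpu):
    """Average gradients element-wise across replicas.
    Every GPU then applies this identical averaged gradient,
    so all model copies stay in lockstep."""
    n = len(grads_per_gpu)
    return [sum(col) / n for col in zip(*grads_per_gpu)]

# Three GPUs, each with a gradient vector from its own data shard.
grads = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
print(all_reduce_mean(grads))  # [2.0, 0.0] - one update for the whole cluster
```

The hard part is not the arithmetic; it is moving those vectors, which for a 70B model run to hundreds of gigabytes, between thousands of machines millions of times.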
The NCCL
Moving this much data simultaneously breaks normal computer networks.
If ten thousand GPUs broadcast massive files simultaneously over a standard network the data center collapses instantly.
The GPUs finish their math in milliseconds and then sit completely idle waiting for network switches to clear the traffic jam.
Nvidia solved this network choke with a software tool called the Nvidia Collective Communications Library or NCCL.
NCCL uses a brilliant mathematical layout called Ring All Reduce.
It arranges GPUs in a logical ring. Data is broken into small chunks and passed strictly to immediate neighbors instead of broadcasting to everyone.
For a cluster of N GPUs and a gradient payload of size D, each node sends (and receives) a total of 2 × D × (N − 1) / N bytes.
Because the fraction (N − 1) / N approaches 1 as N grows, the total data transferred by any single node never exceeds 2D.
This mathematical proof guarantees the network will not choke regardless of how many thousands of GPUs you add to the cluster.
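You can check the bound numerically. This is a toy calculation of the per-node traffic, not a network simulation:

```python
def ring_all_reduce_traffic(n_gpus, data_size):
    """Per-node data moved in ring all-reduce:
    (N - 1) chunks of size D/N in the reduce-scatter phase,
    plus (N - 1) chunks in the all-gather phase."""
    chunk = data_size / n_gpus
    return 2 * (n_gpus - 1) * chunk  # = 2 * D * (N - 1) / N

d = 1.0  # gradient size, normalized
for n in (2, 8, 1024):
    print(n, ring_all_reduce_traffic(n, d))
# Per-node traffic climbs toward 2.0 but never exceeds it,
# no matter how many GPUs join the ring.
```

Naive broadcasting, by contrast, would cost each node roughly (N − 1) × D, which grows without bound as the cluster scales.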
The NVLink
Software optimization can only take you so far before physics gets in the way.
Even with NCCL mathematically optimizing the traffic we still hit a physical motherboard limit. Moving hundreds of gigabytes over a standard PCIe connection is cripplingly slow.
A standard PCIe 5.0 x16 link maxes out around 64 gigabytes per second. That is far too slow for gradient synchronization.
Nvidia built NVLink to bypass the motherboard entirely.
NVLink is a proprietary physical bridge connecting GPUs directly to each other with thick copper cables.
It delivers a staggering 1.8 terabytes per second of bidirectional bandwidth.
It transfers calculations so fast that the entire server rack functions as a single unified processor.
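The difference is easy to put in numbers. Taking the 70B model and the bandwidth figures above, and using the 2D ring all-reduce bound per node (a rough estimate that ignores protocol overhead and overlap with compute):

```python
grad_bytes = 70e9 * 2          # fp16 gradients of a 70B model, ~140 GB
traffic = 2 * grad_bytes       # ring all-reduce upper bound per node

pcie_bw = 64e9                 # PCIe 5.0 x16, ~64 GB/s
nvlink_bw = 1.8e12             # NVLink, 1.8 TB/s bidirectional

print(f"PCIe:   {traffic / pcie_bw:.2f} s per sync")    # several seconds
print(f"NVLink: {traffic / nvlink_bw:.2f} s per sync")  # a fraction of a second
```

Multiply that per-step gap by the millions of synchronization steps in a training run and the idle time, and the cloud bill, compounds accordingly.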
The Cost
We need to connect this complex architecture back to the financial reality facing my client.
We must evaluate the strict economic difference between the local path and the frontier path before we reach a final decision.
The SLM (Local)
Training a Small Language Model (SLM) locally is a highly predictable financial commitment.
You buy the hardware once. You plug it into a standard wall outlet.
The unified memory architecture allows your engineers to experiment and fail rapidly without incurring hourly cloud compute penalties.
Your total financial risk is strictly capped at the initial purchase price of the desktop machine.
This makes local SLM development an incredible bargain for enterprise teams building scoped internal tools.
The Frontier (Cluster)
Training a frontier model is a completely different financial universe.
It requires ten thousand to one hundred thousand GPUs running continuously for months.
Without NCCL and NVLink a cluster spends forty percent of its time waiting for data transfers.
When your cloud bill is hundreds of thousands of dollars a day idle compute is pure financial hemorrhage.
You are burning cash to power hardware that is doing absolutely nothing.
The physics directly dictates the invoice.
Conclusion
We need to weigh the ambition of training proprietary models against the harsh reality of data center physics before jumping in.
If you want to train a small model locally the unified memory of an Apple system running MLX is an incredible and cost effective engineering marvel.
You avoid vendor lock-in and bypass the PCIe bottleneck entirely on a single machine, but you have to deal with the drawbacks of MLX: a young framework with limited library support.
If you want to compete at the frontier and train massive models you have absolutely no choice.
You are buying into a closed high speed distributed network ecosystem. You cannot build a custom cluster with cheap networking and expect it to survive the synchronization penalty.
Nvidia dominates because they control the entire vertical stack.
Their true moat is not just the silicon chip processing the math.
Their moat is the complex distributed software and the thick physical cables tying the entire data center together.