When Nvidia announces a new chip, the marketing slide always screams about FLOPS (Floating Point Operations Per Second). "4 PetaFLOPS of Compute!" It sounds amazing.
But for generative AI text inference (LLMs), FLOPS are mostly a vanity metric. The number that actually matters is Memory Bandwidth (TB/s).
The Arithmetic Intensity Problem
LLMs are mathematically simple but enormous. To generate a single token during decoding, the GPU must stream the model's entire weight set (all 70 billion parameters, for a Llama-3-70B-class model) from memory into the compute cores, do a relatively small amount of math with each weight, and then repeat the whole process for the next token.
The ratio of "Math done" to "Data moved" is called Arithmetic Intensity.
Intensity = Ops / Bytes
LLMs have extremely low arithmetic intensity. They are basically memory copy operations disguised as AI. This leads to the Memory Wall: the compute cores (the engines) spend most of their time idle, waiting for the memory (the fuel line) to deliver the data.
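To make that concrete, here is a minimal back-of-the-envelope sketch. It assumes a 70B-parameter model in FP16 (2 bytes per weight), roughly 2 FLOPs per weight per generated token, and H100 SXM headline specs (~989 TFLOPS dense BF16, 3.35 TB/s of HBM bandwidth); real numbers shift with batch size, quantization, and kernel efficiency.

```python
# Back-of-the-envelope arithmetic intensity for batch-1 LLM decoding.
# Assumptions (not measurements): FP16 weights at 2 bytes each, roughly
# 2 FLOPs per weight per generated token, H100 SXM headline specs.

params = 70e9                      # a Llama-3-70B-class model
bytes_per_param = 2                # FP16 / BF16 weights
flops_per_token = 2 * params       # ~one multiply-add per weight
bytes_per_token = params * bytes_per_param

intensity = flops_per_token / bytes_per_token
print(f"Arithmetic intensity: {intensity:.1f} FLOPs/byte")        # ~1.0

peak_flops = 989e12                # H100 SXM dense BF16, no sparsity
peak_bandwidth = 3.35e12           # H100 SXM HBM bandwidth in bytes/s

# The chip only keeps its compute units fed above this ratio (roofline ridge).
breakeven = peak_flops / peak_bandwidth
print(f"H100 break-even intensity: {breakeven:.0f} FLOPs/byte")   # ~295

# At ~1 FLOP/byte, decoding is memory bound: token rate is set by bandwidth.
ceiling = peak_bandwidth / bytes_per_token
print(f"Bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s at batch size 1")
```

At roughly 1 FLOP per byte against a break-even point near 295, the compute units are starved by two orders of magnitude.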
HBM3e: The Real Innovation
This explains why the Nvidia H100 and H200 are so expensive. It's not the compute; it's the HBM (High Bandwidth Memory): HBM3 on the H100, HBM3e on the H200.
Nvidia A100: 2.0 TB/s Bandwidth.
Nvidia H100: 3.35 TB/s Bandwidth.
Nvidia H200: 4.8 TB/s Bandwidth.
For memory-bound LLM inference, the performance jump from H100 to H200 tracks that bandwidth number almost exactly. They are widening the fuel line.
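A quick extension of the same sketch shows why. Assuming the same 140 GB of FP16 weights for a 70B model and a purely bandwidth-bound decode (an illustration, not a benchmark), the token-rate ceilings scale directly with the numbers above:

```python
# Illustrative bandwidth-bound decode ceilings for a 70B model in FP16
# (~140 GB of weights streamed per token). Not benchmarks: this ignores
# the KV cache, batching, multi-GPU sharding, and kernel overhead.
weights_bytes = 70e9 * 2  # bytes of weights read per generated token

bandwidth_tb_s = {"A100": 2.0, "H100": 3.35, "H200": 4.8}

for gpu, tb_s in bandwidth_tb_s.items():
    tokens_per_s = (tb_s * 1e12) / weights_bytes
    print(f"{gpu}: {tb_s} TB/s -> ~{tokens_per_s:.0f} tokens/s ceiling")
```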
Architecting for the Wall
When choosing hardware, match the chip to the problem intensity:
High Intensity (Stable Diffusion / Image Gen): These models do a lot of math per byte of data they move, so FLOPS genuinely matter here. Standard GPUs (L40S / A10) are a good fit.
Low Intensity (LLM Text Gen): These are memory bound, so you are really buying bandwidth. This is why running Llama-3-70B on an A100 is painful: it is bandwidth starved. You need the H100 or H200 (see the roofline sketch after this list).
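One way to encode that rule of thumb is a simple roofline check: estimate the workload's FLOPs-per-byte and compare it with the chip's compute-to-bandwidth ratio. The GPU figures below are headline spec-sheet numbers and the workload intensities are rough illustrations, not measurements.

```python
# Rough roofline check: is a workload compute bound or memory bound on a GPU?
# GPU figures are headline spec-sheet numbers (dense FP16/BF16 tensor TFLOPS,
# memory bandwidth in TB/s); workload intensities below are illustrative.
GPUS = {
    "A10":  (125, 0.6),
    "L40S": (362, 0.86),
    "A100": (312, 2.0),
    "H100": (989, 3.35),
    "H200": (989, 4.8),
}

def bound_by(workload_flops_per_byte: float, gpu: str) -> str:
    tflops, tb_s = GPUS[gpu]
    ridge = tflops / tb_s  # FLOPs-per-byte where compute and memory break even
    return "compute bound" if workload_flops_per_byte > ridge else "memory bound"

# Batch-1 LLM decode sits around ~1 FLOP/byte; diffusion-style image
# generation is commonly hundreds of FLOPs/byte.
print("LLM decode on H100:", bound_by(1, "H100"))      # memory bound
print("Image gen on L40S:", bound_by(500, "L40S"))     # compute bound
```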
Conclusion
When negotiating with your cloud provider, stop asking about "CUDA Cores." Ask about "Memory Bandwidth per Dollar." That is the metric that determines how much inference speed you actually get for your money.
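If you want to operationalize that, a small helper like the hypothetical one below makes the comparison explicit. The hourly prices are made-up placeholders for illustration only; plug in your provider's actual quotes.

```python
# Hypothetical bandwidth-per-dollar comparison. The hourly prices below are
# placeholders, not real quotes -- substitute your provider's numbers.
offers = {
    # gpu: (memory bandwidth in TB/s, on-demand $/hour -- placeholder)
    "A100": (2.0, 3.00),
    "H100": (3.35, 6.00),
    "H200": (4.8, 8.00),
}

for gpu, (tb_s, usd_per_hour) in offers.items():
    print(f"{gpu}: {tb_s / usd_per_hour:.2f} TB/s per dollar-hour")
```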
All in One Place
Atler Pilot decodes your cloud spend story by bringing monitoring, automation, and intelligent insights together for faster and better cloud operations.

