In the CPU world (AWS Lambda), "Serverless" is magical. You send a request, a micro-VM spins up in 20ms, runs your code, and dies. It works because a Node.js runtime is tiny (50MB) and RAM is fast.
In the GPU world, "Serverless" is a physics problem.
To run Llama-3-70B, you need to load 140GB of weights into GPU VRAM. Even with PCIe Gen 5 (64GB/s bandwidth), this takes several seconds. You cannot spin up a GPU in 20ms.
This creates the fundamental dilemma of AI Infrastructure: Do I keep the GPU warm (Provisioned) or accept the latency (Serverless)?
Part 1: The Physics of Cold Starts
Let's do the math on loading a large model. This is the math that keeps Infrastructure Engineers up at night.
The Physics of VRAM Loading:
Model Size: 70 Billion Parameters (FP16)
Weight Size: 140 Gigabytes (2 bytes per param)
PCIe Gen4 Bandwidth: ~32 GB/s (theoretical max for an x16 link)
Minimum Load Time: 140 GB / 32 GB/s = 4.375 Seconds
This is the theoretical minimum. In reality, with driver overhead, weight initialization, and CUDA context creation, a "Cold Start" for a 70B model is often 15-30 seconds. For a Chatbot, this is unacceptable: a user will refresh the page if the loading spinner runs for more than 3 seconds.
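A quick back-of-the-envelope script makes the scaling obvious. The bandwidth and overhead figures are illustrative assumptions, not benchmarks:
Python
# Back-of-the-envelope cold-start estimate for loading model weights into VRAM.
# Bandwidth and fixed-overhead figures are illustrative assumptions.

def cold_start_seconds(params_billion: float,
                       bytes_per_param: int = 2,     # FP16
                       pcie_gb_per_s: float = 32.0,  # PCIe Gen4 x16
                       overhead_s: float = 10.0) -> float:
    """Weight-transfer time plus fixed overhead (CUDA context, tokenizer, warm-up)."""
    weight_gb = params_billion * bytes_per_param    # 70B * 2 bytes = 140 GB
    return weight_gb / pcie_gb_per_s + overhead_s   # 140 / 32 = 4.375 s, plus overhead

for size in (7, 13, 70):
    print(f"{size}B model: ~{cold_start_seconds(size):.1f} s cold start")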
The Energy Cost of Cold Starts
It's not just time; it's energy. Spinning up a GPU, loading 140GB of data, and then shutting it down uses significant power. If you only process 1 token, the "Energy Overhead" of the cold start is 99% of the total energy used. This is why "Scale to Zero" is environmentally expensive for AI.
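To put a number on that claim, here is a rough sketch; the power draw, cold-start duration, and per-token energy are assumptions chosen only to illustrate the ratio:
Python
# Rough energy-overhead estimate for a cold start that serves a single token.
# Power draw, cold-start duration, and per-token energy are assumed values.

GPU_POWER_W = 400          # assumed average draw while loading weights
COLD_START_S = 20          # assumed cold-start duration
ENERGY_PER_TOKEN_J = 0.5   # assumed marginal energy per generated token

cold_start_j = GPU_POWER_W * COLD_START_S   # 8,000 J spent before any useful work
useful_j = ENERGY_PER_TOKEN_J * 1           # 1 token served

overhead = cold_start_j / (cold_start_j + useful_j)
print(f"Cold-start share of total energy: {overhead:.2%}")  # > 99%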
The Break-Even Calculus (2025 Estimates)
| Traffic Pattern | Serverless (Lambda) Cost | Provisioned (EC2) Cost | Winner |
| :--- | :--- | :--- | :--- |
| Spiky (0 to 10k req/sec) | $500/mo | $1,200/mo (Over-provisioned) | Serverless |
| Steady (50 req/sec) | $2,000/mo | $400/mo (small EC2 fleet) | Provisioned |
| AI Inference (GPU) | $10,000/mo (Endpoint) | $2,500/mo (Dedicated A10G) | Provisioned |
Tool: AWS Lambda Power Tuning. Don't guess memory settings; measure them. This state machine runs your function at 128MB, 256MB, ... 2048MB and plots Cost vs Speed (the branch bodies and the Visualize state are elided here):
JSON
{
  "Comment": "Power Tuning State Machine",
  "StartAt": "Initialize",
  "States": {
    "Initialize": {
      "Type": "Pass",
      "Next": "ParallelExec"
    },
    "ParallelExec": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "Run128MB", "States": { ... } },
        { "StartAt": "Run256MB", "States": { ... } }
      ],
      "Next": "Visualize"
    }
  }
}
Part 2: The Two Patterns
1. Serverless Inference (Token-Based)
Vendors: OpenAI API, Anthropic API, AWS Bedrock (On-Demand), Replicate.
How it works: The vendor runs a massive cluster of permanently warm GPUs. They multiplex thousands of customers onto the same batch.
Pros: Pay per token. Zero idle cost. Ideal for "Spiky" traffic.
Cons: No guarantee of throughput. You share the queue with everyone else. If a new viral app launches (like ChatGPT did), your latency spikes because the shared queues are full.
2. Provisioned Throughput (Time-Based)
Vendors: AWS Bedrock Provisioned, Azure PTU, SageMaker Endpoints, BentoML.
How it works: You reserve a specific number of H100s (e.g., 8 GPUs) for a specific time (e.g., 1 month). They are dedicated to you. They sit idle when you don't use them.
Pros: Guaranteed latency. No cold starts. Privacy (single tenant).
Cons: You pay even if no one is using it ($30/hour/instance). This is "Capital Expenditure" logic in an OpEx world.
Part 3: The Tipping Point Calculation
When does it make sense to switch from Pay-per-Token to Pay-per-Hour? This "Break-Even Point" is the most common question in AI Finance.
Let's assume:
Serverless Cost: $5.00 / 1M tokens.
Provisioned Cost: $20.00 / hour (Capacity: ~1000 tokens/sec implies ~3.6M tokens/hour max).
Run the numbers: at 100% utilization the dedicated instance delivers ~3.6M tokens for $20, roughly $5.50 per million, which is about the same as the serverless price. The break-even is therefore $20/hour ÷ $5 per 1M tokens = 4M tokens per hour of sustained load; with aggressive batching, modern inference servers push well past 1,000 tokens/sec, and at several thousand tokens per second Provisioned becomes drastically cheaper per token.
The Rule of Thumb: once you are processing more than 10 million tokens per day of steady traffic, start modeling the Provisioned numbers (a sketch follows below). Somewhere past that point it becomes cheaper AND faster; below it, stay Serverless.
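Here is a minimal sketch of that comparison using the illustrative prices above; swap in your own token price, hourly rate, and measured throughput:
Python
# Break-even between pay-per-token (serverless) and pay-per-hour (provisioned).
# Prices are the illustrative figures from the text, not real quotes.

SERVERLESS_PER_M_TOKENS = 5.00   # $ per 1M tokens
PROVISIONED_PER_HOUR = 20.00     # $ per hour, paid whether or not it is busy

# Break-even throughput: the hourly fee divided by the per-token price.
break_even_per_hour = PROVISIONED_PER_HOUR / (SERVERLESS_PER_M_TOKENS / 1e6)
print(f"Break-even: {break_even_per_hour / 1e6:.0f}M tokens/hour sustained")  # 4M

def monthly_cost(tokens_per_day: float) -> tuple[float, float]:
    serverless = tokens_per_day / 1e6 * SERVERLESS_PER_M_TOKENS * 30
    provisioned = PROVISIONED_PER_HOUR * 24 * 30   # one always-on instance
    return serverless, provisioned

for tpd in (1e6, 10e6, 100e6):
    s, p = monthly_cost(tpd)
    print(f"{tpd / 1e6:>5.0f}M tokens/day: serverless ${s:>9,.0f}/mo, provisioned ${p:>9,.0f}/mo")

With an always-on $20/hour instance the crossover under these prices sits near 100M tokens/day; a cheaper dedicated GPU (the $2,500/month A10G from the earlier table works out to roughly $3.50/hour) pulls it down toward the 15-20M/day range, which is why the rule of thumb lands near 10 million.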
Part 4: Technical Deep Dive: Terraform for Provisioned Bedrock
Setting up Provisioned Throughput is not just a UI click. It requires Infrastructure as Code to ensure you don't accidentally leave a $20/hour instance running for a year (that's $175,000).
Terraform
resource "aws_bedrock_provisioned_model_throughput" "main" {
  provisioned_model_name = "my-company-throughput-v1"
  # The resource takes the foundation model ARN rather than the bare model ID
  model_arn           = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
  model_units         = 1          # 1 Unit = specific capacity (e.g. ~15k tokens/min)
  commitment_duration = "OneMonth" # Discounts for longer commitments
}

resource "aws_lambda_function" "invoker" {
  function_name = "bedrock-invoker"
  role          = aws_iam_role.lambda_exec.arn # execution role defined elsewhere
  runtime       = "python3.12"
  handler       = "handler.handler"
  filename      = "invoker.zip"

  # Environment variable points to the ARN of the Provisioned Model
  environment {
    variables = {
      MODEL_ARN = aws_bedrock_provisioned_model_throughput.main.provisioned_model_arn
    }
  }
}
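On the application side, the handler simply points boto3 at the provisioned ARN instead of an on-demand model ID. A minimal sketch (the request body follows the Bedrock Messages format for Anthropic models; error handling is omitted and the event shape is assumed):
Python
# handler.py - minimal Lambda handler calling the provisioned throughput endpoint.
# MODEL_ARN is injected by the Terraform above; invoke_model accepts a
# provisioned model ARN in place of a foundation model ID.
import json
import os

import boto3

bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    response = bedrock.invoke_model(
        modelId=os.environ["MODEL_ARN"],
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": event["prompt"]}],  # assumed event shape
        }),
    )
    payload = json.loads(response["body"].read())
    return {"completion": payload["content"][0]["text"]}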
Part 5: The KV Cache Problem (State)
State is the enemy of serverless.
In LLMs, the "KV Cache" (Key-Value Cache) stores the attention mechanism's history.
If you send a 10,000-token prompt, the GPU computes a KV Cache for it that runs from hundreds of megabytes to several gigabytes, depending on the model.
If the next request (the follow-up question) lands on a different GPU, that cache is lost. It must be recomputed. This wastes money.
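That discarded cache is not small. A rough sizing sketch; the layer and head counts below are assumptions for a Llama-3-70B-style architecture with grouped-query attention:
Python
# Rough KV-cache size per request. Architecture numbers are assumptions
# for a Llama-3-70B-style model with grouped-query attention (GQA).

LAYERS = 80
KV_HEADS = 8          # GQA: far fewer KV heads than attention heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16

def kv_cache_gb(prompt_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    return prompt_tokens * per_token / 1024**3

print(f"10,000-token prompt: ~{kv_cache_gb(10_000):.1f} GB of KV cache")  # ~3 GB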
Future Tech: Prompt Caching. Vendors like Anthropic are introducing "Prompt Caching" where they keep the KV Cache alive for 5 minutes. This allows "Serverless" to behave like "Stateful," dramatically reducing costs for long-context chats.
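You can already opt into this explicitly. A minimal sketch with the Anthropic Python SDK; the model name and document are placeholders, and the cache_control marker is what asks the API to keep the prefix's state warm between calls:
Python
# Prompt caching sketch with the Anthropic SDK: mark a long, stable prefix as
# cacheable so follow-up questions reuse its KV state. Placeholders throughout.
import anthropic

client = anthropic.Anthropic()                # reads ANTHROPIC_API_KEY from the environment
long_document = open("contract.txt").read()   # assumed ~10k-token context

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": long_document,
        "cache_control": {"type": "ephemeral"},  # cache this prefix (short TTL)
    }],
    messages=[{"role": "user", "content": "Summarize the termination clause."}],
)
print(response.content[0].text)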
Part 6: LoRA Adapters - The Middle Way
There is a compromise.
LoRA (Low-Rank Adaptation) allows you to fine-tune a model with a tiny adapter file (100MB) instead of the full weights (140GB).
Serverless LoRA Architecture:
Keep the Base Model (Llama 3) frozen in VRAM (Shared across 500 customers).
When a user request comes in, swap in only their 100MB LoRA adapter.
Run inference.
Swap it out.
This allows "Multi-Tenant" fine-tuning. It is how platforms like Predibase and Replicate work. It enables the "App Store of Models" where every user has a custom model, but the backend is shared.
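Serving frameworks already support this hot-swap pattern. A minimal sketch using vLLM's multi-LoRA support; the model name and adapter paths are placeholders, and in production the per-customer adapters would typically be pulled from object storage:
Python
# Multi-tenant LoRA sketch with vLLM: one frozen base model shared by all
# customers, a small per-request adapter selected at call time.
# Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

ADAPTERS = {
    "customer_a": LoRARequest("customer_a", 1, "/adapters/customer_a"),
    "customer_b": LoRARequest("customer_b", 2, "/adapters/customer_b"),
}

def generate(customer: str, prompt: str) -> str:
    # The ~100MB adapter is applied on top of the shared base weights for this call only.
    outputs = llm.generate([prompt], params, lora_request=ADAPTERS[customer])
    return outputs[0].outputs[0].text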
Part 7: Provider Landscape Analysis
| Provider | Type | Best Use Case |
| :--- | :--- | :--- |
| AWS Bedrock | Enterprise Wrapper | Security compliance (HIPAA/SOC2). |
| Replicate | Serverless Endpoint | Hobbyists, Prototyping, Image Gen (Flux/SD). |
| Modal | Python Container | Engineers who want total control over the CUDA kernel. |
| RunPod | GPU Rental | Lowest cost. You manage the Docker container manually. |
Part 8: Implementation Checklist
Calculate your daily token/request volume. Is it > 10M tokens?
Test Cold Starts: If >5s is unacceptable, use Provisioned.
Analyze VRAM requirements: Does your model fit on an A10G (24GB) or do you need an H100 (80GB)? (A sizing sketch follows this checklist.)
Use Terraform: Automate the provisioning/deprovisioning of PTUs.
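For the VRAM question, a rough sizing sketch; the 20% headroom factor is an assumption, and real usage depends heavily on batch size and context length:
Python
# Rough VRAM sizing: FP16 weights plus headroom vs. the GPUs mentioned above.
# The 20% headroom factor is an assumption; KV cache and batching add more.

GPUS_GB = {"A10G": 24, "H100": 80}

def vram_needed_gb(params_billion: float, bytes_per_param: int = 2, headroom: float = 1.2) -> float:
    return params_billion * bytes_per_param * headroom

for size in (7, 13, 70):
    need = vram_needed_gb(size)
    fits = [gpu for gpu, vram in GPUS_GB.items() if need <= vram] or ["multi-GPU / quantization"]
    print(f"{size}B (FP16): ~{need:.0f} GB -> {', '.join(fits)}")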
Part 9: Glossary
Cold Start: The latency incurred when loading a model into VRAM.
KV Cache: Key-Value cache of attention states. Critical for chat performance.
LoRA: Low-Rank Adaptation. A method to fine-tune large models cheaply.
Provisioned Throughput: Reserving dedicated GPU capacity for a fixed fee.
Over-provisioning: Buying more capacity than you need to handle spikes.
Scale to Zero: The ability to pay $0 when no traffic is present.
Concurrency: The number of requests a function can handle simultaneously.
Pro Tip: The AWS Savings Plan Hack
Most people think Savings Plans only apply to EC2. Wrong.
Compute Savings Plans apply to Fargate and Lambda too.
If you commit to $10/hour of spend, you get roughly a 17% discount on Lambda. This is "Free Money" if you have a baseline load; checking it can save thousands a year.
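A quick sketch of what that commitment is worth, assuming it is fully utilized and the discount is a flat 17%:
Python
# Value of a $10/hour Compute Savings Plan commitment on Lambda, assuming the
# commitment is fully utilized and the discount is a flat 17%.

COMMIT_PER_HOUR = 10.00
DISCOUNT = 0.17
HOURS_PER_YEAR = 24 * 365

committed = COMMIT_PER_HOUR * HOURS_PER_YEAR     # $87,600 paid per year
covered_on_demand = committed / (1 - DISCOUNT)   # ~$105,500 of on-demand usage covered
print(f"Pay ${committed:,.0f}/yr, cover ~${covered_on_demand:,.0f} of on-demand Lambda, "
      f"save ~${covered_on_demand - committed:,.0f}")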
Conclusion
For 99% of startups, Serverless is the right choice. The operational overhead of managing GPUs (plus the SRE salaries) is not worth it.
For Enterprises with strict SLAs and predictable load, Provisioned is mandatory.
The future is Hybrid: Serverless for the "Base Performance" and LoRA adapters for the "Vertical Expertise."
All in One Place
Atler Pilot decodes your cloud spend story by bringing monitoring, automation, and intelligent insights together for faster and better cloud operations.

