In the CPU world (AWS Lambda), "Serverless" is magical. You send a request, a micro-VM spins up in 20ms, runs your code, and dies. It works because a Node.js runtime is tiny (50MB) and RAM is fast.
In the GPU world, "Serverless" is a physics problem.
To run Llama-3-70B, you need to load 140GB of weights into GPU VRAM. Even with PCIe Gen 5 (64GB/s bandwidth), this takes several seconds. You cannot spin up a GPU in 20ms.
This creates the fundamental dilemma of AI Infrastructure: Do I keep the GPU warm (Provisioned) or accept the latency (Serverless)?
Part 1: The Physics of Cold Starts
Let's do the math on loading a large model. This is the math that keeps Infrastructure Engineers up at night.
The Physics of VRAM Loading:
Model Size: 70 Billion Parameters (FP16)
Weight Size: 140 Gigabytes (2 bytes per param)
PCIe Gen4 Bandwidth: ~32 GB/s (theoretical max for an x16 link)
Minimum Load Time: 140 GB / 32 GB/s = 4.375 Seconds
This is the theoretical minimum. In reality, with driver overhead, weight initialization, and CUDA context creation, a "Cold Start" for a 70B model is often 15-30 seconds. For a Chatbot, this is unacceptable: a user will refresh the page if the loading spinner runs for more than 3 seconds.
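A quick back-of-the-envelope script makes the scaling obvious. The bandwidth and overhead figures are illustrative assumptions, not benchmarks:
Python
# Back-of-the-envelope cold-start estimate for loading model weights into VRAM.
# Bandwidth and fixed-overhead figures are illustrative assumptions.

def cold_start_seconds(params_billion: float,
                       bytes_per_param: int = 2,     # FP16
                       pcie_gb_per_s: float = 32.0,  # PCIe Gen4 x16
                       overhead_s: float = 10.0) -> float:
    """Weight-transfer time plus fixed overhead (CUDA context, tokenizer, warm-up)."""
    weight_gb = params_billion * bytes_per_param    # 70B * 2 bytes = 140 GB
    return weight_gb / pcie_gb_per_s + overhead_s   # 140 / 32 = 4.375 s, plus overhead

for size in (7, 13, 70):
    print(f"{size}B model: ~{cold_start_seconds(size):.1f} s cold start")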
The Energy Cost of Cold Starts
It's not just time; it's energy. Spinning up a GPU, loading 140GB of data, and then shutting it down uses significant power. If you only process 1 token, the "Energy Overhead" of the cold start is 99% of the total energy used. This is why "Scale to Zero" is environmentally expensive for AI.
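To put a number on that claim, here is a rough sketch; the power draw, cold-start duration, and per-token energy are assumptions chosen only to illustrate the ratio:
Python
# Rough energy-overhead estimate for a cold start that serves a single token.
# Power draw, cold-start duration, and per-token energy are assumed values.

GPU_POWER_W = 400          # assumed average draw while loading weights
COLD_START_S = 20          # assumed cold-start duration
ENERGY_PER_TOKEN_J = 0.5   # assumed marginal energy per generated token

cold_start_j = GPU_POWER_W * COLD_START_S   # 8,000 J spent before any useful work
useful_j = ENERGY_PER_TOKEN_J * 1           # 1 token served

overhead = cold_start_j / (cold_start_j + useful_j)
print(f"Cold-start share of total energy: {overhead:.2%}")  # > 99%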
The Break-Even Calculus (2025 Estimates)
| Traffic Pattern | Serverless (Lambda) Cost | Provisioned (EC2) Cost | Winner |
| :--- | :--- | :--- | :--- |
| Spiky (0 to 10k req/sec) | $500/mo | $1,200/mo (Over-provisioned) | Serverless |
| Steady (50 req/sec) | $2,000/mo | $400/mo (small EC2 fleet) | Provisioned |
| AI Inference (GPU) | $10,000/mo (Endpoint) | $2,500/mo (Dedicated A10G) | Provisioned |
Tool: AWS Lambda Power Tuning. Don't guess memory settings; measure them. This state machine runs your function at 128MB, 256MB, ... 2048MB and plots Cost vs Speed (the branch bodies and the Visualize state are elided here):
JSON
{
  "Comment": "Power Tuning State Machine",
  "StartAt": "Initialize",
  "States": {
    "Initialize": {
      "Type": "Pass",
      "Next": "ParallelExec"
    },
    "ParallelExec": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "Run128MB", "States": { ... } },
        { "StartAt": "Run256MB", "States": { ... } }
      ],
      "Next": "Visualize"
    }
  }
}
Part 2: The Two Patterns
1. Serverless Inference (Token-Based)
Vendors: OpenAI API, Anthropic API, AWS Bedrock (On-Demand), Replicate.
How it works: The vendor runs a massive cluster of permanently warm GPUs. They multiplex thousands of customers onto the same batch.
Pros: Pay per token. Zero idle cost. Ideal for "Spiky" traffic.
Cons: No guarantee of throughput. You share the queue with everyone else. If a new viral app launches (like ChatGPT did), your latency spikes because the shared queues are full.
2. Provisioned Throughput (Time-Based)
Vendors: AWS Bedrock Provisioned, Azure PTU, SageMaker Endpoints, BentoML.
How it works: You reserve a specific number of H100s (e.g., 8 GPUs) for a specific time (e.g., 1 month). They are dedicated to you. They sit idle when you don't use them.
Pros: Guaranteed latency. No cold starts. Privacy (single tenant).
Cons: You pay even if no one is using it ($30/hour/instance). This is "Capital Expenditure" logic in an OpEx world.
Part 3: The Tipping Point Calculation
When does it make sense to switch from Pay-per-Token to Pay-per-Hour? This "Break-Even Point" is the most common question in AI Finance.
Let's assume:
Serverless Cost: $5.00 / 1M tokens.
Provisioned Cost: $20.00 / hour (Capacity: ~1000 tokens/sec implies ~3.6M tokens/hour max).
Run the numbers: at 100% utilization the dedicated instance delivers ~3.6M tokens for $20, roughly $5.50 per million, which is about the same as the serverless price. The break-even is therefore $20/hour ÷ $5 per 1M tokens = 4M tokens per hour of sustained load; with aggressive batching, modern inference servers push well past 1,000 tokens/sec, and at several thousand tokens per second Provisioned becomes drastically cheaper per token.
The Rule of Thumb: once you are processing more than 10 million tokens per day of steady traffic, start modeling the Provisioned numbers (a sketch follows below). Somewhere past that point it becomes cheaper AND faster; below it, stay Serverless.
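Here is a minimal sketch of that comparison using the illustrative prices above; swap in your own token price, hourly rate, and measured throughput:
Python
# Break-even between pay-per-token (serverless) and pay-per-hour (provisioned).
# Prices are the illustrative figures from the text, not real quotes.

SERVERLESS_PER_M_TOKENS = 5.00   # $ per 1M tokens
PROVISIONED_PER_HOUR = 20.00     # $ per hour, paid whether or not it is busy

# Break-even throughput: the hourly fee divided by the per-token price.
break_even_per_hour = PROVISIONED_PER_HOUR / (SERVERLESS_PER_M_TOKENS / 1e6)
print(f"Break-even: {break_even_per_hour / 1e6:.0f}M tokens/hour sustained")  # 4M

def monthly_cost(tokens_per_day: float) -> tuple[float, float]:
    serverless = tokens_per_day / 1e6 * SERVERLESS_PER_M_TOKENS * 30
    provisioned = PROVISIONED_PER_HOUR * 24 * 30   # one always-on instance
    return serverless, provisioned

for tpd in (1e6, 10e6, 100e6):
    s, p = monthly_cost(tpd)
    print(f"{tpd / 1e6:>5.0f}M tokens/day: serverless ${s:>9,.0f}/mo, provisioned ${p:>9,.0f}/mo")

With an always-on $20/hour instance the crossover under these prices sits near 100M tokens/day; a cheaper dedicated GPU (the $2,500/month A10G from the earlier table works out to roughly $3.50/hour) pulls it down toward the 15-20M/day range, which is why the rule of thumb lands near 10 million.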
Part 4: Technical Deep Dive: Terraform for Provisioned Bedrock
Setting up Provisioned Throughput is not just a UI click. It requires Infrastructure as Code to ensure you don't accidentally leave a $20/hour instance running for a year (that's $175,000).
Terraform
resource "aws_bedrock_provisioned_model_throughput" "main" {
  provisioned_model_name = "my-company-throughput-v1"
  # The resource takes the foundation model ARN rather than the bare model ID
  model_arn           = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
  model_units         = 1          # 1 Unit = specific capacity (e.g. ~15k tokens/min)
  commitment_duration = "OneMonth" # Discounts for longer commitments
}

resource "aws_lambda_function" "invoker" {
  function_name = "bedrock-invoker"
  role          = aws_iam_role.lambda_exec.arn # execution role defined elsewhere
  runtime       = "python3.12"
  handler       = "handler.handler"
  filename      = "invoker.zip"

  # Environment variable points to the ARN of the Provisioned Model
  environment {
    variables = {
      MODEL_ARN = aws_bedrock_provisioned_model_throughput.main.provisioned_model_arn
    }
  }
}
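On the application side, the handler simply points boto3 at the provisioned ARN instead of an on-demand model ID. A minimal sketch (the request body follows the Bedrock Messages format for Anthropic models; error handling is omitted and the event shape is assumed):
Python
# handler.py - minimal Lambda handler calling the provisioned throughput endpoint.
# MODEL_ARN is injected by the Terraform above; invoke_model accepts a
# provisioned model ARN in place of a foundation model ID.
import json
import os

import boto3

bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    response = bedrock.invoke_model(
        modelId=os.environ["MODEL_ARN"],
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": event["prompt"]}],  # assumed event shape
        }),
    )
    payload = json.loads(response["body"].read())
    return {"completion": payload["content"][0]["text"]}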
Part 5: The KV Cache Problem (State)
State is the enemy of serverless.
In LLMs, the "KV Cache" (Key-Value Cache) stores the attention mechanism's history.
If you send a 10,000-token prompt, the GPU computes a KV Cache for it that runs from hundreds of megabytes to several gigabytes, depending on the model.
If the next request (the follow-up question) lands on a different GPU, that cache is lost. It must be recomputed. This wastes money.
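That discarded cache is not small. A rough sizing sketch; the layer and head counts below are assumptions for a Llama-3-70B-style architecture with grouped-query attention:
Python
# Rough KV-cache size per request. Architecture numbers are assumptions
# for a Llama-3-70B-style model with grouped-query attention (GQA).

LAYERS = 80
KV_HEADS = 8          # GQA: far fewer KV heads than attention heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16

def kv_cache_gb(prompt_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    return prompt_tokens * per_token / 1024**3

print(f"10,000-token prompt: ~{kv_cache_gb(10_000):.1f} GB of KV cache")  # ~3 GB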
Future Tech: Prompt Caching. Vendors like Anthropic are introducing "Prompt Caching" where they keep the KV Cache alive for 5 minutes. This allows "Serverless" to behave like "Stateful," dramatically reducing costs for long-context chats.
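You can already opt into this explicitly. A minimal sketch with the Anthropic Python SDK; the model name and document are placeholders, and the cache_control marker is what asks the API to keep the prefix's state warm between calls:
Python
# Prompt caching sketch with the Anthropic SDK: mark a long, stable prefix as
# cacheable so follow-up questions reuse its KV state. Placeholders throughout.
import anthropic

client = anthropic.Anthropic()                # reads ANTHROPIC_API_KEY from the environment
long_document = open("contract.txt").read()   # assumed ~10k-token context

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": long_document,
        "cache_control": {"type": "ephemeral"},  # cache this prefix (short TTL)
    }],
    messages=[{"role": "user", "content": "Summarize the termination clause."}],
)
print(response.content[0].text)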
Part 6: LoRA Adapters - The Middle Way
There is a compromise.
LoRA (Low-Rank Adaptation) allows you to fine-tune a model with a tiny adapter file (100MB) instead of the full weights (140GB).
Serverless LoRA Architecture:
Keep the Base Model (Llama 3) frozen in VRAM (Shared across 500 customers).
When a user request comes in, swap in only their 100MB LoRA adapter.
Run inference.
Swap it out.
This allows "Multi-Tenant" fine-tuning. It is how platforms like Predibase and Replicate work. It enables the "App Store of Models" where every user has a custom model, but the backend is shared.
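Serving frameworks already support this hot-swap pattern. A minimal sketch using vLLM's multi-LoRA support; the model name and adapter paths are placeholders, and in production the per-customer adapters would typically be pulled from object storage:
Python
# Multi-tenant LoRA sketch with vLLM: one frozen base model shared by all
# customers, a small per-request adapter selected at call time.
# Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

ADAPTERS = {
    "customer_a": LoRARequest("customer_a", 1, "/adapters/customer_a"),
    "customer_b": LoRARequest("customer_b", 2, "/adapters/customer_b"),
}

def generate(customer: str, prompt: str) -> str:
    # The ~100MB adapter is applied on top of the shared base weights for this call only.
    outputs = llm.generate([prompt], params, lora_request=ADAPTERS[customer])
    return outputs[0].outputs[0].text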
Part 7: Provider Landscape Analysis
| Provider | Type | Best Use Case |
| :--- | :--- | :--- |
| AWS Bedrock | Enterprise Wrapper | Security compliance (HIPAA/SOC2). |
| Replicate | Serverless Endpoint | Hobbyists, Prototyping, Image Gen (Flux/SD). |
| Modal | Python Container | Engineers who want total control over the CUDA kernel. |
| RunPod | GPU Rental | Lowest cost. You manage the Docker container manually. |
Part 8: Implementation Checklist
Calculate your daily token/request volume. Is it > 10M tokens?
Test Cold Starts: If >5s is unacceptable, use Provisioned.
Analyze VRAM requirements: Does your model fit on an A10G (24GB) or do you need an H100 (80GB)? (A sizing sketch follows this checklist.)
Use Terraform: Automate the provisioning/deprovisioning of PTUs.
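For the VRAM question, a rough sizing sketch; the 20% headroom factor is an assumption, and real usage depends heavily on batch size and context length:
Python
# Rough VRAM sizing: FP16 weights plus headroom vs. the GPUs mentioned above.
# The 20% headroom factor is an assumption; KV cache and batching add more.

GPUS_GB = {"A10G": 24, "H100": 80}

def vram_needed_gb(params_billion: float, bytes_per_param: int = 2, headroom: float = 1.2) -> float:
    return params_billion * bytes_per_param * headroom

for size in (7, 13, 70):
    need = vram_needed_gb(size)
    fits = [gpu for gpu, vram in GPUS_GB.items() if need <= vram] or ["multi-GPU / quantization"]
    print(f"{size}B (FP16): ~{need:.0f} GB -> {', '.join(fits)}")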
Part 9: Glossary
Cold Start: The latency incurred when loading a model into VRAM.
KV Cache: Key-Value cache of attention states. Critical for chat performance.
LoRA: Low-Rank Adaptation. A method to fine-tune large models cheaply.
Provisioned Throughput: Reserving dedicated GPU capacity for a fixed fee.
Over-provisioning: Buying more capacity than you need to handle spikes.
Scale to Zero: The ability to pay $0 when no traffic is present.
Concurrency: The number of requests a function can handle simultaneously.
Pro Tip: The AWS Savings Plan Hack
Most people think Savings Plans only apply to EC2. Wrong.
Compute Savings Plans apply to Fargate and Lambda too.
If you commit to $10/hour of spend, you get roughly a 17% discount on Lambda. This is "Free Money" if you have a baseline load; checking it can save thousands a year.
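A quick sketch of what that commitment is worth, assuming it is fully utilized and the discount is a flat 17%:
Python
# Value of a $10/hour Compute Savings Plan commitment on Lambda, assuming the
# commitment is fully utilized and the discount is a flat 17%.

COMMIT_PER_HOUR = 10.00
DISCOUNT = 0.17
HOURS_PER_YEAR = 24 * 365

committed = COMMIT_PER_HOUR * HOURS_PER_YEAR     # $87,600 paid per year
covered_on_demand = committed / (1 - DISCOUNT)   # ~$105,500 of on-demand usage covered
print(f"Pay ${committed:,.0f}/yr, cover ~${covered_on_demand:,.0f} of on-demand Lambda, "
      f"save ~${covered_on_demand - committed:,.0f}")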
Conclusion
For 99% of startups, Serverless is the right choice. The operational overhead of managing GPUs (plus the SRE salaries) is not worth it.
For Enterprises with strict SLAs and predictable load, Provisioned is mandatory.
The future is Hybrid: Serverless for the "Base Performance" and LoRA adapters for the "Vertical Expertise."
All in One Place
Atler Pilot decodes your cloud spend story by bringing monitoring, automation, and intelligent insights together for faster and better cloud operations.

