Self-Hosted LLM: An Architect's Guide to TCO and When It Beats an API
A self hosted LLM is a GPU-utilization bet, not a privacy purchase. The real TCO drivers, serving stacks, a computable break-even, and when self-hosting beats an API.
Most writeups on self-hosting LLM infrastructure start with a docker command and end with a screenshot of a chat window. That's the easy part of the decision, and it's the part that fools teams into a budget they regret. The hard part is everything the chat window hides: a GPU that bills you whether or not anyone is using it, an on-call rotation that didn't exist before, and a model-update treadmill that a hosted API quietly runs for you. We've sat in the room when a CTO signs off on a cluster of H100s because privacy sounded non-negotiable, and we've watched the same cluster run far below break-even utilization a few months later, costing more per token than the API it replaced.
This is the decision guide we wish those teams had read first. It's written for the person who has to defend the infrastructure bill, not for the hobbyist running a private assistant on a gaming GPU. We'll define what a self hosted LLM actually is at the control boundary, show you the four cost drivers nobody quotes, prove that GPU utilization is the variable that decides the whole thing, compare the serving stacks you'd actually run in production, and give you a break-even you can compute with your own numbers. There's a runnable deployment near the end, and a decision framework that tells you, honestly, when to not self-host at all.
What a self hosted LLM actually is, and what it is not
A self hosted LLM is an open-weight model that runs on infrastructure you control: your own cloud GPUs, a private VPC, or on-prem servers, with a serving engine in front of it that turns model weights into an API your applications can call. The defining property is the control boundary. With a hosted API like AWS Bedrock or Azure OpenAI, the boundary sits at the network edge: you send tokens out and pay per token, and the provider owns the GPUs, the scaling, the patching, and the uptime. When you self-host, that boundary moves inward to your own racks or instances, and everything inside it becomes your responsibility, including the parts that don't show up in a deployment tutorial.
It helps to be precise about what self hosting an LLM is not. It is not the same as running a private chat app on your laptop with Ollama, which is a genuinely useful workflow but answers a personal-productivity question, not an infrastructure one. It is not fine-tuning, though people often conflate the two; you can self-host a stock open-weight model with zero fine-tuning, and you can fine-tune a model you then serve through a hosted endpoint. And it is not automatically more private in a way that matters legally, because a hosted API under a zero-retention enterprise agreement can satisfy the same compliance requirement that pushes many teams toward open source LLM hosting in the first place. The diagram below shows where the boundary actually moves.
When self-hosting an LLM beats an API (and when it does not)
Self-hosting wins in a narrower band of cases than the marketing implies, and it loses badly outside that band. The honest answer is that it depends on three things: how much volume you push, how predictable that volume is, and whether a specific compliance or latency constraint genuinely forecloses a hosted option. A team running a few million tokens a month with spiky traffic and no hard data-residency rule should almost never self-host, because they'll pay for idle GPUs around the clock to serve a load that a per-token API would have priced at coffee money. A team running tens of billions of tokens a month at steady utilization, or one under a regulator that won't accept any third-party processor, is exactly who self hosting an LLM was built for.
| Your situation | Self-host wins | Hosted API wins | Why |
|---|---|---|---|
| Steady, high-volume inference | Yes — utilization stays high | No | Amortized GPU undercuts per-token billing |
| Spiky or low volume | No | Yes — pay only for tokens used | Idle GPUs bill 24/7 regardless of load |
| Hard data-residency / no third party | Yes — boundary stays internal | Only with zero-retention BAA | Self-host removes the processor entirely |
| Need frontier-level quality | Rarely | Yes | Best open-weight still trails the closed frontier on hard tasks |
| Tight, predictable tail latency | Yes — dedicated capacity | Depends on provider queue | Your hardware, your queue, no noisy neighbors |
| Small team, no infra muscle | No | Yes | Day-2 ops cost exceeds the API bill you avoided |
The real TCO of a self hosted LLM: the four cost drivers nobody quotes
When a deployment guide says self-hosting is cheaper at high volume, it's quietly assuming away three of the four things that actually drive total cost of ownership. The GPU is the only line item most people price, and it's the one that's easiest to get right. The other three (operations, utilization, and the model-update treadmill) are where the budget goes to die. Here's the shape of the four drivers before we dig into the one that dominates.
The GPU line is the most quoted and the least interesting, because it's a list price you can look up. A single H100 80GB on-demand sits in a broad band depending on cloud and region, and reserved pricing cuts it by a third to a half if you commit for a year. What that number does not tell you is your cost per token, because cost per token is the GPU's hourly price divided by the tokens you actually serve in that hour. Serve a lot of tokens and the GPU is cheap; serve a few and the same GPU is ruinous. That ratio is utilization, and it's the next section because it's the single number that decides whether self hosting an LLM was a good idea.
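The ratio described above is small enough to write down. The sketch below is illustrative only: the function name and the $3.00 hourly price are assumptions chosen for round numbers, not quoted cloud rates.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_served_per_hour: float) -> float:
    """Effective $/M tokens: the hourly GPU price divided by tokens actually served."""
    if tokens_served_per_hour <= 0:
        return float("inf")  # an idle GPU still bills, so cost per token is unbounded
    return gpu_cost_per_hour / (tokens_served_per_hour / 1_000_000)


# Hypothetical $3.00/hour GPU, same hardware, two different loads.
busy = cost_per_million_tokens(3.00, 6_000_000)   # a busy hour
quiet = cost_per_million_tokens(3.00, 600_000)    # a quiet hour
print(f"busy: ${busy:.2f}/M tokens, quiet: ${quiet:.2f}/M tokens")
```

The hardware price is identical in both calls; only the tokens served changes, and the per-token cost moves by the same factor.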
GPU economics: why utilization is the whole game
A dedicated GPU bills the same per hour whether it serves one request or saturates its batch. That single fact is the reason self-hosting is either a bargain or a disaster, with very little middle ground. Imagine a reserved H100 that costs you a fixed amount per hour. If it runs at 90% utilization, that hourly cost is spread across an enormous number of tokens and your amortized per-token price drops below what hosted mid-tier models like Claude Sonnet 4.6 or GPT-5 mini charge. If the same GPU runs at 15% utilization, which is exactly what happens when you provision for peak and your traffic is bursty, the identical hourly cost is spread across a sixth of the tokens, so your effective price per million tokens is six times what it was at 90%, and that typically lands well above the hosted API you were trying to beat.
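The 90% versus 15% comparison can be written as one formula. The symbols are introduced here for illustration: $P$ is the hourly GPU price in dollars, $T$ is the engine's peak throughput in tokens per second, $u$ is sustained utilization between 0 and 1, and $c(u)$ is the effective price per million tokens.

```latex
c(u) = \frac{P}{3600 \, T \, u} \times 10^{6}
\qquad\Longrightarrow\qquad
\frac{c(0.15)}{c(0.90)} = \frac{0.90}{0.15} = 6
```

Both $P$ and $T$ cancel in the ratio, which is why the sixfold gap holds for any GPU price and any model: it depends only on the two utilization figures.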
The bars above are relative, not absolute, because your actual numbers depend on the model size, the GPU, and your region. To make it concrete: in a 2026 internal sizing exercise on a 70B-class open-weight deployment, moving the same hardware from roughly 25% to 85% sustained utilization cut the effective price per million tokens by about 70%, with not one extra dollar of hardware spend. The shape is universal and it's the most important thing on this page: per-token cost is inversely proportional to utilization, so a self hosted LLM is a utilization bet first and a hardware purchase second. This is why continuous batching matters so much. A modern serving engine packs many concurrent requests into a single forward pass, which is the mechanism that lets you push utilization from the low range a naive deployment achieves up into the high range where the economics flip in your favor. Buy the GPU and forget the batching, and you've bought the expensive half of the deal without the half that pays for it.
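To see why continuous batching moves utilization, a toy model helps. The sketch below counts forward passes for a fixed set of requests under two policies. It is a deliberately simplified illustration, not how vLLM or any real engine is implemented: it ignores prefill, KV-cache memory limits, and arrival times, and both function names are made up for this example.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Forward passes when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Forward passes when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate, per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass emits one token per running request
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps


# Output lengths in tokens (each >= 1): two long requests among short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=4)
    print(f"{name}: {steps} passes, slot utilization {total_tokens / (steps * 4):.0%}")
```

With static batching, the short requests finish early and their slots sit empty until the long request in the same batch completes. Refilling those slots as they free up is what lets the same hardware emit the same tokens in fewer passes.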
The serving stack: vLLM, TGI, SGLang, Ollama, BentoML compared
The serving engine is the piece that turns weights into utilization, so it's not a detail you delegate to whatever the tutorial used. There are a handful that matter for a production self hosted LLM, and they trade off raw throughput against operational ergonomics. The table below is the honest version, including where each one is the wrong choice. The headline: vLLM is the throughput default most teams should start from, SGLang competes hard on structured-output-heavy and high-concurrency workloads, TGI is the well-trodden HuggingFace path, BentoML wraps any of them in deployment and autoscaling glue, and Ollama is for developer machines and prototypes, not for serving a team.
| Engine | Best for | Strength | Where it is the wrong choice |
|---|---|---|---|
| vLLM | Most production serving | PagedAttention + continuous batching = high utilization | Bleeding-edge model formats land slightly later |
| SGLang | High-concurrency, structured output | Fast constrained decoding, strong throughput | Smaller community than vLLM for edge cases |
| TGI | HuggingFace-native shops | Mature, well-documented, easy model pulls | Throughput trails vLLM on some workloads |
| BentoML | Packaging + autoscaling any engine | Deployment glue, scale-to-zero, ops surface | Not an inference kernel itself — wraps one |
| Ollama | Dev machines, prototypes, demos | One-command local runs, great DX | Not a multi-tenant production serving layer |
Which open-weight model you serve on that engine is a separate axis, and it moves faster than any table can keep up with. As of 2026 the workhorses are the Llama 4 family for general production, Mistral and Mixtral variants where you want strong quality at smaller footprints, and DeepSeek models where reasoning and coding matter and you're comfortable with the operational footprint of a larger model. The serving engine and the model are independent choices: pick the engine for throughput and ops fit, pick the model for quality on your eval set, and never let one tutorial bundle the two decisions for you.
Reference architecture for a production self hosted LLM
A production self hosted LLM is not a single GPU with a model on it; it's a small distributed system with a few non-negotiable parts. In front sits a gateway that handles auth, rate limiting, and request logging, so that your applications talk to one stable endpoint instead of to raw inference replicas. Behind it, a router load-balances across a pool of serving replicas, each running your engine of choice on a GPU, with continuous batching keeping every replica busy. An autoscaler watches queue depth and adds or removes replicas, ideally with scale-to-zero for non-production tiers so idle environments stop billing. Around all of it, an observability layer captures latency, tokens, and errors per request. The SVG below is the shape we deploy from.
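The autoscaler's core decision can be sketched as a proportional rule on queue depth. This is a minimal illustration under stated assumptions, not the logic of any particular autoscaler: the function name, the per-replica target, and the replica bounds are all hypothetical, and a real deployment would add a cooldown so replicas do not thrash.

```python
import math


def desired_replicas(total_queued: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one holds roughly target_per_replica queued requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(total_queued / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 allows scale-to-zero.
    return max(min_replicas, min(max_replicas, wanted))


# Non-production tier: an empty queue scales to zero and stops billing.
print(desired_replicas(total_queued=0, target_per_replica=32, min_replicas=0, max_replicas=8))
# Production tier: keep a floor of two replicas, grow with the queue, cap at eight.
print(desired_replicas(total_queued=100, target_per_replica=32, min_replicas=2, max_replicas=8))
```

Setting the floor per environment is the design choice that matters here: production keeps warm capacity for latency, while idle staging environments fall to zero.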
A deployment you can stand up this week
The day-1 deployment really is straightforward, and it's worth doing precisely so you can measure utilization on real traffic before you commit to reserved capacity. The three snippets below stand up a vLLM server with an OpenAI-compatible API, scale it to a few replicas on Kubernetes, and call it from application code exactly as you'd call a hosted endpoint. Run this against a shadow copy of your real traffic for a week, watch the utilization, and let that number decide the reserved-instance question.
# Stand up an OpenAI-compatible self hosted LLM with vLLM, sharded across two GPUs
# (that is what --tensor-parallel-size 2 requests).
# Continuous batching is on by default; this is what drives utilization.
pip install vllm
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --port 8000
# --max-num-seqs is your batch ceiling: raise it until GPU memory or
# tail latency pushes back. Higher batch = higher utilization = lower $/token.

# Run N batched replicas behind one service. An HPA on queue depth
# (or GPU utilization) adds/removes replicas as load moves.
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-llama4 }
spec:
  replicas: 3
  selector: { matchLabels: { app: vllm-llama4 } }
  template:
    metadata: { labels: { app: vllm-llama4 } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-4-Scout-17B-16E-Instruct",
                 "--tensor-parallel-size", "2", "--max-num-seqs", "256"]
          resources:
            limits: { nvidia.com/gpu: 2 }
---
apiVersion: v1
kind: Service
metadata: { name: vllm-llama4 }
spec:
  selector: { app: vllm-llama4 }
  ports: [{ port: 80, targetPort: 8000 }]

# Your app calls the self-hosted endpoint exactly like a hosted API.
# Same OpenAI client, just a different base_url: zero application rewrite.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-llama4.internal/v1",  # your gateway, not a provider
    api_key="local-or-gateway-key",
)
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize this contract clause..."}],
)
print(resp.choices[0].message.content)

Day-2 operations: the cost that hides in the org chart
The Reddit skeptics who say self-hosting is pointless are wrong about the conclusion and right about the cause: the model is never the bottleneck, operations are. Day-1 is a docker command; day-2 is a rotation. You now own GPU driver and CUDA upgrades, serving-engine version bumps, autoscaling that doesn't thrash, capacity planning when traffic grows, and an eval harness that catches regressions when you swap a model. None of that existed when you called a hosted API, and all of it competes for the same scarce ML-platform headcount you were going to use to build product. When the workload is genuinely distributed, those same operational muscles overlap with what we describe in our writeup on multi-agent orchestration patterns, because the failure modes of a fleet of inference replicas rhyme with the failure modes of a fleet of agents.
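Of the day-2 duties listed above, the eval harness is the easiest to start small. The sketch below shows one possible regression gate for a model swap. The task names, scores, and 0.02 tolerance are invented for illustration, and how you produce the scores is left to your own eval set.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return tasks where the candidate is missing or worse than baseline beyond tolerance."""
    failures = []
    for task, base_score in baseline.items():
        cand_score = candidate.get(task)
        if cand_score is None or cand_score < base_score - tolerance:
            failures.append(task)
    return sorted(failures)


# Hypothetical per-task accuracy for the current model and a proposed replacement.
baseline = {"summarize": 0.81, "extract": 0.90, "classify": 0.95}
candidate = {"summarize": 0.83, "extract": 0.85, "classify": 0.94}

failed = regression_gate(baseline, candidate)
if failed:
    print("Block the rollout; regressions on:", failed)
else:
    print("Candidate passes the gate.")
```

Wiring a check like this into the deploy pipeline turns "swap a model" from a judgment call into a pass or fail step.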
Hosted vs self hosted LLM: a break-even you can actually compute
Most hosted vs self hosted LLM comparisons stop at a vibe. You can do better in about ten minutes with numbers you already have. The comparison below frames the two sides honestly, and the code under it computes the monthly token volume at which a dedicated GPU undercuts a per-token API, given your real utilization. The break-even is not a universal constant; it's a function of your hardware cost, your achieved utilization, and the hosted price you're comparing against. Plug your own figures in and the answer stops being a debate.
Hosted API: Pay per token, scale to zero automatically, zero ops headcount, frontier-quality models available immediately. You absorb a per-token premium and accept a third-party processor in your data path (mitigable with a zero-retention enterprise agreement). The right default until your volume is large and steady.
Self-hosted LLM: Pay per GPU-hour whether busy or idle, own the full ops surface, choose any open-weight model, keep the control boundary internal. Cheaper per token only at high sustained utilization. The right choice at large steady volume or under a hard no-third-party constraint, and a money pit otherwise.
def break_even_tokens_per_month(
    gpu_cost_per_hour: float,       # your reserved/on-demand $/GPU-hour
    tokens_per_sec_at_full: float,  # measured peak throughput of the engine
    utilization: float,             # MEASURED sustained util, 0.0-1.0 (be honest)
    hosted_price_per_mtok: float,   # the API you'd otherwise pay, $/M tokens
) -> float:
    hours_per_month = 730
    # Tokens you actually serve per month, given real utilization.
    tokens_per_month = tokens_per_sec_at_full * utilization * 3600 * hours_per_month
    self_host_monthly = gpu_cost_per_hour * hours_per_month
    # Monthly volume at which the hosted bill equals the fixed GPU bill.
    break_even = self_host_monthly / hosted_price_per_mtok * 1_000_000
    # Below break-even, hosted is cheaper; above it, self-host wins.
    if tokens_per_month <= break_even:
        return float("inf")  # at this utilization, the GPU never reaches break-even
    return break_even
# The lesson is in the util term: at 15% it often returns inf;
# at 90% the same hardware crosses break-even at a realistic volume.

Run that function with honest numbers and one of two things happens. Either it returns a finite token volume you're comfortably above, in which case self-hosting is a real saving and you should proceed. Or it returns infinity, which is the function's blunt way of telling you that at your achieved utilization, the dedicated GPU cannot serve enough tokens to be cheaper than the API, and you should stay hosted until your load shape changes. We've watched that single function end more self-hosting debates than any slide deck.
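A complementary way to read the same arithmetic is to solve for utilization instead of volume: how busy must the GPU be before it undercuts the API? The sketch below is an illustration with assumed inputs. The function name is introduced here, and the $3.00 hourly price, 2,500 tokens per second, and $1.00 per million tokens are round hypothetical figures, not benchmarks.

```python
def break_even_utilization(gpu_cost_per_hour: float,
                           tokens_per_sec_at_full: float,
                           hosted_price_per_mtok: float) -> float:
    """Sustained utilization at which self-hosted $/M tokens equals the hosted price.

    A result above 1.0 means the GPU cannot win even when fully saturated.
    """
    mtok_per_hour_at_full = tokens_per_sec_at_full * 3600 / 1_000_000
    # What the hosted API would charge for one fully saturated GPU-hour of tokens.
    hosted_cost_of_full_hour = mtok_per_hour_at_full * hosted_price_per_mtok
    return gpu_cost_per_hour / hosted_cost_of_full_hour


u = break_even_utilization(gpu_cost_per_hour=3.00,
                           tokens_per_sec_at_full=2500,
                           hosted_price_per_mtok=1.00)
print(f"Self-hosting wins only above {u:.0%} sustained utilization")
```

Comparing this threshold with the utilization you measured on shadow traffic gives the same verdict as the volume calculation, in the unit you can actually monitor.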
A decision framework for self hosting an LLM
Pull the threads together and the decision becomes a short sequence of gates rather than a coin flip. Start with the constraint check: is there a genuine, documented reason a hosted API is foreclosed, like a regulator that won't accept any third-party processor? If yes, self-hosting may be mandatory regardless of cost, and your job is to run it efficiently. If no, you're optimizing for money, and money is decided by utilization. Measure your sustained utilization on a shadow deployment, run the break-even, and only then commit hardware. The framework below is the order we actually walk it.
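The gate sequence described above can be sketched as a small function. The names and return strings are illustrative assumptions; the inputs are a yes or no on the documented constraint, your measured self-hosted cost per million tokens, and the hosted price you are comparing against.

```python
def recommend(hosted_api_foreclosed: bool,
              self_host_per_mtok: float,
              hosted_price_per_mtok: float) -> str:
    """Walk the gates in order: the constraint first, then cost at measured utilization."""
    # Gate 1: a documented constraint overrides the cost question entirely.
    if hosted_api_foreclosed:
        return "self-host (mandatory): focus on running it efficiently"
    # Gate 2: otherwise it is a cost decision, settled by measured numbers.
    if self_host_per_mtok < hosted_price_per_mtok:
        return "self-host: measured cost per token beats the API"
    return "stay hosted: re-measure when the load shape changes"


print(recommend(hosted_api_foreclosed=False,
                self_host_per_mtok=2.40,
                hosted_price_per_mtok=1.00))
```

The ordering is the point: cost is never consulted when the constraint gate fires, and hardware is never committed on an unmeasured utilization figure.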
One honest caveat closes the loop: even when the math favors self-hosting, the best open-weight model you can run will usually trail the closed frontier on the hardest reasoning and coding tasks, the same way model-architecture choices carry real trade-offs we cover in our deep dive on model architecture differences. A common, sane outcome is a hybrid: self-host the high-volume, well-understood workload where utilization is high, and route the rare hard query to a hosted frontier model. You get the cost win where volume is steady and the quality win where it's needed, without pretending one model has to do everything.
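A hybrid setup needs only a thin routing rule in front of the two endpoints. The sketch below is one possible shape, not a recommended classifier: the hosted URL and the set of "hard" task labels are hypothetical placeholders, and it assumes your application already tags each request with a task type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str


# Endpoints for illustration only; the hosted URL is a placeholder.
SELF_HOSTED = Backend("self-hosted", "http://vllm-llama4.internal/v1")
HOSTED_FRONTIER = Backend("hosted-frontier", "https://api.example.com/v1")

# Hypothetical labels for the rare queries that need frontier quality.
HARD_TASKS = frozenset({"multi_step_reasoning", "complex_codegen"})


def pick_backend(task_type: str) -> Backend:
    """Send known-hard task types to the hosted frontier model, the rest in-house."""
    return HOSTED_FRONTIER if task_type in HARD_TASKS else SELF_HOSTED


for task in ("summarize", "complex_codegen"):
    print(task, "->", pick_backend(task).name)
```

Because both backends speak the same OpenAI-compatible API, the caller only swaps `base_url`, and the steady bulk of traffic stays on the GPUs you are already paying for.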
FAQ: self hosting an LLM, in the practitioner's vocabulary
What is a self hosted LLM?
A self hosted LLM is an open-weight model running on infrastructure you control, served through an engine like vLLM or TGI that exposes an API your applications call. Unlike a hosted API such as AWS Bedrock, the control boundary sits inside your environment, so you own the GPUs, scaling, patching, and uptime. It is an infrastructure decision, not the same thing as running a private chat app locally.
When is self hosting an LLM cheaper than a hosted API?
Only at high sustained GPU utilization. A dedicated GPU bills the same per hour whether busy or idle, so your effective cost per token is the hourly price divided by tokens served. At 15-30% utilization, typical of a peak-provisioned first deployment, a self hosted LLM usually costs more per token than the hosted API. At 60-90% utilization with continuous batching, it undercuts mid-tier hosted models. Measure your real utilization before committing hardware.
What is the best serving engine for a self hosted LLM?
vLLM is the throughput default most production teams should start from, thanks to PagedAttention and continuous batching that keep utilization high. SGLang competes on high-concurrency and structured-output workloads, TGI is the mature HuggingFace-native path, and BentoML wraps any engine in deployment and autoscaling glue. Ollama is excellent for developer machines and prototypes but is not a multi-tenant production serving layer.
How much does it really cost to self-host an LLM?
Four drivers, not one. The GPU hours are the obvious and least interesting line. The hidden costs are operations (on-call, autoscaling, patching, eval regression), utilization (low utilization can multiply your effective per-token cost four to six times), and the model-update treadmill (re-eval and re-deploy every generation). Most teams price only the GPU and are surprised when the people and idle-capacity costs dominate the total.
Is open source LLM hosting more private than a hosted API?
Self-hosting removes the third-party processor entirely, which matters when a regulator or contract forbids any external processor. But a hosted API under a zero-retention enterprise agreement can satisfy many of the same compliance requirements. Privacy is a real reason to self-host only when the constraint genuinely forecloses a hosted option; otherwise it is often a preference dressed up as a requirement. Confirm the actual constraint before letting privacy drive the architecture.
Should a small team self-host an LLM?
Usually not. A self hosted LLM adds a day-2 operations load (driver upgrades, autoscaling, capacity planning, eval gates) that competes with the same scarce engineers who would otherwise build product. Unless you have a hard compliance constraint or genuinely large, steady volume, the hosted API you avoid is almost always cheaper than the ops headcount you take on. Start hosted, measure, and self-host only when the numbers and the constraint both point that way.
Self-host on evidence, not on a privacy slogan.
We size the GPU, measure the utilization, compute the break-even, and cost the ops load before anyone buys hardware. If you're weighing a self hosted LLM against a hosted API, talk to the team that has run this analysis across 200+ engagements.