Self-Hosted LLM: An Architect's Guide to TCO and When It Beats an API

A self hosted LLM is a GPU-utilization bet, not a privacy purchase. The real TCO drivers, serving stacks, a computable break-even, and when self-hosting beats an API.

Most writeups on self-hosting LLM infrastructure start with a docker command and end with a screenshot of a chat window. That's the easy part of the decision, and it's the part that fools teams into a budget they regret. The hard part is everything the chat window hides: a GPU that bills you whether or not anyone is using it, an on-call rotation that didn't exist before, and a model-update treadmill that a hosted API quietly runs for you. We've sat in the room when a CTO signs off on a cluster of H100s because privacy sounded non-negotiable, and we've watched the same cluster run far below break-even utilization a few months later, costing more per token than the API it replaced.

This is the decision guide we wish those teams had read first. It's written for the person who has to defend the infrastructure bill, not for the hobbyist running a private assistant on a gaming GPU. We'll define what a self hosted LLM actually is at the control boundary, show you the four cost drivers nobody quotes, prove that GPU utilization is the variable that decides the whole thing, compare the serving stacks you'd actually run in production, and give you a break-even you can compute with your own numbers. There's a runnable deployment near the end, and a decision framework that tells you, honestly, when to not self-host at all.

What a self hosted LLM actually is, and what it is not

A self hosted LLM is an open-weight model that runs on infrastructure you control: your own cloud GPUs, a private VPC, or on-prem servers, with a serving engine in front of it that turns model weights into an API your applications can call. The defining property is the control boundary. With a hosted API like AWS Bedrock or Azure OpenAI, the boundary sits at the network edge: you send tokens out and pay per token, and the provider owns the GPUs, the scaling, the patching, and the uptime. When you self-host, that boundary moves inward to your own racks or instances, and everything inside it becomes your responsibility, including the parts that don't show up in a deployment tutorial.

It helps to be precise about what self hosting an LLM is not. It is not the same as running a private chat app on your laptop with Ollama, which is a genuinely useful workflow but answers a personal-productivity question, not an infrastructure one. It is not fine-tuning, though people often conflate the two; you can self-host a stock open-weight model with zero fine-tuning, and you can fine-tune a model you then serve through a hosted endpoint. And it is not automatically more private in a way that matters legally, because a hosted API under a zero-retention enterprise agreement can satisfy the same compliance requirement that pushes many teams toward open source LLM hosting in the first place. The diagram below shows where the boundary actually moves.

Where the control boundary moves when you self-host
[Diagram: the control boundary. Hosted API: your app sits on your side of the boundary, which is the network edge; provider GPUs, scaling + uptime, and patching sit past it; you pay per token. Self hosted LLM: your app, serving engine, GPU pool, autoscaling, and patch + on-call all sit on your side of the line; you pay per GPU-hour.]
Hosted API: you own the application, the provider owns everything past the network edge. Self hosted LLM: the boundary moves inward, and the serving engine, GPU pool, autoscaling, and patching all become yours.

When self-hosting an LLM beats an API (and when it does not)

Self-hosting wins in a narrower band of cases than the marketing implies, and it loses badly outside that band. The honest answer is that it depends on three things: how much volume you push, how predictable that volume is, and whether a specific compliance or latency constraint genuinely forecloses a hosted option. A team running a few million tokens a month with spiky traffic and no hard data-residency rule should almost never self-host, because they'll pay for idle GPUs around the clock to serve a load that a per-token API would have priced at coffee money. A team running tens of billions of tokens a month at steady utilization, or one under a regulator that won't accept any third-party processor, is exactly who self hosting an LLM was built for.

The matrix below is how we frame this in a first working session with a client. Notice that the verdict rarely turns on the model and almost always turns on the load shape. This is the same lens we apply when an engagement starts with an honest AI readiness assessment: the workload decides the architecture, not the other way around. Read each row as a question about your real traffic, not your aspirational traffic.

| Your situation | Self-host wins | Hosted API wins | Why |
| --- | --- | --- | --- |
| Steady, high-volume inference | Yes: utilization stays high | No | Amortized GPU undercuts per-token billing |
| Spiky or low volume | No | Yes: pay only for tokens used | Idle GPUs bill 24/7 regardless of load |
| Hard data-residency / no third party | Yes: boundary stays internal | Only with zero-retention BAA | Self-host removes the processor entirely |
| Need frontier-level quality | Rarely | Yes | Best open-weight still trails the closed frontier on hard tasks |
| Tight, predictable tail latency | Yes: dedicated capacity | Depends on provider queue | Your hardware, your queue, no noisy neighbors |
| Small team, no infra muscle | No | Yes | Day-2 ops cost exceeds the API bill you avoided |
The verdict almost always turns on load shape and a hard constraint, not on which open-weight model is best this quarter.

The real TCO of a self hosted LLM: the four cost drivers nobody quotes

When a deployment guide says self-hosting is cheaper at high volume, it's quietly assuming away three of the four things that actually drive total cost of ownership. The GPU is the only line item most people price, and it's the one that's easiest to get right. The other three (operations, utilization, and the model-update treadmill) are where the budget goes to die. Here's the shape of the four drivers before we dig into the one that dominates.

The GPU line is the most quoted and the least interesting, because it's a list price you can look up. A single H100 80GB on-demand sits in a broad band depending on cloud and region, and reserved pricing cuts it by a third to a half if you commit for a year. What that number does not tell you is your cost per token, because cost per token is the GPU's hourly price divided by the tokens you actually serve in that hour. Serve a lot of tokens and the GPU is cheap; serve a few and the same GPU is ruinous. That ratio is utilization, and it's the next section because it's the single number that decides whether self hosting an LLM was a good idea.
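The division described above fits in a few lines. This is a minimal sketch, not a pricing model: the $2.50 per GPU-hour and the 2,500 tokens/s peak throughput are hypothetical placeholders rather than benchmark results, and `cost_per_million_tokens` is a name made up for this illustration.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective $/M tokens: hourly GPU price divided by tokens served that hour."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_cost_per_hour / (tokens_per_hour / 1_000_000)


# Hypothetical inputs, for illustration only. Substitute your own.
GPU_COST_PER_HOUR = 2.50       # assumed $/GPU-hour
PEAK_TOKENS_PER_SEC = 2500.0   # assumed measured peak throughput

for util in (0.15, 0.30, 0.60, 0.90):
    price = cost_per_million_tokens(GPU_COST_PER_HOUR, PEAK_TOKENS_PER_SEC, util)
    print(f"{util:.0%} utilization -> ${price:.2f} per million tokens")
```

The hourly price never changes across the loop; only the denominator does, which is why the same GPU can look cheap or ruinous.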

GPU economics: why utilization is the whole game

A dedicated GPU bills the same per hour whether it serves one request or saturates its batch. That single fact is the reason self-hosting is either a bargain or a disaster, with very little middle ground. Imagine a reserved H100 that costs you a fixed amount per hour. If it runs at 90% utilization, that hourly cost is spread across an enormous number of tokens and your amortized per-token price drops below what hosted mid-tier models like Claude Sonnet 4.6 or GPT-5 mini charge. If the same GPU runs at 15% utilization, which is exactly what happens when you provision for peak and your traffic is bursty, the identical hourly cost is spread across a sixth of the tokens, and your effective price per million tokens balloons four to six times above the hosted API you were trying to beat.

Effective cost per million tokens vs sustained GPU utilization
| Sustained utilization | Relative $/M-tok | Reading |
| --- | --- | --- |
| 15% (peak-provisioned, bursty) | 100 | Idle GPU, worst case: far above a hosted API |
| 30% (typical first deployment) | 52 | Still usually above mid-tier hosted |
| 60% (well-batched, steady) | 26 | Break-even territory vs hosted mid-tier |
| 90% (saturated, continuous batching) | 17 | Where self-hosting actually pays off |

The bars above are relative, not absolute, because your actual numbers depend on the model size, the GPU, and your region. To make it concrete: in a 2026 internal sizing exercise on a 70B-class open-weight deployment, moving the same hardware from roughly 25% to 85% sustained utilization cut the effective price per million tokens by about 70%, with not one extra dollar of hardware spend. The shape is universal and it's the most important thing on this page: per-token cost is inversely proportional to utilization, so a self hosted LLM is a utilization bet first and a hardware purchase second. This is why continuous batching matters so much. A modern serving engine packs many concurrent requests into a single forward pass, which is the mechanism that lets you push utilization from the low range a naive deployment achieves up into the high range where the economics flip in your favor. Buy the GPU and forget the batching, and you've bought the expensive half of the deal without the half that pays for it.
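The inverse relationship can be checked with ratios alone, no dollar figures required. The sketch below assumes per-token cost scales exactly as 1/utilization (the idealized version of the claim above); the helper names are ours, for illustration.

```python
def relative_cost(utilization: float, baseline_utilization: float) -> float:
    """Per-token cost at `utilization` as a multiple of the cost at the baseline."""
    return baseline_utilization / utilization


def percent_saving(util_before: float, util_after: float) -> float:
    """Percent drop in per-token cost when utilization rises on the same hardware."""
    return (1.0 - util_before / util_after) * 100.0


# Moving from 25% to 85% sustained utilization on unchanged hardware.
print(f"saving: {percent_saving(0.25, 0.85):.1f}%")

# Index the 15% case to 100 and see how the other utilization levels compare.
for util in (0.15, 0.30, 0.60, 0.90):
    print(f"{util:.0%}: {100 * relative_cost(util, 0.15):.0f}")
```

Under this idealized model the 25% to 85% move works out to a saving of roughly 70%, and the indexed values land close to the relative figures quoted for 15%, 30%, 60%, and 90%.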

The serving stack: vLLM, TGI, SGLang, Ollama, BentoML compared

The serving engine is the piece that turns weights into utilization, so it's not a detail you delegate to whatever the tutorial used. There are a handful that matter for a production self hosted LLM, and they trade off raw throughput against operational ergonomics. The table below is the honest version, including where each one is the wrong choice. The headline: vLLM is the throughput default most teams should start from, SGLang competes hard on structured-output-heavy and high-concurrency workloads, TGI is the well-trodden HuggingFace path, BentoML wraps any of them in deployment and autoscaling glue, and Ollama is for developer machines and prototypes, not for serving a team.

| Engine | Best for | Strength | Where it is the wrong choice |
| --- | --- | --- | --- |
| vLLM | Most production serving | PagedAttention + continuous batching = high utilization | Bleeding-edge model formats land slightly later |
| SGLang | High-concurrency, structured output | Fast constrained decoding, strong throughput | Smaller community than vLLM for edge cases |
| TGI | HuggingFace-native shops | Mature, well-documented, easy model pulls | Throughput trails vLLM on some workloads |
| BentoML | Packaging + autoscaling any engine | Deployment glue, scale-to-zero, ops surface | Not an inference kernel itself: it wraps one |
| Ollama | Dev machines, prototypes, demos | One-command local runs, great DX | Not a multi-tenant production serving layer |
The serving engines worth running in production, and where each one is the wrong tool.
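To see why continuous batching shows up in the strength column, here is a toy model, not how any real engine schedules work. It assumes each request occupies one batch slot for a number of decode steps equal to its output length, ignores memory limits and prefill entirely, and uses made-up output lengths. Static batching holds a whole batch until its longest request finishes; continuous batching refills a slot the moment its request completes.

```python
import heapq


def static_batch_utilization(lengths: list[int], slots: int) -> float:
    """Slot utilization when each batch waits for its longest request."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])  # batch runs as long as its slowest member
    return sum(lengths) / (slots * steps)


def continuous_batch_utilization(lengths: list[int], slots: int) -> float:
    """Slot utilization when a freed slot is refilled immediately."""
    free_at = [0] * slots  # step at which each slot next becomes free
    heapq.heapify(free_at)
    for length in lengths:
        start = heapq.heappop(free_at)      # earliest available slot
        heapq.heappush(free_at, start + length)
    makespan = max(free_at)
    return sum(lengths) / (slots * makespan)


# Hypothetical output lengths (decode steps): mostly short, a few long.
OUTPUT_LENGTHS = [10, 200, 15, 20, 180, 12, 25, 30]
SLOTS = 4

print(f"static:     {static_batch_utilization(OUTPUT_LENGTHS, SLOTS):.0%}")
print(f"continuous: {continuous_batch_utilization(OUTPUT_LENGTHS, SLOTS):.0%}")
```

Even in this simplified setting, the short requests no longer sit idle behind the long ones, which is the effect that lifts sustained utilization.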

Which open-weight model you serve on that engine is a separate axis, and it moves faster than any table can keep up with. As of 2026 the workhorses are the Llama 4 family for general production, Mistral and Mixtral variants where you want strong quality at smaller footprints, and DeepSeek models where reasoning and coding matter and you're comfortable with the operational footprint of a larger model. The serving engine and the model are independent choices: pick the engine for throughput and ops fit, pick the model for quality on your eval set, and never let one tutorial bundle the two decisions for you.

Reference architecture for a production self hosted LLM

A production self hosted LLM is not a single GPU with a model on it; it's a small distributed system with a few non-negotiable parts. In front sits a gateway that handles auth, rate limiting, and request logging, so that your applications talk to one stable endpoint instead of to raw inference replicas. Behind it, a router load-balances across a pool of serving replicas, each running your engine of choice on a GPU, with continuous batching keeping every replica busy. An autoscaler watches queue depth and adds or removes replicas, ideally with scale-to-zero for non-production tiers so idle environments stop billing. Around all of it, an observability layer captures latency, tokens, and errors per request. The diagram below is the shape we deploy from.

Production self hosted LLM reference architecture
[Diagram: self hosted LLM reference architecture. Client apps (OpenAI-compatible) -> gateway (auth, rate limit, request log) -> router (load balance) -> three vLLM replicas (GPU + batching). An autoscaler (queue depth, scale-to-zero) manages the replicas, and an observability layer (Langfuse) records per-request latency, tokens, and errors.]
Gateway for auth and rate limiting, a router across batched serving replicas on a GPU pool, an autoscaler driven by queue depth, and an observability layer (Langfuse) capturing per-request latency and tokens. Each box is a place cost or reliability can leak.
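The autoscaler box reduces to one decision: how many replicas does the current backlog need? The sketch below is one plausible way to express it, assuming you can read queue depth and in-flight request counts from your gateway or engine metrics. The function name, the per-replica target of 64, and the replica bounds are all illustrative assumptions, and a production autoscaler would add cooldowns so it does not thrash.

```python
import math


def desired_replicas(queue_depth: int,
                     in_flight: int,
                     target_per_replica: int,
                     min_replicas: int = 0,
                     max_replicas: int = 8) -> int:
    """Replica count that keeps outstanding requests per replica near the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    outstanding = queue_depth + in_flight
    if outstanding == 0:
        return min_replicas  # min_replicas=0 gives scale-to-zero for idle tiers
    needed = math.ceil(outstanding / target_per_replica)
    floor = max(min_replicas, 1)  # any outstanding work needs at least one replica
    return max(floor, min(needed, max_replicas))


# Hypothetical readings: 100 queued + 50 in flight, 64 requests per replica.
print(desired_replicas(queue_depth=100, in_flight=50, target_per_replica=64))
```

Setting `min_replicas=1` for production and `0` for staging is one way to get the scale-to-zero behavior on non-production tiers without risking cold starts where they hurt.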

A deployment you can stand up this week

The day-1 deployment really is straightforward, and it's worth doing precisely so you can measure utilization on real traffic before you commit to reserved capacity. The three files below stand up a vLLM server with an OpenAI-compatible API, scale it to a few replicas on Kubernetes, and call it from application code exactly as you'd call a hosted endpoint. Run this against a shadow copy of your real traffic for a week, watch the utilization, and let that number decide the reserved-instance question.

serve.sh bash
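# Stand up an OpenAI-compatible self hosted LLM with vLLM on two GPUs
# (--tensor-parallel-size 2 shards the model across both).
# Scout is a 109B-parameter MoE: confirm the weights fit your GPUs' combined
# memory, or raise --tensor-parallel-size / use a quantized checkpoint.
# Continuous batching is on by default; this is what drives utilization.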
pip install vllm

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# --max-num-seqs is your batch ceiling: raise it until GPU memory or
# tail latency pushes back. Higher batch = higher utilization = lower $/token.
deploy.yaml yaml
# Run N batched replicas behind one service. An HPA on queue depth
# (or GPU utilization) would add/remove replicas as load moves; it is
# not defined in this manifest.
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-llama4 }
spec:
  replicas: 3
  selector: { matchLabels: { app: vllm-llama4 } }
  template:
    metadata: { labels: { app: vllm-llama4 } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-4-Scout-17B-16E-Instruct",
                 "--tensor-parallel-size", "2", "--max-num-seqs", "256"]
          resources:
            limits: { nvidia.com/gpu: 2 }
---
apiVersion: v1
kind: Service
metadata: { name: vllm-llama4 }
spec:
  selector: { app: vllm-llama4 }
  ports: [{ port: 80, targetPort: 8000 }]
call.py python
# Your app calls the self-hosted endpoint exactly like a hosted API.
# Same OpenAI client, just a different base_url — zero application rewrite.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-llama4.internal/v1",  # your gateway, not a provider
    api_key="local-or-gateway-key",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize this contract clause..."}],
)
print(resp.choices[0].message.content)
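Once the shadow deployment has run for a while, "watch the utilization" can be made concrete from request logs. The sketch below assumes you can export tokens served per hour and that you have measured the engine's peak throughput; the hourly counts and the 2,500 tokens/s peak are invented numbers for illustration.

```python
def sustained_utilization(tokens_per_hour: list[int],
                          peak_tokens_per_sec: float) -> float:
    """Fraction of measured peak capacity actually used across the whole window."""
    if not tokens_per_hour or peak_tokens_per_sec <= 0:
        raise ValueError("need at least one hour of data and a positive peak")
    capacity = peak_tokens_per_sec * 3600 * len(tokens_per_hour)
    return sum(tokens_per_hour) / capacity


# Hypothetical day: 8 busy hours, 16 quiet ones.
PEAK_TOKENS_PER_SEC = 2500.0
hourly_tokens = [6_300_000] * 8 + [450_000] * 16

busiest_hour = max(hourly_tokens) / (PEAK_TOKENS_PER_SEC * 3600)
overall = sustained_utilization(hourly_tokens, PEAK_TOKENS_PER_SEC)
print(f"busiest hour: {busiest_hour:.0%}, sustained: {overall:.0%}")
```

The gap between the busiest hour and the sustained figure is the peak-provisioning trap in miniature: the dashboard looks healthy at midday while the number that sets your cost per token is far lower.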

Day-2 operations: the cost that hides in the org chart

The Reddit skeptics who say self-hosting is pointless are wrong about the conclusion and right about the cause: the model is never the bottleneck, operations are. Day-1 is a docker command; day-2 is a rotation. You now own GPU driver and CUDA upgrades, serving-engine version bumps, autoscaling that doesn't thrash, capacity planning when traffic grows, and an eval harness that catches regressions when you swap a model. None of that existed when you called a hosted API, and all of it competes for the same scarce ML-platform headcount you were going to use to build product. When the workload is genuinely distributed, those same operational muscles overlap with what we describe in our writeup on multi-agent orchestration patterns, because the failure modes of a fleet of inference replicas rhyme with the failure modes of a fleet of agents.
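Of the day-2 items, the eval harness is the easiest to defer and the most painful to skip. A minimal regression gate, assuming you already score each model on a fixed set of tasks, might look like the sketch below. The task names, scores, and the 0.02 tolerance are hypothetical.

```python
def regression_gate(baseline_scores: dict[str, float],
                    candidate_scores: dict[str, float],
                    max_drop: float = 0.02) -> list[str]:
    """Return the tasks where the candidate is missing or drops more than max_drop."""
    failures = []
    for task, base in baseline_scores.items():
        cand = candidate_scores.get(task)
        if cand is None or base - cand > max_drop:
            failures.append(task)
    return failures


# Hypothetical eval results for the current model and a proposed swap.
baseline = {"summarize": 0.81, "extract": 0.90, "classify": 0.95}
candidate = {"summarize": 0.84, "extract": 0.85, "classify": 0.94}

failed = regression_gate(baseline, candidate)
print("blocked on:" if failed else "safe to roll out", failed)
```

Wiring a check like this into the deploy pipeline turns "swap the model" from a judgment call into a gate that either passes or names the tasks that regressed.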

Hosted vs self hosted LLM: a break-even you can actually compute

Most hosted vs self hosted LLM comparisons stop at a vibe. You can do better in about ten minutes with numbers you already have. The comparison below frames the two sides honestly, and the code under it computes the monthly token volume at which a dedicated GPU undercuts a per-token API, given your real utilization. The break-even is not a universal constant; it's a function of your hardware cost, your achieved utilization, and the hosted price you're comparing against. Plug your own figures in and the answer stops being a debate.

Hosted API (Bedrock, Azure OpenAI, frontier vendors)

Pay per token, scale to zero automatically, zero ops headcount, frontier-quality models available immediately. You absorb a per-token premium and accept a third-party processor in your data path (mitigable with a zero-retention enterprise agreement). The right default until your volume is large and steady.

Self hosted LLM (vLLM/TGI on your GPUs)

Pay per GPU-hour whether busy or idle, own the full ops surface, choose any open-weight model, keep the control boundary internal. Cheaper per token only at high sustained utilization. The right choice at large steady volume or under a hard no-third-party constraint, and a money pit otherwise.

break_even.py python
def break_even_tokens_per_month(
    gpu_cost_per_hour: float,      # your reserved/on-demand $/GPU-hour
    tokens_per_sec_at_full: float, # measured peak throughput of the engine
    utilization: float,            # MEASURED sustained util, 0.0-1.0 (be honest)
    hosted_price_per_mtok: float,  # the API you'd otherwise pay, $/M tokens
) -> float:
    hours_per_month = 730
    # Tokens you actually serve in a month, given real utilization.
    tokens_per_month = tokens_per_sec_at_full * utilization * 3600 * hours_per_month
    self_host_monthly = gpu_cost_per_hour * hours_per_month
    if tokens_per_month <= 0:
        return float("inf")  # an idle GPU never beats a per-token API
    # $/M-token you actually pay self-hosting:
    self_host_per_mtok = self_host_monthly / (tokens_per_month / 1_000_000)
    if self_host_per_mtok >= hosted_price_per_mtok:
        return float("inf")  # at this utilization, self-hosting never wins
    # Break-even volume: where the hosted bill equals the fixed GPU bill.
    # Below it, hosted is cheaper; above it, self-host wins.
    return self_host_monthly / hosted_price_per_mtok * 1_000_000

# The lesson is in the util term: at 15% it often returns inf;
# at 90% the same hardware crosses break-even at a realistic volume.
Plug in your GPU hourly cost, achieved utilization, and the hosted price you'd otherwise pay. The function returns the monthly token volume above which self-hosting is cheaper, or infinity if your utilization never gets you there. Utilization is the dominant term: halve it and your effective cost per token doubles.

Run that function with honest numbers and one of two things happens. Either it returns a finite token volume you're comfortably above, in which case self-hosting is a real saving and you should proceed. Or it returns infinity, which is the function's blunt way of telling you that at your achieved utilization, no token volume makes the dedicated GPU cheaper than the API, and you should stay hosted until your load shape changes. We've watched that single function end more self-hosting debates than any slide deck.
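The same comparison can be turned around to ask what utilization you would need to hit. The sketch below sets the self-hosted cost per million tokens equal to the hosted price and solves for utilization; the function name and the example prices are hypothetical, and it ignores ops headcount entirely.

```python
def break_even_utilization(gpu_cost_per_hour: float,
                           tokens_per_sec_at_full: float,
                           hosted_price_per_mtok: float) -> float:
    """Sustained utilization at which self-hosted $/M tokens equals the hosted price.

    A result above 1.0 means the GPU cannot win even when fully saturated.
    """
    mtok_per_hour_at_full = tokens_per_sec_at_full * 3600 / 1_000_000
    return gpu_cost_per_hour / (mtok_per_hour_at_full * hosted_price_per_mtok)


# Hypothetical figures: $2.50/GPU-hour, 2,500 tokens/s peak throughput.
for hosted_price in (0.60, 0.25):
    needed = break_even_utilization(2.50, 2500.0, hosted_price)
    verdict = "reachable" if needed <= 1.0 else "never wins"
    print(f"hosted ${hosted_price:.2f}/M tok -> need {needed:.0%} utilization ({verdict})")
```

Comparing this threshold with the utilization you measured on the shadow deployment gives the same answer as the break-even function, expressed as a target the platform team can track.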

A decision framework for self hosting an LLM

Pull the threads together and the decision becomes a short sequence of gates rather than a coin flip. Start with the constraint check: is there a genuine, documented reason a hosted API is foreclosed, like a regulator that won't accept any third-party processor? If yes, self-hosting may be mandatory regardless of cost, and your job is to run it efficiently. If no, you're optimizing for money, and money is decided by utilization. Measure your sustained utilization on a shadow deployment, run the break-even, and only then commit hardware. The framework below is the order we actually walk it.

The self hosting decision, in the order to ask it
1. Hard constraint? No third party / residency.
2. Measure util: shadow deploy for 1 week.
3. Run break-even: your real numbers.
4. Cost ops load: on-call + updates.
5. Decide: self-host or stay hosted.
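Written as code, the gates above become a short function. This is a sketch of the ordering rather than a policy engine: the parameter names are ours, and the ops-cost comparison assumes you can put a monthly dollar figure on both the on-call load and the savings over the hosted bill.

```python
def decide(hard_constraint: bool,
           measured_utilization: float,
           break_even_utilization: float,
           monthly_saving_vs_hosted: float,
           monthly_ops_cost: float) -> str:
    """Walk the gates in order and return 'self-host' or 'stay hosted'."""
    # Gate 1: a documented constraint that forecloses hosted makes cost secondary.
    if hard_constraint:
        return "self-host"
    # Gates 2 and 3: measured utilization must clear the break-even point.
    if measured_utilization < break_even_utilization:
        return "stay hosted"
    # Gate 4: the saving must outweigh the on-call and update load.
    if monthly_saving_vs_hosted <= monthly_ops_cost:
        return "stay hosted"
    return "self-host"


# Hypothetical inputs: no hard constraint, 70% measured vs 46% needed,
# $40k/month saved on tokens against $25k/month of ops load.
print(decide(False, 0.70, 0.46, 40_000, 25_000))
```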

One honest caveat closes the loop: even when the math favors self-hosting, the best open-weight model you can run will usually trail the closed frontier on the hardest reasoning and coding tasks, the same way model-architecture choices carry real trade-offs we cover in our deep dive on model architecture differences. A common, sane outcome is a hybrid: self-host the high-volume, well-understood workload where utilization is high, and route the rare hard query to a hosted frontier model. You get the cost win where volume is steady and the quality win where it's needed, without pretending one model has to do everything.
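A hybrid setup comes down to a routing decision in front of two OpenAI-compatible endpoints. The sketch below is illustrative only: the hosted URL, the frontier model name, and the set of task types treated as hard are placeholders, and a real router would more likely classify by eval results than by a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str
    model: str


SELF_HOSTED = Backend(
    name="self-hosted",
    base_url="http://vllm-llama4.internal/v1",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
)
HOSTED_FRONTIER = Backend(
    name="hosted-frontier",
    base_url="https://api.example.com/v1",  # placeholder, not a real provider URL
    model="frontier-model-placeholder",     # placeholder model name
)

# Hypothetical task types you have found the open-weight model struggles with.
HARD_TASKS = {"multi_step_reasoning", "code_generation"}


def pick_backend(task_type: str, force_frontier: bool = False) -> Backend:
    """Send rare hard queries to the hosted frontier model, the rest in-house."""
    if force_frontier or task_type in HARD_TASKS:
        return HOSTED_FRONTIER
    return SELF_HOSTED


print(pick_backend("summarization").name)
print(pick_backend("code_generation").name)
```

Because both backends speak the same API shape, the application only swaps `base_url` and `model` per request; the steady bulk of traffic keeps the self-hosted GPUs busy while the hard tail pays per token.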

FAQ: self hosting an LLM, in the practitioner's vocabulary

What is a self hosted LLM?

A self hosted LLM is an open-weight model running on infrastructure you control, served through an engine like vLLM or TGI that exposes an API your applications call. Unlike a hosted API such as AWS Bedrock, the control boundary sits inside your environment, so you own the GPUs, scaling, patching, and uptime. It is an infrastructure decision, not the same thing as running a private chat app locally.

When is self hosting an LLM cheaper than a hosted API?

Only at high sustained GPU utilization. A dedicated GPU bills the same per hour whether busy or idle, so your effective cost per token is the hourly price divided by tokens served. At 15-30% utilization, typical of a peak-provisioned first deployment, a self hosted LLM usually costs more per token than the hosted API. At 60-90% utilization with continuous batching, it undercuts mid-tier hosted models. Measure your real utilization before committing hardware.

What is the best serving engine for a self hosted LLM?

vLLM is the throughput default most production teams should start from, thanks to PagedAttention and continuous batching that keep utilization high. SGLang competes on high-concurrency and structured-output workloads, TGI is the mature HuggingFace-native path, and BentoML wraps any engine in deployment and autoscaling glue. Ollama is excellent for developer machines and prototypes but is not a multi-tenant production serving layer.

How much does it really cost to self-host an LLM?

Four drivers, not one. The GPU hours are the obvious and least interesting line. The hidden costs are operations (on-call, autoscaling, patching, eval regression), utilization (low utilization can multiply your effective per-token cost four to six times), and the model-update treadmill (re-eval and re-deploy every generation). Most teams price only the GPU and are surprised when the people and idle-capacity costs dominate the total.

Is open source LLM hosting more private than a hosted API?

Self-hosting removes the third-party processor entirely, which matters when a regulator or contract forbids any external processor. But a hosted API under a zero-retention enterprise agreement can satisfy many of the same compliance requirements. Privacy is a real reason to self-host only when the constraint genuinely forecloses a hosted option; otherwise it is often a preference dressed up as a requirement. Confirm the actual constraint before letting privacy drive the architecture.

Should a small team self-host an LLM?

Usually not. A self hosted LLM adds a day-2 operations load (driver upgrades, autoscaling, capacity planning, eval gates) that competes with the same scarce engineers who would otherwise build product. Unless you have a hard compliance constraint or genuinely large, steady volume, the hosted API you avoid is almost always cheaper than the ops headcount you take on. Start hosted, measure, and self-host only when the numbers and the constraint both point that way.

