# Self-Hosted LLM: An Architect's Guide to TCO and When It Beats an API

> A self hosted LLM is a GPU-utilization bet, not a privacy purchase. The real TCO drivers, serving stacks, a computable break-even, and when self-hosting beats an API.

**HTML version:** https://www.paiteq.com/blog/self-hosted-llm/
**Published:** 2026-06-11T02:50:22.459Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

Most writeups on self hosting llm infrastructure start with a docker command and end with a screenshot of a chat window. That's the easy part of the decision, and it's the part that fools teams into a budget they regret. The hard part is everything the chat window hides: a GPU that bills you whether or not anyone is using it, an on-call rotation that didn't exist before, and a model-update treadmill that a hosted API quietly runs for you. We've sat in the room when a CTO signs off on a cluster of H100s because privacy sounded non-negotiable, and we've watched the same cluster run far below break-even utilization a few months later, costing more per token than the API it replaced.
This is the decision guide we wish those teams had read first. It's written for the person who has to defend the infrastructure bill, not for the hobbyist running a private assistant on a gaming GPU. We'll define what a self hosted LLM actually is at the control boundary, show you the four cost drivers nobody quotes, prove that GPU utilization is the variable that decides the whole thing, compare the serving stacks you'd actually run in production, and give you a break-even you can compute with your own numbers. There's a runnable deployment near the end, and a decision framework that tells you, honestly, when to not self-host at all.

## What a self hosted LLM actually is, and what it is not

A self hosted LLM is an open-weight model that runs on infrastructure you control: your own cloud GPUs, a private VPC, or on-prem servers, with a serving engine in front of it that turns model weights into an API your applications can call. The defining property is the control boundary. With a hosted API like AWS Bedrock or Azure OpenAI, the boundary sits at the network edge: you send tokens out and pay per token, and the provider owns the GPUs, the scaling, the patching, and the uptime. When you self-host, that boundary moves inward to your own racks or instances, and everything inside it becomes your responsibility, including the parts that don't show up in a deployment tutorial.
It helps to be precise about what self hosting an LLM is not. It is not the same as running a private chat app on your laptop with Ollama, which is a genuinely useful workflow but answers a personal-productivity question, not an infrastructure one. It is not fine-tuning, though people often conflate the two; you can self-host a stock open-weight model with zero fine-tuning, and you can fine-tune a model you then serve through a hosted endpoint. And it is not automatically more private in a way that matters legally, because a hosted API under a zero-retention enterprise agreement can satisfy the same compliance requirement that pushes many teams toward open source LLM hosting in the first place. The diagram below shows where the boundary actually moves.

## When self-hosting an LLM beats an API (and when it does not)

Self-hosting wins in a narrower band of cases than the marketing implies, and it loses badly outside that band. The honest answer is that it depends on three things: how much volume you push, how predictable that volume is, and whether a specific compliance or latency constraint genuinely forecloses a hosted option. A team running a few million tokens a month with spiky traffic and no hard data-residency rule should almost never self-host, because they'll pay for idle GPUs around the clock to serve a load that a per-token API would have priced at coffee money. A team running tens of billions of tokens a month at steady utilization, or one under a regulator that won't accept any third-party processor, is exactly who self hosting an LLM was built for.
[The matrix below is how we frame this in a first working session with a client. Notice that the verdict rarely turns on the model and almost always turns on the load shape. This is the same lens we apply when an engagement starts with an honest AI readiness assessment: the workload decides the architecture, not the other way around. Read each row as a question about your real traffic, not your aspirational traffic.](/blog/ai-readiness-assessment/)

## The real TCO of a self hosted LLM: the four cost drivers nobody quotes

When a deployment guide says self-hosting is cheaper at high volume, it's quietly assuming away three of the four things that actually drive total cost of ownership. The GPU is the only line item most people price, and it's the one that's easiest to get right. The other three (operations, utilization, and the model-update treadmill) are where the budget goes to die. Here's the shape of the four drivers before we dig into the one that dominates.
The GPU line is the most quoted and the least interesting, because it's a list price you can look up. A single H100 80GB on-demand sits in a broad band depending on cloud and region, and reserved pricing cuts it by a third to a half if you commit for a year. What that number does not tell you is your cost per token, because cost per token is the GPU's hourly price divided by the tokens you actually serve in that hour. Serve a lot of tokens and the GPU is cheap; serve a few and the same GPU is ruinous. That ratio is utilization, and it's the next section because it's the single number that decides whether self hosting an LLM was a good idea.

## GPU economics: why utilization is the whole game

A dedicated GPU bills the same per hour whether it serves one request or saturates its batch. That single fact is the reason self-hosting is either a bargain or a disaster, with very little middle ground. Imagine a reserved H100 that costs you a fixed amount per hour. If it runs at 90% utilization, that hourly cost is spread across an enormous number of tokens and your amortized per-token price drops below what hosted mid-tier models like Claude Sonnet 4.6 or GPT-5 mini charge. If the same GPU runs at 15% utilization, which is exactly what happens when you provision for peak and your traffic is bursty, the identical hourly cost is spread across a sixth of the tokens, and your effective price per million tokens balloons four to six times above the hosted API you were trying to beat.
The bars above are relative, not absolute, because your actual numbers depend on the model size, the GPU, and your region. To make it concrete: in a 2026 internal sizing exercise on a 70B-class open-weight deployment, moving the same hardware from roughly 25% to 85% sustained utilization cut the effective price per million tokens by about 70%, with not one extra dollar of hardware spend. The shape is universal and it's the most important thing on this page: per-token cost is inversely proportional to utilization, so a self hosted LLM is a utilization bet first and a hardware purchase second. This is why continuous batching matters so much. A modern serving engine packs many concurrent requests into a single forward pass, which is the mechanism that lets you push utilization from the low range a naive deployment achieves up into the high range where the economics flip in your favor. Buy the GPU and forget the batching, and you've bought the expensive half of the deal without the half that pays for it.

> [!NOTE] (rich block: callout)

## The serving stack: vLLM, TGI, SGLang, Ollama, BentoML compared

The serving engine is the piece that turns weights into utilization, so it's not a detail you delegate to whatever the tutorial used. There are a handful that matter for a production self hosted LLM, and they trade off raw throughput against operational ergonomics. The table below is the honest version, including where each one is the wrong choice. The headline: vLLM is the throughput default most teams should start from, SGLang competes hard on structured-output-heavy and high-concurrency workloads, TGI is the well-trodden HuggingFace path, BentoML wraps any of them in deployment and autoscaling glue, and Ollama is for developer machines and prototypes, not for serving a team.
Which open-weight model you serve on that engine is a separate axis, and it moves faster than any table can keep up with. As of 2026 the workhorses are the Llama 4 family for general production, Mistral and Mixtral variants where you want strong quality at smaller footprints, and DeepSeek models where reasoning and coding matter and you're comfortable with the operational footprint of a larger model. The serving engine and the model are independent choices: pick the engine for throughput and ops fit, pick the model for quality on your eval set, and never let one tutorial bundle the two decisions for you.

## Reference architecture for a production self hosted LLM

A production self hosted LLM is not a single GPU with a model on it; it's a small distributed system with a few non-negotiable parts. In front sits a gateway that handles auth, rate limiting, and request logging, so that your applications talk to one stable endpoint instead of to raw inference replicas. Behind it, a router load-balances across a pool of serving replicas, each running your engine of choice on a GPU, with continuous batching keeping every replica busy. An autoscaler watches queue depth and adds or removes replicas, ideally with scale-to-zero for non-production tiers so idle environments stop billing. Around all of it, an observability layer captures latency, tokens, and errors per request. The SVG below is the shape we deploy from.

## A deployment you can stand up this week

The day-1 deployment really is straightforward, and it's worth doing precisely so you can measure utilization on real traffic before you commit to reserved capacity. The three tabs below stand up a vLLM server with an OpenAI-compatible API, scale it to a few replicas on Kubernetes, and call it from application code exactly as you'd call a hosted endpoint. Run this against a shadow copy of your real traffic for a week, watch the utilization, and let that number decide the reserved-instance question.

## Day-2 operations: the cost that hides in the org chart

The Reddit skeptics who say self-hosting is pointless are wrong about the conclusion and right about the cause: the model is never the bottleneck, operations are. Day-1 is a docker command; day-2 is a rotation. You now own GPU driver and CUDA upgrades, serving-engine version bumps, autoscaling that doesn't thrash, capacity planning when traffic grows, and an eval harness that catches regressions when you swap a model. None of that existed when you called a hosted API, and all of it competes for the same scarce ML-platform headcount you were going to use to build product. When the workload is genuinely distributed, those same operational muscles overlap with what we describe in our writeup on [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/), because the failure modes of a fleet of inference replicas rhyme with the failure modes of a fleet of agents.

## Hosted vs self hosted LLM: a break-even you can actually compute

Most hosted vs self hosted LLM comparisons stop at a vibe. You can do better in about ten minutes with numbers you already have. The comparison below frames the two sides honestly, and the code under it computes the monthly token volume at which a dedicated GPU undercuts a per-token API, given your real utilization. The break-even is not a universal constant; it's a function of your hardware cost, your achieved utilization, and the hosted price you're comparing against. Plug your own figures in and the answer stops being a debate.
Run that function with honest numbers and one of two things happens. Either it returns a finite token volume you're comfortably above, in which case self-hosting is a real saving and you should proceed. Or it returns infinity, which is the function's blunt way of telling you that at your achieved utilization, no token volume makes the dedicated GPU cheaper than the API, and you should stay hosted until your load shape changes. We've watched that single function end more self-hosting debates than any slide deck.

## A decision framework for self hosting an LLM

Pull the threads together and the decision becomes a short sequence of gates rather than a coin flip. Start with the constraint check: is there a genuine, documented reason a hosted API is foreclosed, like a regulator that won't accept any third-party processor? If yes, self-hosting may be mandatory regardless of cost, and your job is to run it efficiently. If no, you're optimizing for money, and money is decided by utilization. Measure your sustained utilization on a shadow deployment, run the break-even, and only then commit hardware. The framework below is the order we actually walk it.
One honest caveat closes the loop: even when the math favors self-hosting, the best open-weight model you can run will usually trail the closed frontier on the hardest reasoning and coding tasks, the same way model-architecture choices carry real trade-offs we cover in our deep dive on [model architecture differences](/blog/diffusion-vs-flow-models/). A common, sane outcome is a hybrid: self-host the high-volume, well-understood workload where utilization is high, and route the rare hard query to a hosted frontier model. You get the cost win where volume is steady and the quality win where it's needed, without pretending one model has to do everything.

## FAQ: self hosting an LLM, in the practitioner's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/
