# LLM gateway: what it does and when you need one

> An infra buyer's guide to the LLM gateway: routing, fallback, cost control, guardrails and observability, the build-vs-buy decision, and when you don't need one.

**HTML version:** https://www.paiteq.com/blog/llm-gateway/
**Published:** 2026-06-10T04:36:41.556Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

Every team that runs more than one large language model in production eventually draws the same box on its architecture diagram: a single layer sitting between the application and the providers, with every model call passing through it. That box is the **llm gateway**. The search results for the term are a wall of vendors explaining the concept right up to the point where the explanation becomes "...and that's why you should buy ours." We don't sell a gateway. We build LLM systems, and we've put gateways in front of them, watched them become the most load-bearing box on the diagram, and ripped a few out when they turned out to be solving a problem the team didn't have yet.
This is the guide we wish existed when we were making that build-versus-buy call: what an LLM gateway actually does, how a request moves through one, where it earns its keep on cost and governance, and the honest answer to the question every vendor page avoids, which is when you don't need one at all. The keyword has a $18 cost-per-click, which tells you the audience reading it is evaluating a purchase. So we'll treat you like a buyer, not a lead.

## LLM gateway in one paragraph — and the one question it's really answering

An LLM gateway is a proxy that sits between your applications and one or more model providers and gives you a single, consistent API in front of all of them. Instead of each service holding its own OpenAI key, its own Anthropic key, and its own retry logic, every call goes to the gateway, which authenticates it, decides which model should serve it, optionally checks a cache, applies guardrails, forwards the request to the provider, and logs the whole thing. The pattern is borrowed wholesale from the API gateway that's sat in front of web services for a decade: cross-cutting concerns like authentication and rate limiting don't belong inside every app, so you pull them out into one box.
The question an LLM gateway is really answering isn't "how do I call a model," because the provider SDKs do that fine. It's "how do I run model traffic across a growing number of providers, teams, and products without re-implementing routing, fallback, cost tracking, and safety in every codebase, and without being married to a single vendor." If you only ever call one model from one provider, that question doesn't bite yet. The moment you have a second provider or a second team, it does.

## What an LLM gateway actually does: the five jobs under the hood

Strip away the marketing and a gateway does five jobs. Most products bundle all five; the cheaper open-source proxies do the first two well and bolt the rest on. Knowing which jobs you actually need is the whole build-versus-buy decision, so it's worth being precise about each.
Job one is access unification: present an OpenAI-compatible API so every downstream service speaks one dialect, while the gateway translates to whatever each provider expects. Job two is routing and fallback, which means picking a model per request and failing over to another when the first errors or times out. Job three is cost control: metering spend per team, caching responses, and routing cheap workloads to cheap models. Job four is governance, covering auth, rate limits, prompt-injection and PII guardrails, and audit logging. Job five is observability: the traces, token counts, and latency you can actually attribute. The phrase *ai gateway* usually means a gateway that does all five plus tool and agent coordination, a distinction we come back to later.

## The request path: how a call moves through an LLM gateway

It helps to follow a single request end to end. A chat completion leaves your service with the gateway's API key, not the provider's. The gateway authenticates the caller and checks its budget. It applies the routing rule — maybe "send this customer-support traffic to Claude Sonnet 4.6, fall back to GPT-5 mini." Before it spends a token it checks the cache: an exact-match cache for identical prompts, or a semantic cache that embeds the prompt and looks for a near-duplicate. On a miss, input guardrails scan for prompt injection and PII, the request goes to the provider, output guardrails scan the response, and the whole trace — model, tokens, latency, cost, cache status — lands in the log before the answer streams back to your service.

## Routing and fallback: the core of every LLM gateway

Routing is the feature people actually buy a gateway for. *llm routing* comes in a few flavors. Static routing pins a workload to a model. Fallback routing tries a primary and drops to a secondary whenever the first errors or hits a rate limit. Latency-based routing races providers or picks the fastest healthy one. Cost-based routing sends cheap, low-stakes calls to a small model and reserves the frontier model for the hard ones. The mistake teams make is treating routing as a model-quality decision when it's really a reliability decision: the reason you want fallback isn't that GPT-5 is better than Claude on Tuesday. It's that providers have outages and rate limits, and a gateway turns a hard provider failure into a soft model swap.
Here's the thing every vendor page glosses over: fallback only helps if your downstream code is provider-agnostic. If your prompts are tuned to one model's quirks, a fallback to a different family can degrade quality silently. The same advice applies to the broader case where you're already routing multi-agent traffic; we cover how that traffic flows in our piece on [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/), and the gateway is where that orchestration's reliability lives. Test your fallback path under load before you trust it.

## Cost control: how an LLM gateway actually saves money (with the math)

"Cost control" is on every gateway's feature list and quantified on almost none of them, so let's do the math. A gateway saves money three ways, in rough order of impact. First, model routing: sending a classifier-grade or extraction workload from a frontier model to a small hosted model (Haiku 4.5 or a GPT-5-nano-class model) cuts per-call token cost by roughly an order of magnitude at 2026 pricing. That's the single biggest lever the gateway exposes, and it costs nothing to pull beyond writing the route. Second, semantic caching: on high-repeat traffic like FAQ-style support, a semantic cache routinely returns 20–40% hits, and every hit is a provider call you simply don't make. Third, hard budget enforcement: per-team quotas that actually cut off a runaway loop before it bills you for a million tokens.
Those bars are illustrative ratios, not a promise — your numbers depend entirely on your traffic mix, and a gateway that can't tell you your real cache-hit rate and per-team spend isn't doing the cost job at all. Note what we're not quoting: any engagement price. Per-token and cache-hit math is the right way to anchor a buying decision; a dollar tier on a blog isn't. If you're sizing this kind of decision across multiple teams, our [AI automation buyer's guide](/blog/ai-automation-solutions-buyers-guide/) walks through the cost-attribution questions to ask before you centralize anything.

## Guardrails and governance: the security case for an LLM gateway

The governance argument is the one that gets a gateway funded in regulated shops, and it's genuinely strong. When every model call passes through one box, you get one place to enforce four things that are miserable to enforce per-application: input guardrails that scan for prompt injection and strip PII before it reaches a third-party provider; output guardrails that catch leaked secrets or policy violations; centralized audit logging that records who called what model with what data; and key management so no provider key ever lives in an app's environment. A gateway turns "we think every team handles PII correctly" into "the gateway guarantees it," which is the difference an auditor cares about.

> [!NOTE] (rich block: callout)

## Observability: what an LLM gateway gives you that the provider dashboard won't

Provider dashboards show you the provider's view: total tokens, total spend, nothing about which of your features or teams drove it. A gateway sits at the chokepoint, so it can attribute every call (by team, by feature, by route, by user) and emit traces you can ship to Langfuse, LangSmith, or your existing OpenTelemetry pipeline into Datadog. That's the difference between "we spent a lot on OpenAI last month" and "the support summarizer is 60% of spend and 90% of it is cacheable." The cost of buying this is a small, measurable latency tax, which you should benchmark rather than assume away.

## LLM gateway vs AI gateway vs MCP gateway: clearing up the confusion

Google's own "people also ask" surfaces "what's the difference between MCP and LLM gateway," which tells you the terms are a mess. Here's the clean version. An LLM gateway routes access to models: completions in, completions out, plus routing, caching, guardrails, and cost tracking. An MCP gateway routes access to tools and data — it sits in front of Model Context Protocol servers and manages which agents can reach which tools. An AI gateway is the umbrella term that's drifting to mean "both, plus agent coordination" — it routes to models and tools and orchestrates the traffic between them. They operate at different layers, and a serious platform often runs more than one.
Practically: most teams start needing an LLM gateway first, because providers multiply before tools do. *llm orchestration* — the agent-to-agent and tool layer — is a later problem, and conflating it with model access early just means you buy a heavier product than you need.

## Build vs buy: should you run LiteLLM, buy a gateway, or write your own?

This is the decision the whole post builds to, and the honest answer is that it's decided by your failover and guardrail requirements, not your traffic volume. There are three real options. Write your own: a couple hundred lines of OpenAI-compatible proxy handles routing, fallback, and key management, and you own every line, which is great for a small surface and painful once you want caching and guardrails. Self-host the mature open source: LiteLLM is the most widely adopted OSS proxy, supports 100+ providers, and has a real admin UI; you run it, you patch it, you own the uptime. Buy managed: Portkey or TrueFoundry give you the full feature set with an SLA, at the cost of a vendor in your hot path and a per-request fee.

## The single point of failure problem: designing an LLM gateway to fail open

Here's the part that doesn't make it onto the feature lists. A gateway sits in the hot path of every model call, which means it's now a single point of failure for your entire AI surface. Teams discover this the first time the gateway's own control plane has an outage and every product that touches AI goes dark at once — not because a provider failed, but because the box in front of the providers did. The gateway you adopted for reliability just became your reliability ceiling.
A gateway worth running is designed to fail open. That means the data plane keeps forwarding requests even when the control plane (config, dashboards, billing) is degraded; clients have a documented bypass that calls the provider directly when the gateway is unreachable; timeouts are aggressive so a slow gateway doesn't cascade; and you've actually run the game-day where you kill the gateway and confirm traffic still flows. Fail-closed is the default for most products, and it's the wrong default for infrastructure in your hot path. If a gateway can't fail open, it's a liability dressed as infrastructure.

## A minimal LLM gateway you can stand up this week

If you want to feel what a gateway does before you commit to a vendor, stand up a minimal one. Three shapes cover most starting points: a LiteLLM config that gives you multi-provider routing and fallback declaratively, a hand-rolled fallback router so you can see exactly what an *llm proxy* is doing under the hood, and the client call that's identical whether you point at LiteLLM, a DIY proxy, or a managed gateway — because they're all OpenAI-compatible. Start with the DIY router to build intuition, then graduate to LiteLLM when you want caching and a UI without writing them yourself.

## When you do NOT need an LLM gateway

The most useful thing a non-vendor can tell you is when to skip the gateway entirely. If you call one model from one provider, aren't doing per-team cost attribution, and have no compliance requirement for centralized logging, a gateway adds a network hop, a new dependency, and a single point of failure for no benefit. The provider SDK already handles retries. You can add a gateway in an afternoon the day you actually cross a threshold — there's no architectural penalty for waiting, and a real one for adopting infrastructure ahead of the need. Before you add any shared infrastructure, it's worth running an honest [AI readiness assessment](/blog/ai-readiness-assessment/) to confirm the need is real and not anticipatory.

## FAQ — LLM gateway questions, answered straight

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/
