LLM gateway: what it does and when you need one
An infra buyer's guide to the LLM gateway: routing, fallback, cost control, guardrails and observability, the build-vs-buy decision, and when you don't need one.
Every team that runs more than one large language model in production eventually draws the same box on its architecture diagram: a single layer sitting between the application and the providers, with every model call passing through it. That box is the LLM gateway. The search results for the term are a wall of vendors explaining the concept right up to the point where the explanation becomes "...and that's why you should buy ours." We don't sell a gateway. We build LLM systems, and we've put gateways in front of them, watched them become the most load-bearing box on the diagram, and ripped a few out when they turned out to be solving a problem the team didn't have yet.
This is the guide we wish existed when we were making that build-versus-buy call: what an LLM gateway actually does, how a request moves through one, where it earns its keep on cost and governance, and the honest answer to the question every vendor page avoids, which is when you don't need one at all. The keyword has an $18 cost-per-click, which tells you the audience reading it is evaluating a purchase. So we'll treat you like a buyer, not a lead.
LLM gateway in one paragraph — and the one question it's really answering
An LLM gateway is a proxy that sits between your applications and one or more model providers and gives you a single, consistent API in front of all of them. Instead of each service holding its own OpenAI key, its own Anthropic key, and its own retry logic, every call goes to the gateway, which authenticates it, decides which model should serve it, optionally checks a cache, applies guardrails, forwards the request to the provider, and logs the whole thing. The pattern is borrowed wholesale from the API gateway that's sat in front of web services for a decade: cross-cutting concerns like authentication and rate limiting don't belong inside every app, so you pull them out into one box.
The question an LLM gateway is really answering isn't "how do I call a model," because the provider SDKs do that fine. It's "how do I run model traffic across a growing number of providers, teams, and products without re-implementing routing, fallback, cost tracking, and safety in every codebase, and without being married to a single vendor." If you only ever call one model from one provider, that question doesn't bite yet. The moment you have a second provider or a second team, it does.
What an LLM gateway actually does: the five jobs under the hood
Strip away the marketing and a gateway does five jobs. Most products bundle all five; the cheaper open-source proxies do the first two well and bolt the rest on. Knowing which jobs you actually need is the whole build-versus-buy decision, so it's worth being precise about each.
Job one is access unification: present an OpenAI-compatible API so every downstream service speaks one dialect, while the gateway translates to whatever each provider expects. Job two is routing and fallback, which means picking a model per request and failing over to another when the first errors or times out. Job three is cost control: metering spend per team, caching responses, and routing cheap workloads to cheap models. Job four is governance, covering auth, rate limits, prompt-injection and PII guardrails, and audit logging. Job five is observability: the traces, token counts, and latency you can actually attribute. The phrase "AI gateway" usually means a gateway that does all five plus tool and agent coordination, a distinction we come back to later.
The request path: how a call moves through an LLM gateway
It helps to follow a single request end to end. A chat completion leaves your service with the gateway's API key, not the provider's. The gateway authenticates the caller and checks its budget. It applies the routing rule — maybe "send this customer-support traffic to Claude Sonnet 4.6, fall back to GPT-5 mini." Before it spends a token it checks the cache: an exact-match cache for identical prompts, or a semantic cache that embeds the prompt and looks for a near-duplicate. On a miss, input guardrails scan for prompt injection and PII, the request goes to the provider, output guardrails scan the response, and the whole trace — model, tokens, latency, cost, cache status — lands in the log before the answer streams back to your service.
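The request path above can be sketched as a single function. This is a toy illustration under stated assumptions, not how any particular gateway is implemented: `ROUTES`, `BUDGET_USD`, `fake_provider`, and the email regex are all hypothetical stand-ins, and only the exact-match cache is shown.

```python
import hashlib
import re
import time

# Everything here is an illustrative stand-in, not a real gateway's API.
ROUTES = {"support": "primary-model"}   # hypothetical team -> model route
BUDGET_USD = {"support": 5.00}          # hypothetical remaining budget per team
CACHE: dict[str, str] = {}              # exact-match cache: hash -> answer
TRACES: list[dict] = []                 # stand-in for the log sink
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # deliberately naive PII pattern


def fake_provider(model: str, prompt: str) -> str:
    """Placeholder for the real provider call."""
    return f"[{model}] echo: {prompt}"


def handle(team: str, prompt: str) -> str:
    t0 = time.perf_counter()
    # 1. Authenticate the caller and check its budget.
    if BUDGET_USD.get(team, 0.0) <= 0:
        raise PermissionError(f"unknown team or exhausted budget: {team}")
    # 2. Apply the routing rule.
    model = ROUTES[team]
    # 3. Check the exact-match cache before spending a token.
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    hit = key in CACHE
    if hit:
        answer = CACHE[key]
    else:
        # 4. Input guardrail, provider call, output guardrail.
        safe_prompt = EMAIL.sub("[EMAIL]", prompt)
        answer = EMAIL.sub("[EMAIL]", fake_provider(model, safe_prompt))
        CACHE[key] = answer
    # 5. Log the trace before the answer goes back to the caller.
    TRACES.append({
        "team": team,
        "model": model,
        "cache": "hit" if hit else "miss",
        "ms": (time.perf_counter() - t0) * 1000,
    })
    return answer
```

Calling `handle("support", ...)` twice with the same prompt records a miss and then a hit in `TRACES`, which is the cache-status field the trace is supposed to carry.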
Routing and fallback: the core of every LLM gateway
Routing is the feature people actually buy a gateway for. LLM routing comes in a few flavors. Static routing pins a workload to a model. Fallback routing tries a primary and drops to a secondary whenever the first errors or hits a rate limit. Latency-based routing races providers or picks the fastest healthy one. Cost-based routing sends cheap, low-stakes calls to a small model and reserves the frontier model for the hard ones. The mistake teams make is treating routing as a model-quality decision when it's really a reliability decision: the reason you want fallback isn't that GPT-5 is better than Claude on Tuesday. It's that providers have outages and rate limits, and a gateway turns a hard provider failure into a soft model swap.
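One way to picture the four flavors is as four ways of ordering the same candidate list. The sketch below is a simplification under our own assumptions: the `Target` fields, the strategy names, and the numbers in the usage note are hypothetical, and the latency strategy picks the fastest healthy target; it does not race providers.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str            # hypothetical model label
    healthy: bool        # result of some health check, assumed to exist
    p50_ms: float        # recent median latency
    usd_per_mtok: float  # blended price per million tokens


def order_targets(targets: list[Target], strategy: str) -> list[Target]:
    """Return the targets to try, in order, for one request."""
    if strategy == "static":
        return targets[:1]                      # pinned: no failover at all
    healthy = [t for t in targets if t.healthy]
    if strategy == "fallback":
        return healthy                          # configured order, skipping dead targets
    if strategy == "latency":
        return sorted(healthy, key=lambda t: t.p50_ms)
    if strategy == "cost":
        return sorted(healthy, key=lambda t: t.usd_per_mtok)
    raise ValueError(f"unknown strategy: {strategy}")
```

Given a slow expensive model, a fast cheap one, and an unhealthy third, `"fallback"` keeps the configured order while `"latency"` and `"cost"` both put the cheap fast model first. The unhealthy target drops out of every strategy except `"static"`, which is the reliability point made above.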
Here's the thing every vendor page glosses over: fallback only helps if your downstream code is provider-agnostic. If your prompts are tuned to one model's quirks, a fallback to a different family can degrade quality silently. The same advice applies to the broader case where you're already routing multi-agent traffic; we cover how that traffic flows in our piece on multi-agent orchestration patterns, and the gateway is where that orchestration's reliability lives. Test your fallback path under load before you trust it.
# An LLM gateway is OpenAI-compatible, so your existing client just
# points at the gateway's base_url. Routing + fallback live in the
# gateway config, not your app code.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.internal/v1",  # the gateway, not api.openai.com
    api_key=os.environ["GATEWAY_KEY"],       # one key, scoped per team
)

user_msg = "Where is my order?"  # placeholder input

resp = client.chat.completions.create(
    # "model" is a logical route the gateway resolves, not a hard model id.
    model="support-tier",  # gateway: primary=claude-sonnet-4-6, fallback=gpt-5-mini
    messages=[{"role": "user", "content": user_msg}],
    extra_headers={"x-team": "support"},  # so spend is attributed correctly
)
# If claude-sonnet-4-6 errors or times out, the gateway transparently
# retries on gpt-5-mini. Your code never sees the failover.

Cost control: how an LLM gateway actually saves money (with the math)
"Cost control" is on every gateway's feature list and quantified on almost none of them, so let's do the math. A gateway saves money three ways, in rough order of impact. First, model routing: sending a classifier-grade or extraction workload from a frontier model to a small hosted model (Haiku 4.5 or a GPT-5-nano-class model) cuts per-call token cost by roughly an order of magnitude at 2026 pricing. That's the single biggest lever the gateway exposes, and it costs nothing to pull beyond writing the route. Second, semantic caching: on high-repeat traffic like FAQ-style support, a semantic cache routinely returns 20–40% hits, and every hit is a provider call you simply don't make. Third, hard budget enforcement: per-team quotas that actually cut off a runaway loop before it bills you for a million tokens.
These ratios are illustrative, not a promise: your numbers depend entirely on your traffic mix, and a gateway that can't tell you your real cache-hit rate and per-team spend isn't doing the cost job at all. Note what we're not quoting: any engagement price. Per-token and cache-hit math is the right way to anchor a buying decision; a dollar tier on a blog isn't. If you're sizing this kind of decision across multiple teams, our AI automation buyer's guide walks through the cost-attribution questions to ask before you centralize anything.
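The first two levers reduce to back-of-envelope arithmetic you can run against your own traffic. The sketch below is ours, and every number in it is a made-up input: the per-call prices simply encode the "order of magnitude" gap, the 50% routing share is arbitrary, and the 30% hit rate is the midpoint of the range quoted above. It also assumes a cache hit costs nothing, which ignores embedding and storage costs.

```python
def monthly_cost(
    calls: int,
    frontier_usd: float,          # average cost of one frontier-model call
    small_usd: float,             # average cost of one small-model call
    share_to_small: float = 0.0,  # fraction of calls routed to the small model
    cache_hit_rate: float = 0.0,  # fraction of calls served from cache
) -> float:
    """Blended monthly spend; assumes a cache hit is free."""
    billable_calls = calls * (1.0 - cache_hit_rate)
    blended_per_call = share_to_small * small_usd + (1.0 - share_to_small) * frontier_usd
    return billable_calls * blended_per_call


# Hypothetical inputs: replace with your own measured values.
CALLS = 1_000_000
FRONTIER_USD = 0.010
SMALL_USD = 0.001

baseline = monthly_cost(CALLS, FRONTIER_USD, SMALL_USD)
routed = monthly_cost(CALLS, FRONTIER_USD, SMALL_USD, share_to_small=0.5)
routed_cached = monthly_cost(
    CALLS, FRONTIER_USD, SMALL_USD, share_to_small=0.5, cache_hit_rate=0.3
)
print(f"baseline ${baseline:,.0f}, routed ${routed:,.0f}, routed + cached ${routed_cached:,.0f}")
```

With these invented inputs the bill goes from $10,000 to about $5,500 with routing alone and about $3,850 with the cache on top. Budget enforcement doesn't appear in the formula because it caps the worst case; it doesn't lower the average.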
Guardrails and governance: the security case for an LLM gateway
The governance argument is the one that gets a gateway funded in regulated shops, and it's genuinely strong. When every model call passes through one box, you get one place to enforce four things that are miserable to enforce per-application: input guardrails that scan for prompt injection and strip PII before it reaches a third-party provider; output guardrails that catch leaked secrets or policy violations; centralized audit logging that records who called what model with what data; and key management so no provider key ever lives in an app's environment. A gateway turns "we think every team handles PII correctly" into "the gateway guarantees it," which is the difference an auditor cares about.
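To make the input and output guardrails concrete, here is a deliberately naive sketch. The regexes, the injection phrases, the `sk-` secret shape, and the `Verdict` type are all our own illustrative assumptions. Real guardrails use trained classifiers and maintained pattern sets, so treat this as showing where the checks sit and not as something that is safe to ship.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only; real PII detection needs far more than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
SECRET_PATTERN = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # hypothetical key shape
INJECTION_HINTS = ("ignore previous instructions", "disregard the system prompt")


@dataclass
class Verdict:
    text: str                                   # text after redaction
    findings: list[str] = field(default_factory=list)
    blocked: bool = False


def check_input(prompt: str) -> Verdict:
    """Flag likely injection and strip PII before the prompt leaves your network."""
    verdict = Verdict(text=prompt)
    lowered = prompt.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        verdict.findings.append("possible_injection")
        verdict.blocked = True
    for label, pattern in PII_PATTERNS.items():
        verdict.text, count = pattern.subn(f"[{label.upper()}]", verdict.text)
        if count:
            verdict.findings.append(label)
    return verdict


def check_output(completion: str) -> Verdict:
    """Redact anything that looks like a leaked secret in the model's response."""
    text, count = SECRET_PATTERN.subn("[SECRET]", completion)
    return Verdict(text=text, findings=["secret"] if count else [])
```

The `findings` list is what would feed the audit log: it records that something was redacted or blocked without storing the sensitive value itself.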
Observability: what an LLM gateway gives you that the provider dashboard won't
Provider dashboards show you the provider's view: total tokens, total spend, nothing about which of your features or teams drove it. A gateway sits at the chokepoint, so it can attribute every call (by team, by feature, by route, by user) and emit traces you can ship to Langfuse, LangSmith, or your existing OpenTelemetry pipeline into Datadog. That's the difference between "we spent a lot on OpenAI last month" and "the support summarizer is 60% of spend and 90% of it is cacheable." The cost of buying this is a small, measurable latency tax, which you should benchmark rather than assume away.
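Attribution is mostly a group-by over the traces the gateway already records. A minimal sketch, assuming each trace is a dict carrying a `cost_usd` value plus whatever dimensions you tag (the field names and sample records here are hypothetical):

```python
from collections import defaultdict


def spend_share(traces: list[dict], dimension: str) -> list[tuple[str, float, float]]:
    """Return (name, usd, share_of_total) per value of `dimension`, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for trace in traces:
        totals[trace[dimension]] += trace["cost_usd"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, usd, usd / grand_total if grand_total else 0.0)
        for name, usd in ranked
    ]


# Hypothetical trace records, as a gateway might log them.
TRACES = [
    {"team": "support", "feature": "summarizer", "cost_usd": 6.0},
    {"team": "support", "feature": "triage", "cost_usd": 1.0},
    {"team": "growth", "feature": "email-draft", "cost_usd": 3.0},
]

for name, usd, share in spend_share(TRACES, "feature"):
    print(f"{name}: ${usd:.2f} ({share:.0%})")
```

The same function answers the per-team question by passing `"team"` as the dimension, which is the view a provider dashboard can't give you because it never sees those tags.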
LLM gateway vs AI gateway vs MCP gateway: clearing up the confusion
Google's own "people also ask" surfaces "what's the difference between MCP and LLM gateway," which tells you the terms are a mess. Here's the clean version. An LLM gateway routes access to models: completions in, completions out, plus routing, caching, guardrails, and cost tracking. An MCP gateway routes access to tools and data — it sits in front of Model Context Protocol servers and manages which agents can reach which tools. An AI gateway is the umbrella term that's drifting to mean "both, plus agent coordination" — it routes to models and tools and orchestrates the traffic between them. They operate at different layers, and a serious platform often runs more than one.
| Layer | What it routes | Example tools | When you need it |
|---|---|---|---|
| LLM gateway | Model access: completions, routing, fallback, caching, guardrails | LiteLLM, Portkey, TrueFoundry, OpenRouter, Helicone | Two+ providers, or cost/governance needs |
| MCP gateway | Tool + data access for agents via Model Context Protocol | MCP servers behind an access broker | Agents calling many tools with access control |
| AI gateway | Both models and tools, plus agent-to-agent coordination | Kong AI Gateway, Cloudflare AI Gateway, AWS Bedrock guidance | A platform coordinating models + tools + agents |
Practically: most teams start needing an LLM gateway first, because providers multiply before tools do. LLM orchestration (the agent-to-agent and tool layer) is a later problem, and conflating it with model access early just means you buy a heavier product than you need.
Build vs buy: should you run LiteLLM, buy a gateway, or write your own?
This is the decision the whole post builds to, and the honest answer is that it's decided by your failover and guardrail requirements, not your traffic volume. There are three real options. Write your own: a couple hundred lines of OpenAI-compatible proxy handles routing, fallback, and key management, and you own every line, which is great for a small surface and painful once you want caching and guardrails. Self-host the mature open source: LiteLLM is the most widely adopted OSS proxy, supports 100+ providers, and has a real admin UI; you run it, you patch it, you own the uptime. Buy managed: Portkey or TrueFoundry give you the full feature set with an SLA, at the cost of a vendor in your hot path and a per-request fee.
| Requirement | Write your own | Self-host LiteLLM | Buy managed (Portkey / TrueFoundry) |
|---|---|---|---|
| Multi-provider routing + fallback | Yes, ~200 lines | Yes, config-driven | Yes, plus UI |
| Per-team cost attribution | You build the metering | Built in | Built in, dashboards |
| Semantic caching | Significant build (embeddings + store) | Available, you wire the store | Managed |
| Injection / PII guardrails | You own a hard, evolving problem | Plugin ecosystem | Managed + updated |
| Control-plane SLA / 99.95% | Your on-call now owns it | Your on-call owns it | Vendor's SLA |
| No third party in the hot path | Met | Met (you host it) | Not met |
The single point of failure problem: designing an LLM gateway to fail open
Here's the part that doesn't make it onto the feature lists. A gateway sits in the hot path of every model call, which means it's now a single point of failure for your entire AI surface. Teams discover this the first time the gateway's own control plane has an outage and every product that touches AI goes dark at once — not because a provider failed, but because the box in front of the providers did. The gateway you adopted for reliability just became your reliability ceiling.
A gateway worth running is designed to fail open. That means the data plane keeps forwarding requests even when the control plane (config, dashboards, billing) is degraded; clients have a documented bypass that calls the provider directly when the gateway is unreachable; timeouts are aggressive so a slow gateway doesn't cascade; and you've actually run the game-day where you kill the gateway and confirm traffic still flows. Fail-closed is the default for most products, and it's the wrong default for infrastructure in your hot path. If a gateway can't fail open, it's a liability dressed as infrastructure.
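The client-side bypass can be as small as a wrapper around two callables. This is a sketch under our own assumptions, not a pattern any vendor documents: `call_gateway` and `call_provider_direct` are hypothetical functions you would supply, with the aggressive timeout set inside `call_gateway`. The wrapper catches Python's built-in `TimeoutError` and `ConnectionError`; with the OpenAI Python SDK you would catch `openai.APIConnectionError` and `openai.APITimeoutError` in their place.

```python
from typing import Callable, Optional

Messages = list[dict]
Call = Callable[[Messages], dict]


def complete_fail_open(
    call_gateway: Call,              # hypothetical: calls the gateway with a short timeout
    call_provider_direct: Call,      # hypothetical: calls the provider, bypassing the gateway
    messages: Messages,
    on_bypass: Optional[Callable[[Exception], None]] = None,
) -> dict:
    """Prefer the gateway; go direct only when the gateway itself is unreachable."""
    try:
        return call_gateway(messages)
    except (TimeoutError, ConnectionError) as exc:
        # The gateway is down or too slow. Record the bypass, then fail open.
        if on_bypass is not None:
            on_bypass(exc)
        return call_provider_direct(messages)
```

A bypassed call skips the gateway's guardrails and metering and needs a provider credential the client can reach in an emergency, so count every bypass through `on_bypass` and alert on it. In a regulated shop, it is a judgment call whether that trade-off is acceptable.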
A minimal LLM gateway you can stand up this week
If you want to feel what a gateway does before you commit to a vendor, stand up a minimal one. Three shapes cover most starting points: a LiteLLM config that gives you multi-provider routing and fallback declaratively, a hand-rolled fallback router so you can see exactly what an LLM proxy is doing under the hood, and the client call that's identical whether you point at LiteLLM, a DIY proxy, or a managed gateway, because they're all OpenAI-compatible. Start with the DIY router to build intuition, then graduate to LiteLLM when you want caching and a UI without writing them yourself.
# Declarative multi-provider routing + fallback. Run:
#   litellm --config litellm_config.yaml
model_list:
  - model_name: support-tier            # the logical route your app calls
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_KEY
  - model_name: support-tier-fallback   # own group: entries sharing a model_name are load-balanced, not ordered
    litellm_params:
      model: openai/gpt-5-mini
      api_key: os.environ/OPENAI_KEY
router_settings:
  fallbacks: [{ "support-tier": ["support-tier-fallback"] }]
  num_retries: 2
  request_timeout: 30
# cache: { type: redis }   # add when you want semantic/exact caching

# ~30 lines: routing + fallback, so you see what a gateway really does.
import time

import httpx

# call_provider() and log() are placeholders for your own provider client and log sink.
ROUTES = {
    "support-tier": [
        ("anthropic", "claude-sonnet-4-6"),
        ("openai", "gpt-5-mini"),  # fallback
    ]
}


def complete(route: str, messages: list) -> dict:
    last_err = None
    for provider, model in ROUTES[route]:
        try:
            t0 = time.perf_counter()
            resp = call_provider(provider, model, messages, timeout=30)
            elapsed_ms = (time.perf_counter() - t0) * 1000
            log(route, provider, model, tokens=resp["usage"], ms=elapsed_ms)
            return resp
        # TransportError covers timeouts and connection failures.
        except (httpx.TransportError, httpx.HTTPStatusError) as e:
            last_err = e  # record it and fail over to the next target
    raise RuntimeError(f"all providers failed for {route}") from last_err

// Your app never changes when you swap gateways. baseURL is the only knob.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.GATEWAY_URL, // LiteLLM, DIY proxy, or Portkey: same code
  apiKey: process.env.GATEWAY_KEY,
});

const userMsg = "Where is my order?"; // placeholder input

const res = await client.chat.completions.create({
  model: "support-tier", // logical route, resolved by the gateway
  messages: [{ role: "user", content: userMsg }],
});
// Fallback, caching, cost metering all happen server-side, invisibly.

When you do NOT need an LLM gateway
The most useful thing a non-vendor can tell you is when to skip the gateway entirely. If you call one model from one provider, aren't doing per-team cost attribution, and have no compliance requirement for centralized logging, a gateway adds a network hop, a new dependency, and a single point of failure for no benefit. The provider SDK already handles retries. You can add a gateway in an afternoon the day you actually cross a threshold — there's no architectural penalty for waiting, and a real one for adopting infrastructure ahead of the need. Before you add any shared infrastructure, it's worth running an honest AI readiness assessment to confirm the need is real and not anticipatory.
You need a gateway when: you run two or more providers, or you need per-team cost attribution, semantic caching, injection/PII guardrails, or centralized audit logging for compliance. Any one of these crosses the threshold. The gateway pays for its network hop immediately because it's doing real work on every call.
You can skip it when: you have one provider, one or two services, no cost-attribution or compliance pressure, and prompts tuned to a single model. A gateway here is a dependency and a single point of failure with no offsetting benefit. The provider SDK's retry logic already covers you. Revisit the day you add a second provider.
FAQ — LLM gateway questions, answered straight
What is an LLM gateway?
An LLM gateway is a proxy between your applications and one or more model providers that gives you a single, OpenAI-compatible API in front of all of them. It handles routing and fallback between models, caching, cost tracking, guardrails, and centralized logging — pulling those cross-cutting concerns out of every app into one box, the same way a traditional API gateway does for web services.
What's the difference between an LLM gateway and an AI gateway?
An LLM gateway routes access to models — completions, routing, caching, guardrails. An AI gateway is the broader umbrella term that increasingly means "models plus tools plus agent-to-agent coordination." In practice most teams need the LLM gateway first because providers multiply before tools and agents do; the AI-gateway layer becomes relevant once you're orchestrating many agents and tools.
What's the difference between an MCP gateway and an LLM gateway?
An LLM gateway manages connections to language models (routing, fallback, caching, cost tracking). An MCP gateway manages connections to tools and data, sitting in front of Model Context Protocol servers and controlling which agents reach which tools. They're different layers — model access versus tool access — and a mature platform may run both.
Should I build my own LLM gateway or buy one?
It's decided by your failover and guardrail requirements, not your traffic volume. If you only need routing and fallback, a ~200-line OpenAI-compatible proxy is genuinely fine. Once you need semantic caching, injection/PII guardrails, per-team cost attribution, and a control-plane SLA, either self-host the mature open source (LiteLLM) or buy managed (Portkey, TrueFoundry). The line is capability, not requests per second.
Is an LLM gateway a single point of failure?
Yes — it sits in the hot path of every model call, so a gateway outage can take down your whole AI surface. A gateway worth running is designed to fail open: the data plane keeps forwarding requests when the control plane is degraded, clients have a documented direct-to-provider bypass, and you've run the game-day to confirm traffic still flows when the gateway is killed. If a gateway can't fail open, it's a liability.
How much latency does an LLM gateway add?
A mature gateway adds roughly 5–30ms of p95 overhead on a cache miss as of 2026 — small against a multi-second LLM call, but real, and you should measure it on your own infrastructure before adopting. On a cache hit the gateway is faster than the provider because it returns without a model call at all.
Which LLM gateways are worth looking at?
LiteLLM is the most widely adopted open-source proxy and supports 100+ providers; Portkey and TrueFoundry are mature managed options; OpenRouter and Helicone cover routing and observability respectively; Kong AI Gateway and Cloudflare AI Gateway extend existing API-gateway platforms; AWS publishes a reference multi-provider gateway guidance. Shortlist by your hard requirements (caching, guardrails, SLA, hot-path tolerance) rather than by feature-list length.
Put the right gateway in front of your models — or none at all.
We design LLM infrastructure that fails open and earns its place in the hot path, and we'll tell you when you don't need a gateway yet. If you're routing real model traffic, talk to the team that has made this call across 200+ engagements.