LLM gateway: what it does and when you need one

Every team that runs more than one large language model in production eventually draws the same box on its architecture diagram: a single layer sitting between the application and the providers, with every model call passing through it. That box is the LLM gateway. The search results for the term are a wall of vendors explaining the concept right up to the point where the explanation becomes "...and that's why you should buy ours." We don't sell a gateway. We build LLM systems, and we've put gateways in front of them, watched them become the most load-bearing box on the diagram, and ripped a few out when they turned out to be solving a problem the team didn't have yet.

This is the guide we wish existed when we were making that build-versus-buy call: what an LLM gateway actually does, how a request moves through one, where it earns its keep on cost and governance, and the honest answer to the question every vendor page avoids, which is when you don't need one at all. The keyword has an $18 cost-per-click, which tells you the audience reading it is evaluating a purchase. So we'll treat you like a buyer, not a lead.

LLM gateway in one paragraph — and the one question it's really answering

An LLM gateway is a proxy that sits between your applications and one or more model providers and gives you a single, consistent API in front of all of them. Instead of each service holding its own OpenAI key, its own Anthropic key, and its own retry logic, every call goes to the gateway, which authenticates it, decides which model should serve it, optionally checks a cache, applies guardrails, forwards the request to the provider, and logs the whole thing. The pattern is borrowed wholesale from the API gateway that's sat in front of web services for a decade: cross-cutting concerns like authentication and rate limiting don't belong inside every app, so you pull them out into one box.

The question an LLM gateway is really answering isn't "how do I call a model," because the provider SDKs do that fine. It's "how do I run model traffic across a growing number of providers, teams, and products without re-implementing routing, fallback, cost tracking, and safety in every codebase, and without being married to a single vendor." If you only ever call one model from one provider, that question doesn't bite yet. The moment you have a second provider or a second team, it does.

What an LLM gateway actually does: the five jobs under the hood

Strip away the marketing and a gateway does five jobs. Most products bundle all five; the cheaper open-source proxies do the first two well and bolt the rest on. Knowing which jobs you actually need is the whole build-versus-buy decision, so it's worth being precise about each.

Figure: the five jobs a request passes through inside the gateway. Client (one API key) → Auth + quota (who + budget) → Route (model choice) → Cache (exact / semantic) → Guardrail (in + out) → Provider (OpenAI / Anthropic / ...) → Log + meter (trace + cost).

Job one is access unification: present an OpenAI-compatible API so every downstream service speaks one dialect, while the gateway translates to whatever each provider expects. Job two is routing and fallback, which means picking a model per request and failing over to another when the first errors or times out. Job three is cost control: metering spend per team, caching responses, and routing cheap workloads to cheap models. Job four is governance, covering auth, rate limits, prompt-injection and PII guardrails, and audit logging. Job five is observability: the traces, token counts, and latency you can actually attribute. The phrase AI gateway usually means a gateway that does all five plus tool and agent coordination, a distinction we come back to later.

The request path: how a call moves through an LLM gateway

It helps to follow a single request end to end. A chat completion leaves your service with the gateway's API key, not the provider's. The gateway authenticates the caller and checks its budget. It applies the routing rule — maybe "send this customer-support traffic to Claude Sonnet 4.6, fall back to GPT-5 mini." Before it spends a token it checks the cache: an exact-match cache for identical prompts, or a semantic cache that embeds the prompt and looks for a near-duplicate. On a miss, input guardrails scan for prompt injection and PII, the request goes to the provider, output guardrails scan the response, and the whole trace — model, tokens, latency, cost, cache status — lands in the log before the answer streams back to your service.
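The path above can be sketched as one function. This is a minimal illustration under stated assumptions, not how any particular gateway is implemented: `handle`, `Trace`, the request-count budget, and the one-phrase injection check are all hypothetical stand-ins, and the provider is passed in as a plain callable so the sketch runs without a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """One log record per request: what the gateway did on the way through."""
    team: str
    route: str
    model: Optional[str] = None
    cache_hit: bool = False
    steps: list = field(default_factory=list)


def handle(team: str, route: str, prompt: str, *,
           budgets: dict, routes: dict, cache: dict,
           provider: Callable[[str, str], str], log: list) -> str:
    trace = Trace(team=team, route=route)
    try:
        # 1. Auth + quota: unknown callers and exhausted budgets stop here.
        #    (Assumption: the budget is a simple count of remaining requests.)
        if budgets.get(team, 0) <= 0:
            trace.steps.append("rejected")
            raise PermissionError(f"no budget for team {team!r}")
        trace.steps.append("auth")

        # 2. Route: resolve the logical route to a concrete model.
        trace.model = routes[route]
        trace.steps.append("route")

        # 3. Cache: an exact-match hit skips everything after this point.
        key = (trace.model, prompt)
        if key in cache:
            trace.cache_hit = True
            trace.steps.append("cache_hit")
            return cache[key]
        trace.steps.append("cache_miss")

        # 4. Input guardrail, then the provider call. An output guardrail
        #    would scan `answer` right after the call; it is omitted here.
        if "ignore previous instructions" in prompt.lower():
            trace.steps.append("blocked")
            raise ValueError("input guardrail rejected the prompt")
        answer = provider(trace.model, prompt)
        trace.steps.append("provider")

        budgets[team] -= 1
        cache[key] = answer
        return answer
    finally:
        # 5. Log + meter: the trace is written on every path, failures included.
        log.append(trace)
```

The `finally` block is the design point worth copying: rejected and blocked requests still produce a trace, which is what makes the log usable for audit.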

One request, end to end, through the gateway

The happy path runs left to right; the dashed edge is the fallback that fires when the primary provider errors or times out. The cache short-circuits everything to the right of it on a hit.
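To make the exact-versus-semantic distinction concrete, here is a small two-tier cache sketch. Everything in it is an assumption for illustration: `embed` is a toy word-count vector standing in for a real embedding model, the 0.8 threshold is arbitrary, and a production cache would use a vector store rather than a linear scan.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real gateway would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed cut-off; tune it on real traffic
        self.exact = {}             # prompt string -> answer
        self.semantic = []          # list of (embedding, answer)

    def get(self, prompt: str):
        """Return (answer, tier) where tier is "exact", "semantic", or "miss"."""
        if prompt in self.exact:
            return self.exact[prompt], "exact"
        query = embed(prompt)
        best = max(self.semantic, key=lambda item: cosine(query, item[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def put(self, prompt: str, answer: str) -> None:
        self.exact[prompt] = answer
        self.semantic.append((embed(prompt), answer))
```

The exact tier is cheap and safe; the semantic tier is where the hit rate comes from and also where a too-low threshold starts returning wrong answers, so the threshold deserves the same testing as a routing rule.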

Routing and fallback: the core of every LLM gateway

Routing is the feature people actually buy a gateway for. LLM routing comes in a few flavors. Static routing pins a workload to a model. Fallback routing tries a primary and drops to a secondary whenever the first errors or hits a rate limit. Latency-based routing races providers or picks the fastest healthy one. Cost-based routing sends cheap, low-stakes calls to a small model and reserves the frontier model for the hard ones. The mistake teams make is treating routing as a model-quality decision when it's really a reliability decision: the reason you want fallback isn't that GPT-5 is better than Claude on Tuesday. It's that providers have outages and rate limits, and a gateway turns a hard provider failure into a soft model swap.

Here's the thing every vendor page glosses over: fallback only helps if your downstream code is provider-agnostic. If your prompts are tuned to one model's quirks, a fallback to a different family can degrade quality silently. The same advice applies to the broader case where you're already routing multi-agent traffic; we cover how that traffic flows in our piece on multi-agent orchestration patterns, and the gateway is where that orchestration's reliability lives. Test your fallback path under load before you trust it.

gateway_client.py python

# An LLM gateway is OpenAI-compatible, so your existing client just
# points at the gateway's base_url. Routing + fallback live in the
# gateway config, not your app code.
import os

from openai import OpenAI

GATEWAY_KEY = os.environ["GATEWAY_KEY"]  # issued by the gateway, not a provider key
user_msg = "Where is my order?"           # placeholder input

client = OpenAI(
    base_url="https://gateway.internal/v1",  # the gateway, not api.openai.com
    api_key=GATEWAY_KEY,                       # one key, scoped per team
)

resp = client.chat.completions.create(
    # "model" is a logical route the gateway resolves, not a hard model id.
    model="support-tier",  # gateway: primary=claude-sonnet-4-6, fallback=gpt-5-mini
    messages=[{"role": "user", "content": user_msg}],
    extra_headers={"x-team": "support"},  # so spend is attributed correctly
)
# If claude-sonnet-4-6 errors or times out, the gateway transparently
# retries on gpt-5-mini. Your code never sees the failover.
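Because the client never sees the failover, the fallback path is easy to leave untested. One way to exercise it without a live provider is to inject failures into a fake primary and count how much traffic lands on the fallback. The sketch below assumes nothing about a real gateway: `ProviderDown`, `flaky`, and `fallback_share` are hypothetical helpers.

```python
class ProviderDown(Exception):
    """Stands in for a provider timeout, 5xx, or rate-limit response."""


def complete_with_fallback(targets, prompt):
    """Try each (name, call) target in order; return (name, answer) from the first that works."""
    failed = []
    for name, call in targets:
        try:
            return name, call(prompt)
        except ProviderDown:
            failed.append(name)  # fail over to the next target
    raise RuntimeError(f"all targets failed: {failed}")


def flaky(fail_first_n):
    """Build a fake provider that raises on its first `fail_first_n` calls, then succeeds."""
    state = {"calls": 0}

    def call(prompt):
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise ProviderDown("injected failure")
        return f"answer to: {prompt}"

    return call


def fallback_share(n_requests, primary_failures):
    """Fraction of requests served by the fallback when the primary fails `primary_failures` times."""
    targets = [("primary", flaky(primary_failures)), ("fallback", flaky(0))]
    served = [complete_with_fallback(targets, f"q{i}")[0] for i in range(n_requests)]
    return served.count("fallback") / n_requests
```

The fallback share is the number to alert on in production: if prompts are tuned to the primary model, a rising share is the earliest sign that quality may be degrading silently.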

Cost control: how an LLM gateway actually saves money (with the math)

"Cost control" is on every gateway's feature list and quantified on almost none of them, so let's do the math. A gateway saves money three ways, in rough order of impact. First, model routing: sending a classifier-grade or extraction workload from a frontier model to a small hosted model (Haiku 4.5 or a GPT-5-nano-class model) cuts per-call token cost by roughly an order of magnitude at 2026 pricing. That's the single biggest lever the gateway exposes, and it costs nothing to pull beyond writing the route. Second, semantic caching: on high-repeat traffic like FAQ-style support, a semantic cache routinely returns 20–40% hits, and every hit is a provider call you simply don't make. Third, hard budget enforcement: per-team quotas that actually cut off a runaway loop before it bills you for a million tokens.

Relative spend per 1k requests as you stack gateway cost levers

Configuration	Relative spend	Note
Frontier model, no cache (baseline)	100% of baseline	Every call hits the most expensive model
+ route cheap workloads to small model	35% of baseline	Biggest single lever; order-of-magnitude per routed call
+ semantic cache on repeat traffic	24% of baseline	20-40% of calls never reach a provider
+ per-team hard budget caps	22% of baseline	Cuts the tail risk of runaway loops

Those percentages are illustrative ratios, not a promise: your numbers depend entirely on your traffic mix, and a gateway that can't tell you your real cache-hit rate and per-team spend isn't doing the cost job at all. Note what we're not quoting: any engagement price. Per-token and cache-hit math is the right way to anchor a buying decision; a dollar tier on a blog isn't. If you're sizing this kind of decision across multiple teams, our AI automation buyer's guide walks through the cost-attribution questions to ask before you centralize anything.
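To run the same arithmetic on your own traffic mix, a back-of-envelope model is enough. The sketch below is an assumption-laden simplification, not the derivation behind the figures above: it treats all calls as equal in token count, assumes cache hits are spread evenly across routed and unrouted traffic, and leaves out budget caps because they limit tail risk rather than average spend. The function name and parameters are hypothetical.

```python
def relative_spend(routed_fraction: float,
                   small_model_cost_ratio: float,
                   cache_hit_rate: float) -> dict:
    """Spend relative to an all-frontier, no-cache baseline of 1.0.

    routed_fraction:        share of calls moved to the small model (0..1)
    small_model_cost_ratio: small-model cost per call / frontier cost per call
    cache_hit_rate:         share of calls answered from cache (0..1)
    """
    # Unrouted calls still cost 1.0 each; routed calls cost the ratio.
    after_routing = (1 - routed_fraction) + routed_fraction * small_model_cost_ratio
    # Every cache hit is a provider call that is never made.
    after_cache = after_routing * (1 - cache_hit_rate)
    return {
        "baseline": 1.0,
        "after_routing": after_routing,
        "after_cache": after_cache,
    }


# Example inputs chosen only to land near the illustrative 35% and 24% figures:
# 72.5% of calls routed to a model at one tenth the cost, 30% cache hits.
example = relative_spend(routed_fraction=0.725, small_model_cost_ratio=0.10, cache_hit_rate=0.30)
```

Plugging in a measured routed fraction and a measured cache-hit rate, rather than vendor claims, is the point of the exercise.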

Guardrails and governance: the security case for an LLM gateway

The governance argument is the one that gets a gateway funded in regulated shops, and it's genuinely strong. When every model call passes through one box, you get one place to enforce four things that are miserable to enforce per-application: input guardrails that scan for prompt injection and strip PII before it reaches a third-party provider; output guardrails that catch leaked secrets or policy violations; centralized audit logging that records who called what model with what data; and key management so no provider key ever lives in an app's environment. A gateway turns "we think every team handles PII correctly" into "the gateway guarantees it," which is the difference an auditor cares about.
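As a sense of what the simplest input guardrail looks like, here is a regex-based redaction pass that runs before a prompt leaves for a third-party provider. It is a floor, not a real PII detector: the three patterns are illustrative assumptions, they will miss names, addresses, and most non-US formats, and the `redact` helper is hypothetical rather than any product's API.

```python
import re

# Illustrative patterns only; real guardrails combine rules with trained detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text: str):
    """Replace each match with a [LABEL] placeholder; return (clean_text, labels_found)."""
    found = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found
```

Returning the labels alongside the cleaned text lets the gateway write "what was redacted" into the audit log without ever storing the sensitive value itself.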

Observability: what an LLM gateway gives you that the provider dashboard won't

Provider dashboards show you the provider's view: total tokens, total spend, nothing about which of your features or teams drove it. A gateway sits at the chokepoint, so it can attribute every call (by team, by feature, by route, by user) and emit traces you can ship to Langfuse, LangSmith, or your existing OpenTelemetry pipeline into Datadog. That's the difference between "we spent a lot on OpenAI last month" and "the support summarizer is 60% of spend and 90% of it is cacheable." The cost of buying this is a small, measurable latency tax, which you should benchmark rather than assume away.

What good gateway observability looks like (target, 2026)

Metric	Target	Note
Trace coverage	100%	Every call logged, not sampled
Cost attribution	Per-team	Spend split by team/feature/route
P95 added latency	5-30ms	On cache miss; measure before adopting
Export format	OTel	Into Langfuse / Datadog you already run
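Cost attribution is, mechanically, a group-by over the gateway's trace records. The sketch below shows the shape of that roll-up under assumptions: the trace fields (`team`, `feature`, `model`, `tokens`, `cache_hit`) are a hypothetical schema, and the prices passed in are placeholders, not real provider pricing.

```python
from collections import defaultdict


def attribute_spend(traces, price_per_1k):
    """Sum cost by (team, feature) and report each group's share of the total.

    traces:       iterable of dicts with team, feature, model, tokens, cache_hit
    price_per_1k: dict of model name -> price per 1,000 tokens (placeholder values)
    """
    totals = defaultdict(float)
    for t in traces:
        # A cache hit never reached a provider, so it costs nothing.
        cost = 0.0 if t["cache_hit"] else t["tokens"] / 1000 * price_per_1k[t["model"]]
        totals[(t["team"], t["feature"])] += cost
    grand_total = sum(totals.values())
    return {
        key: {"cost": cost, "share": cost / grand_total if grand_total else 0.0}
        for key, cost in totals.items()
    }
```

A statement like "the support summarizer is 60% of spend" is exactly the `share` value for one key in this result.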

LLM gateway vs AI gateway vs MCP gateway: clearing up the confusion

Google's own "people also ask" surfaces "what's the difference between MCP and LLM gateway," which tells you the terms are a mess. Here's the clean version. An LLM gateway routes access to models: completions in, completions out, plus routing, caching, guardrails, and cost tracking. An MCP gateway routes access to tools and data — it sits in front of Model Context Protocol servers and manages which agents can reach which tools. An AI gateway is the umbrella term that's drifting to mean "both, plus agent coordination" — it routes to models and tools and orchestrates the traffic between them. They operate at different layers, and a serious platform often runs more than one.

Layer	What it routes	Example tools	When you need it
LLM gateway	Model access: completions, routing, fallback, caching, guardrails	LiteLLM, Portkey, TrueFoundry, OpenRouter, Helicone	Two+ providers, or cost/governance needs
MCP gateway	Tool + data access for agents via Model Context Protocol	MCP servers behind an access broker	Agents calling many tools with access control
AI gateway	Both models and tools, plus agent-to-agent coordination	Kong AI Gateway, Cloudflare AI Gateway, AWS Bedrock guidance	A platform coordinating models + tools + agents

The three gateways are different layers, not competitors. Most mature AI platforms end up running an LLM gateway, and add the others as agent and tool complexity grows.

Practically: most teams start needing an LLM gateway first, because providers multiply before tools do. LLM orchestration (the agent-to-agent and tool layer) is a later problem, and conflating it with model access early just means you buy a heavier product than you need.

Build vs buy: should you run LiteLLM, buy a gateway, or write your own?

This is the decision the whole post builds to, and the honest answer is that it's decided by your failover and guardrail requirements, not your traffic volume. There are three real options. Write your own: a couple hundred lines of OpenAI-compatible proxy handles routing, fallback, and key management, and you own every line, which is great for a small surface and painful once you want caching and guardrails. Self-host the mature open source: LiteLLM is the most widely adopted OSS proxy, supports 100+ providers, and has a real admin UI; you run it, you patch it, you own the uptime. Buy managed: Portkey or TrueFoundry give you the full feature set with an SLA, at the cost of a vendor in your hot path and a per-request fee.

Requirement	Write your own	Self-host LiteLLM	Buy managed (Portkey / TrueFoundry)
Multi-provider routing + fallback	Yes, ~200 lines	Yes, config-driven	Yes, plus UI
Per-team cost attribution	You build the metering	Built in	Built in, dashboards
Semantic caching	Significant build (embeddings + store)	Available, you wire the store	Managed
Injection / PII guardrails	You own a hard, evolving problem	Plugin ecosystem	Managed + updated
Control-plane SLA / 99.95%	Your on-call now owns it	Your on-call owns it	Vendor's SLA
No third party in the hot path	Met	Met (you host it)	Not met

Read down the column that matches your hard requirements. If you only need the top two rows, a DIY proxy is genuinely fine. The moment caching, guardrails, and an SLA appear, the DIY column stops being honest.

Engineer note

We've made this call across enough engagements to have a default. For a single product calling one or two providers, we write the proxy — it's a day of work, zero new vendor, and we own the failover logic completely. The moment a second team shows up wanting cost attribution, or compliance wants centralized logging, we reach for LiteLLM self-hosted before we reach for a managed vendor, because keeping the gateway inside our own infrastructure removes a third party from the hot path. We only recommend buying managed when the team genuinely can't carry the gateway's on-call — and then we make the vendor prove their fail-open story before anything routes through them. The trap we watch for: teams buying a heavy managed gateway on day one to solve a problem they won't have for a year, and inheriting a single point of failure they didn't need yet.

The single point of failure problem: designing an LLM gateway to fail open

Here's the part that doesn't make it onto the feature lists. A gateway sits in the hot path of every model call, which means it's now a single point of failure for your entire AI surface. Teams discover this the first time the gateway's own control plane has an outage and every product that touches AI goes dark at once — not because a provider failed, but because the box in front of the providers did. The gateway you adopted for reliability just became your reliability ceiling.

A gateway worth running is designed to fail open. That means the data plane keeps forwarding requests even when the control plane (config, dashboards, billing) is degraded; clients have a documented bypass that calls the provider directly when the gateway is unreachable; timeouts are aggressive so a slow gateway doesn't cascade; and you've actually run the game-day where you kill the gateway and confirm traffic still flows. Fail-closed is the default for most products, and it's the wrong default for infrastructure in your hot path. If a gateway can't fail open, it's a liability dressed as infrastructure.

Fail-open vs fail-closed, when the gateway control plane dies

Left: a fail-closed gateway is a hard dependency — control-plane outage means total AI outage. Right: a fail-open design lets the data plane keep serving and gives clients a direct-to-provider bypass, so a gateway outage degrades instead of detonates.
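On the client side, the bypass can be as small as a wrapper that tries the gateway first and calls the provider directly only when the gateway itself is unreachable. This is a sketch under assumptions: `GatewayUnreachable`, `call_fail_open`, and the injected callables are hypothetical, and a real client would map its HTTP library's connect and timeout errors onto that exception.

```python
class GatewayUnreachable(Exception):
    """Stands in for a connect error or timeout when talking to the gateway."""


def call_fail_open(prompt, via_gateway, direct, on_bypass=lambda err: None):
    """Return (answer, path) where path is "gateway" or "bypass".

    via_gateway: callable that sends the prompt through the gateway
    direct:      callable that sends the prompt straight to the provider
    on_bypass:   hook for alerting, since bypassed calls skip gateway logging
    """
    try:
        return via_gateway(prompt), "gateway"
    except GatewayUnreachable as err:
        # Only reachability failures trigger the bypass. A policy rejection
        # (budget, guardrail) is a different exception and must propagate,
        # otherwise the bypass becomes a way around governance.
        on_bypass(err)
        return direct(prompt), "bypass"
```

Bypassed calls skip the gateway's guardrails and metering, so the `on_bypass` hook should page someone rather than log quietly; the bypass is a degraded mode, not a second route.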

A minimal LLM gateway you can stand up this week

If you want to feel what a gateway does before you commit to a vendor, stand up a minimal one. Three shapes cover most starting points: a LiteLLM config that gives you multi-provider routing and fallback declaratively, a hand-rolled fallback router so you can see exactly what an LLM proxy is doing under the hood, and the client call that's identical whether you point at LiteLLM, a DIY proxy, or a managed gateway, because they're all OpenAI-compatible. Start with the DIY router to build intuition, then graduate to LiteLLM when you want caching and a UI without writing them yourself.

litellm_config.yaml yaml

# Declarative multi-provider routing + fallback. Run:
#   litellm --config litellm_config.yaml
model_list:
  - model_name: support-tier        # the logical route your app calls
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_KEY
  - model_name: support-tier-fallback   # distinct name: a shared name would load-balance, not fall back
    litellm_params:
      model: openai/gpt-5-mini
      api_key: os.environ/OPENAI_KEY

router_settings:
  fallbacks: [{ "support-tier": ["support-tier-fallback"] }]
  num_retries: 2
  request_timeout: 30
  # cache: { type: redis }   # add when you want semantic/exact caching

mini_gateway.py python

# ~20 lines: routing + fallback, so you see what a gateway really does.
# call_provider() and log() are placeholders for your own provider wrapper
# and trace sink; call_provider is expected to raise httpx errors on failure.
import time

import httpx

ROUTES = {
    "support-tier": [
        ("anthropic", "claude-sonnet-4-6"),
        ("openai", "gpt-5-mini"),  # fallback
    ]
}

def complete(route: str, messages: list) -> dict:
    last_err = None
    for provider, model in ROUTES[route]:
        try:
            t0 = time.perf_counter()
            resp = call_provider(provider, model, messages, timeout=30)
            log(route, provider, model, tokens=resp["usage"], ms=(time.perf_counter() - t0) * 1000)
            return resp
        # TransportError covers timeouts and connection failures; HTTPStatusError covers non-2xx.
        except (httpx.TransportError, httpx.HTTPStatusError) as e:
            last_err = e            # swallow + fail over to next target
    raise RuntimeError(f"all providers failed for {route}") from last_err

app.ts typescript

// Your app never changes when you swap gateways. baseURL is the only knob.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.GATEWAY_URL, // LiteLLM, DIY proxy, or Portkey: same code
  apiKey: process.env.GATEWAY_KEY,
});

const userMsg = "Where is my order?"; // placeholder input

const res = await client.chat.completions.create({
  model: "support-tier",            // logical route, resolved by the gateway
  messages: [{ role: "user", content: userMsg }],
});
// Fallback, caching, cost metering all happen server-side, invisibly.

When you do NOT need an LLM gateway

The most useful thing a non-vendor can tell you is when to skip the gateway entirely. If you call one model from one provider, aren't doing per-team cost attribution, and have no compliance requirement for centralized logging, a gateway adds a network hop, a new dependency, and a single point of failure for no benefit. The provider SDK already handles retries. You can add a gateway in an afternoon the day you actually cross a threshold — there's no architectural penalty for waiting, and a real one for adopting infrastructure ahead of the need. Before you add any shared infrastructure, it's worth running an honest AI readiness assessment to confirm the need is real and not anticipatory.

Add the gateway now

You run two or more providers, or you need per-team cost attribution, semantic caching, injection/PII guardrails, or centralized audit logging for compliance. Any one of these crosses the threshold. The gateway pays for its network hop immediately because it's doing real work on every call.

Not yet — you'd be adding risk

One provider, one or two services, no cost-attribution or compliance pressure, prompts tuned to a single model. A gateway here is a dependency and a single point of failure with no offsetting benefit. The provider SDK's retry logic already covers you. Revisit the day you add a second provider.

FAQ — LLM gateway questions, answered straight

What is an LLM gateway?

An LLM gateway is a proxy between your applications and one or more model providers that gives you a single, OpenAI-compatible API in front of all of them. It handles routing and fallback between models, caching, cost tracking, guardrails, and centralized logging — pulling those cross-cutting concerns out of every app into one box, the same way a traditional API gateway does for web services.

What's the difference between an LLM gateway and an AI gateway?

An LLM gateway routes access to models — completions, routing, caching, guardrails. An AI gateway is the broader umbrella term that increasingly means "models plus tools plus agent-to-agent coordination." In practice most teams need the LLM gateway first because providers multiply before tools and agents do; the AI-gateway layer becomes relevant once you're orchestrating many agents and tools.

What's the difference between an MCP gateway and an LLM gateway?

An LLM gateway manages connections to language models (routing, fallback, caching, cost tracking). An MCP gateway manages connections to tools and data, sitting in front of Model Context Protocol servers and controlling which agents reach which tools. They're different layers — model access versus tool access — and a mature platform may run both.

Should I build my own LLM gateway or buy one?

It's decided by your failover and guardrail requirements, not your traffic volume. If you only need routing and fallback, a ~200-line OpenAI-compatible proxy is genuinely fine. Once you need semantic caching, injection/PII guardrails, per-team cost attribution, and a control-plane SLA, either self-host the mature open source (LiteLLM) or buy managed (Portkey, TrueFoundry). The line is capability, not requests per second.

Is an LLM gateway a single point of failure?

Yes — it sits in the hot path of every model call, so a gateway outage can take down your whole AI surface. A gateway worth running is designed to fail open: the data plane keeps forwarding requests when the control plane is degraded, clients have a documented direct-to-provider bypass, and you've run the game-day to confirm traffic still flows when the gateway is killed. If a gateway can't fail open, it's a liability.

How much latency does an LLM gateway add?

A mature gateway adds roughly 5–30ms of p95 overhead on a cache miss as of 2026 — small against a multi-second LLM call, but real, and you should measure it on your own infrastructure before adopting. On a cache hit the gateway is faster than the provider because it returns without a model call at all.

Which LLM gateways are worth looking at?

LiteLLM is the most widely adopted open-source proxy and supports 100+ providers; Portkey and TrueFoundry are mature managed options; OpenRouter and Helicone cover routing and observability respectively; Kong AI Gateway and Cloudflare AI Gateway extend existing API-gateway platforms; AWS publishes reference guidance for a multi-provider gateway. Shortlist by your hard requirements (caching, guardrails, SLA, hot-path tolerance) rather than by feature-list length.

LLM DEVELOPMENT

Put the right gateway in front of your models — or none at all.

We design LLM infrastructure that fails open and earns its place in the hot path, and we'll tell you when you don't need a gateway yet. If you're routing real model traffic, talk to the team that has made this call across 200+ engagements.
