LLM Development

Production llm development services on Claude · GPT-5 · Llama 4.

Paiteq is an llm development company shipping custom language model applications with evaluation built in, hosted or self-hosted, with auth, observability, cost guardrails, and model routing wired before the first user lands. Not a notebook with a chat box.

Talk to engineering See engagement shapes

Models Claude · GPT-5 · Llama 4 · Mistral · Qwen

Practice App · Fine-tune · Deploy · Eval

Engage Pilot · Build · Tune · Audit

Compliance SOC-2-ready · HIPAA · on-prem

001 / SURFACES

Eight LLM workloads we ship.

Each surface below is a workload we've shipped to production, with the eval methodology, the model choice, and the cost engineering already worked out.

LLM application development sorts cleanly by workload shape, not by industry. A clinical-note copilot for a 50-person health-tech reuses the same eval methodology, the same model-routing logic, and the same observability stack as a sales-research copilot for an 800-person fin-tech. The integrations differ, the residency posture differs, the prompts differ, but the engineering shape doesn't. Sorting by workload is what lets us reuse the eval harness, the prompt-versioning patterns, and the cost-monitoring playbook across clients instead of reinventing them every engagement.

01 / APP BUILD ↗

Custom LLM applications

Full app build on Claude, GPT-5, Gemini, or Llama 4. Auth, persistence, observability, fallback. Not a notebook with a chat box, production llm application development with eval gates baked in.

AuthPersistEval gates

02 / FINE-TUNE ↗

Fine-tuning

LoRA / QLoRA on Llama 4, Mistral, Qwen. Or hosted fine-tunes on OpenAI / Anthropic. Dataset curation, eval-against-baseline, deploy only if the fine-tune wins.

LoRAQLoRAHosted

03 / EVAL ↗

Eval infrastructure

Eval sets, graders, regression alarms. Build this first, pick the model second. Inspect AI as the harness; RAGAS for retrieval-adjacent; custom graders for app-specific scoring.

Inspect AIRAGAS

04 / DEPLOY ↗

Production deploy

Hosted on the provider, self-hosted on your cloud via vLLM or TGI, or hybrid. Auth, rate-limit, observability, model fallback, cost guardrails, all wired before the first user lands.

vLLMHybrid

05 / ROUTING ↗

Model routing

Cost engineering by routing. A classifier sends easy queries to GPT-5 mini or Haiku; hard queries to GPT-5 or Sonnet. Quality holds via eval gates; spend drops 40–70%.

RouterCache

06 / OBSERVABILITY ↗

LLM observability

Langfuse, Helicone, LangSmith, what we instrument and why. Per-call cost, hallucination scoring, prompt versioning, regression alarms. Production traces feed the eval set monthly.

LangfuseHelicone

07 / SAFETY ↗

Guardrails + safety

Llama Guard, Presidio, custom policy classifiers. Pre-generation input filtering, structured-output enforcement, post-generation classification. Every production deploy ships with an adversarial eval set.

Llama GuardPresidio

08 / MULTIMODAL ↗

Multimodal apps

Vision plus text plus audio. GPT-5, Claude Sonnet 4.6, Gemini Pro for OCR-grade document extraction, image Q&A, voice-driven workflows. Modality choice follows the eval set, not the demo video.

VisionAudio

Heavier surfaces this year: internal copilots, doc extraction, model routing. Lighter: multimodal voice, where the latency budget rules out about half of candidate workloads, and fine-tuned models, where the data prep cost stops many engagements at "the baseline already covers it." We'll talk you out of an LLM-shaped engagement if the workload is actually a retrieval problem, better fit for grounded retrieval pipelines, or an autonomous task-execution problem, better fit for agent loops with state. About 15% of inbound LLM development services inquiries get redirected before the first call.

002 / SERVICES

LLM development services from an LLM development company, pick where to start.

Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done.

Choosing an llm development company is mostly choosing the right starting shape. Buyers who come in with a scoped use case and a way to measure success ship to production around 80% of the time. Buyers who come in with "we want LLM somewhere in the product" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often, pick the one that matches what you have, not what you wish you had.

01 / PILOT ↗

LLM App Pilot

One scoped use case, eval-graded, demoed in 2–4 weeks. Fixed scope. About a third of pilots end at the pilot, that's a feature, not a failure.

2–4 wksFixed scope

02 / BUILD ↗

Production LLM Build

Full app with auth, observability, eval gates, deploy pipeline, and four weeks of post-launch iteration. 8–16 weeks. Fixed scope.

8–16 wks

03 / FINE-TUNE ↗

Fine-tuning engagement

Dataset curation, baseline eval, fine-tune run, head-to-head comparison. Deploy only if it wins; we'll recommend prompt-only if that's enough. 6–10 weeks.

6–10 wks

04 / AUDIT ↗

LLM App Audit

Eval coverage, cost profile, latency, safety posture. Report with a prioritised fix list and a fixed-scope follow-on if you want us to ship the fixes. 2–3 weeks.

2–3 wksAudit

If the workload is scoped but the model quality is unproven, start with a Pilot. If you know the workload works and you need production discipline (observability, eval gates, routing, deploy), start with a Production LLM Build. If you've already shipped LLM and it's underperforming or surprise-billing, start with an Audit. If the data tells you a fine-tune is justified, start with a Fine-tuning engagement, but the audit usually comes first because the eval set has to exist before fine-tuning can be evaluated. Week-by-week scope on each is further down the page.

003 / MODELS

Models, when each one wins.

Model choice follows workload, not house preferences. Cost-adjusted quality against your eval set decides, every time, no exceptions.

Claude
GPT-5
GPT-5 mini
Gemini
Llama 4
Mistral
Qwen
vLLM
TGI
Modal
Replicate
Langfuse
Helicone
Inspect AI
Llama Guard
Presidio
Claude
GPT-5
GPT-5 mini
Gemini
Llama 4
Mistral
Qwen
vLLM
TGI
Modal
Replicate
Langfuse
Helicone
Inspect AI
Llama Guard
Presidio

FRONTIER + OPEN-WEIGHT MODEL PICKS

For each model: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production llm development on every model below. The "when we don't" lines come from real builds, usually a moment in week 4 where the eval set told us to swap.

Claude Sonnet 4.6

Strengths

Strongest tool-call accuracy in our eval set across 2025–2026. Excellent at structured output. Prompt caching cuts cost ~80% on stable system prompts. Vision is competitive with GPT-5 on OCR-grade extraction.

When We Pick

Default for agentic workloads where the model holds state and calls tools. Default for any app that streams JSON. Default when prompt caching unlocks meaningful spend reduction (long stable system prompts).

When We Don't

Workloads that need a 128k+ context window with reliable recall, Gemini 3.0 Pro wins there. Hyper-cost-sensitive batch workloads where Haiku or GPT-5 mini covers the quality bar.

Paiteq Pattern

Our day-one baseline on most production LLM development services. About half the apps we ship in 2026 run on Sonnet plus prompt caching.

Tool-callJSONCache 80%

GPT-5

Strengths

Strongest multimodal, vision, voice, real-time bidirectional audio. Latency-tuned for streaming UIs. Robust function-calling. Batch API at ~50% off list for non-interactive workloads.

When We Pick

Multimodal workloads, voice apps via the Realtime API, and any client with an existing OpenAI contract where procurement won't add Anthropic. Batch summarisation pipelines.

When We Don't

Pure tool-call agentic workloads, Sonnet wins our evals there. Apps with strict refusal policies, GPT-5's safety surface is looser than Claude's by default.

Paiteq Pattern

Default for voice and vision pipelines. We pair it with GPT-5 mini as the routing fallback on most production deploys.

MultimodalRealtimeBatch

GPT-5 mini / Haiku

Strengths

10–20× cheaper per token than the flagships. Latency 2–3× faster. Good enough for ~70% of routing-tier traffic on most apps once a router classifier is in place.

When We Pick

As the cheap-tier in a routed stack. Easy classification, short summarisation, slot-filling extractors. Anywhere the eval set says the small model holds quality.

When We Don't

Hard reasoning, multi-step tool use, long context with citation. We've seen too many builds try to push these down to mini and watch task scores drop 15–25 points.

Paiteq Pattern

Roughly two-thirds of cost engineering work is just plumbing in a router that sends easy queries here. Spend cuts of 40–70% are common, quality holds when the eval set guards the routing thresholds.

CheapFastRouter-tier

Llama 4 / Mistral (self-hosted)

Strengths

Open weights. Data never leaves your perimeter. Fixed infra cost beats per-token billing above ~2M requests/day on dedicated GPUs. Tunable, quantization, LoRA adapters, custom vocab.

When We Pick

Regulated data with hard residency rules (healthcare PHI, finance MNPI, EU AI Act high-risk workloads). Very high-volume batch workloads where the math flips. Anywhere the team has the ops capacity to run an inference service.

When We Don't

Mixed-quality workloads where a flagship's edge matters per call. Teams without Kubernetes / GPU ops capacity, the operational tax shows up around month two.

Paiteq Pattern

We run Llama 4 70B on vLLM with A100 80GB nodes for residency-constrained clients. Sub-200ms TTFT achievable with continuous batching. ~$0.08 per 1K output tokens amortised at the workloads we see.

Open-weightsvLLMResidency

Gemini 3.0 Pro

Strengths

2M-token context window with usable recall. Strong multimodal, video and document understanding at length. Aggressive pricing on long-context workloads. Native Google Cloud integration.

When We Pick

Document-stack workloads where the corpus fits in context, usually under 1.5M tokens. Long-form synthesis. Clients standardised on Vertex AI for procurement reasons.

When We Don't

Tool-call-heavy agents, Sonnet still leads our evals there. Workloads where the long-context advantage doesn't pay back the per-token cost on the actual queries (most are still short).

Paiteq Pattern

Default for long-document Q&A where the customer would rather pay the per-token cost than maintain a retrieval pipeline. We'll talk you back to RAG if the corpus grows past ~1.5M tokens.

2M contextVisionVertex

Qwen 3 / specialised open

Strengths

Strong Chinese / Asian-language performance, competitive coding benchmarks, permissive licence. Sizes from 0.5B to 72B cover edge-to-cloud. Aggressive on multilingual workloads.

When We Pick

Multilingual workloads with heavy Chinese / Japanese / Korean traffic. Code-heavy domains where Qwen-Coder lifts pass-rate over Llama 4. Self-hosted multilingual chat where Llama's English bias hurts.

When We Don't

English-first SaaS. Workloads where the licence diligence isn't worth the lift for marginal gain over Llama 4.

Paiteq Pattern

We reach for Qwen on 1 in 8 builds, usually multilingual SaaS expanding into APAC or enterprises with code-search workloads.

MultilingualCodePermissive

Two patterns worth flagging. First, we benchmark three models against the eval set before locking the stack, usually Sonnet, GPT-5, and one open-weights candidate (Llama 4 70B or Qwen 3). The eval set decides, not the leaderboard, not the demo video. Second, we default to a two-tier routing stack on Production Builds, flagship for hard queries, cheap-tier for easy ones, classifier router in the middle. The 40–70% cost cut almost always pays back the routing complexity. Skip routing only when traffic is too low for the savings to matter or when the workload is uniformly hard. Our deeper take on hosted vs self-hosted economics lives in our hosted-vs-self-hosted analysis.

004 / WHERE LLMs SHIP

Where LLMs deliver, capability × industry.

Capability rows × industry columns. Cell strength reflects production volume in our work, not theoretical fit. Empty cells mean we either haven't shipped it yet or the workload didn't justify an LLM.

Capability Industry

B2B SaaS

Fin-tech

Health-tech

Legal

Mfg

E-comm

Ed-tech

Logistics

Custom LLM apps

Internal copilots

Doc extraction

Voice / multimodal

Classification

Fine-tuned models

Custom LLM apps

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Internal copilots

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Doc extraction

B2B SaaSFin-techHealth-techLegalMfgE-commLogistics Ed-tech

Voice / multimodal

B2B SaaSFin-techHealth-techE-commEd-techLogistics LegalMfg

Classification

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Fine-tuned models

B2B SaaSFin-techHealth-techLegalMfgEd-tech E-commLogistics

Possible fit Good fit Primary vertical

Heaviest columns: fin-tech, health-tech, legal. The pattern isn't surprising, those are the industries where structured doc extraction, regulated workflows, and high-stakes refusal pay back hardest. Lightest column: ed-tech, where workloads tilt toward generation and personalisation more than analysis. The grid isn't a roadmap; it's a record. If your industry's column looks thin and your use case sounds promising, that's often where the most interesting llm consulting engagements come from, fewer prior comparisons, more white space.

005 / DEPLOY

Hosted, self-hosted, or hybrid, pick the right deployment.

The most common scoping question on any llm development services engagement. The answer depends on residency rules, steady-state volume, and workload mix. Walk the picker; it'll get you to one of five recommendations in two or three questions.

Question

Pick one

Most clients land on hosted with a routing layer, Sonnet plus GPT-5 mini in a two-tier stack, or GPT-5 plus mini, depending on procurement constraints. About a quarter end up self-hosted, almost entirely health-tech and finance with hard residency rules. Hybrid is the rarer path but real for clients with mixed regulated and non-regulated traffic. Our piece on hosted-vs-self-hosted economics walks through the break-even math; below ~100k requests/day, hosted almost always wins on TCO.

006 / PATTERNS

Four LLM application architectures we ship.

Most production llm application development collapses onto one of these four shapes. The shape decides where the eval gates land, where the cost lives, and where the failure modes are.

Single-prompt app

The simplest production shape. A stable system prompt, prompt caching for cost, structured output for reliability, one model. Most internal copilots that don't need state or tool-use start here and stay here.

Pick when

Single-turn or stateless multi-turn
output is text or a constrained JSON schema
latency budget allows one round-trip
cost is bounded by per-request length, not session state.

Skip when

Multi-step reasoning that benefits from chained models
tool use
long-running sessions where state matters
workloads where each call shape varies enough that caching doesn't hit.

Stack

Claude Sonnet 4.6 or GPT-5Prompt cachingStructured output (Pydantic / Zod schemas)Llama Guard

In practice, most production stacks compose two of these patterns. A routed two-tier stack with the flagship tier itself being a prompt chain is common. A single-prompt app on the flagship tier plus a fine-tune adapter on the cheap tier shows up on the high-volume self-hosted side. The right composition isn't pre-decided; it falls out of the eval set and the cost model during weeks 3–5 of a Build. We'll talk you out of patterns that look interesting but don't match your workload, premature complexity is the most common llm development mistake we see in audits.

007 / EVAL

Four gates on every production LLM app.

Eval-first isn't a slogan; it's a build-order decision. The eval set lands in week 2, before any model is picked. You can't choose a model without a way to measure what "good" means on your workload.

All four gates green before any production wire-up. If one's amber, we rework it in place; if it's red, we re-baseline the model choice or the prompt structure. The gates are the most important part of our llm development services, they're what stops 'feels good in the demo' from shipping to production.

01 Task score

≥94%

App-specific grader on a 30–80-example domain-expert-graded eval set. Lands in week 2 before any model selection. The eval set is your most important artifact; we facilitate, your domain expert grades.

If <90%, we re-baseline against a different model tier or revisit the prompt. We've never shipped below 90% on the eval set, full stop.
02 Hallucination

<2%

LLM-as-judge (Claude Sonnet 4.6) checking whether every claim is supported, plus human spot-check on the disputed 5%. For grounded apps with RAG, faithfulness is the equivalent metric.

If ≥5%, we widen the refusal threshold and rerun. Refusal is a feature, not a failure mode, confident wrong answers are the worse outcome.
03 P95 TTFT

<800ms

Time to first token, measured across the router and any pre-generation safety classifier. Streaming UX target. Voice apps target sub-400ms; voice runs on a leaner stack.

If breached on the production traffic shape for >72h, we tune routing thresholds, swap to a faster model on the slow tier, or move to streaming generation where we weren't already.
04 P50 cost per request

Modelled at discovery

Per-request cost tracked weekly post-launch via Langfuse. Modelled during the Pilot using expected traffic shape; surprise bills aren't a surprise because the modelling lands in week 3.

If cost drifts >25% over the modelled baseline for two weeks, we audit the router thresholds and the system-prompt cache hit rate, usually one of those two is the culprit.

The four gates above are the floor. For specific workloads we add more: refusal rate (what fraction of queries the model declines, and how that tracks against the adversarial eval set), citation accuracy (for grounded apps, see our grounded retrieval pipelines), format compliance (for structured-output apps, % of outputs that parse against the Pydantic / Zod schema first try). Add metrics only when the workload demands them, gate proliferation slows iteration without lifting quality.

008 / COST

LLM cost engineering, model routing, caching, batching.

Cost engineering is where most of the practical ROI on a production LLM build lives. Five levers cover ~90% of the savings we ship.

Model routing

A classifier (often the cheap-tier model itself) sends easy queries to GPT-5 mini or Haiku; hard queries to Sonnet or GPT-5. Most apps see 60–75% of traffic route to the cheap tier with quality holding. Typical cut: 40–70% of token spend.

Prompt caching

Anthropic prompt cache hits run 80–90% on stable system prompts; OpenAI's automatic cache catches the equivalent on chat completions. We design system prompts to maximise cacheable prefixes from day one. Typical cut: 70–85% of input-token cost on the cached prefix.

Batch API

For non-interactive workloads, overnight summarisation, batch classification, embedding regeneration, the OpenAI Batch API and Anthropic's batch tier knock ~50% off list at the cost of higher latency (up to 24h). Typical cut: 50% on batch-eligible traffic.

Context windowing

Trim the input. Summarise prior turns; retrieve only the chunks that score above a threshold; cap response length aggressively when the format doesn't need more. Boring engineering work that quietly cuts the per-call cost. Typical cut: 15–30% of token spend.

Self-host at break-even

Above ~1M requests/day on a single workload, self-hosted Llama 4 on vLLM with continuous batching usually flips the economics. The trigger isn't a slogan, it's where the dedicated A100 pool plus ops overhead crosses the hosted bill. Typical cut at break-even: 60–80% of hosted spend.

The order matters. Routing and caching together usually deliver 60–80% of the savings before any other lever fires. Batching catches the next chunk for the workloads that fit. Context windowing is the boring engineering work that compounds. Self-hosting is the heavy lever that only fires when volume justifies it, premature self-hosting is the most common cost-engineering mistake we see in audits. Our deeper take lives in our LLM cost engineering deep dive.

009 / OBSERVABILITY

LLM observability, what we instrument.

Production LLM apps fail differently than other production systems. The traces have to capture cost, quality, and latency together, and feed back into the eval set every week.

We default to Langfuse as the trace store on most builds; Helicone when the team wants a thinner proxy-only setup; LangSmith when LangChain is already the orchestration layer. The tool matters less than what gets instrumented. Every production call captures: model name, prompt hash, cache hit / miss, input tokens, output tokens, cost, p95 latency contribution, eval-grader score (sampled), and a user / team / tenant tag for downstream cost attribution.

Production traces feed the eval set monthly. We sample ~1% of production calls into a review queue; the domain expert grades a subset; the graded examples land in the eval set as regression cases. This is what stops the slow drift that hits most LLM apps 90 days post-launch, a model upgrade or a prompt edit subtly changes the failure mode, and without sampled feedback you don't notice until users complain. Regression alarms fire on any 5-point drop on any of the four gates.

For high-stakes workloads we add a second layer: per-tenant cost budgets enforced at the proxy, per-user rate limits to catch runaway scripts, and an alert when the cache hit rate drops more than 10 points in a 24h window (usually the canary for a system-prompt edit that broke caching). About a quarter of LLM App Audits we run trace problems back to observability gaps, you can't fix what you can't see.

010 / FINE-TUNING

Fine-tuning, when it's worth it.

Llm fine tuning services live at a specific decision point: when prompting and retrieval don't cover the workload but a small domain dataset does. Most engagements don't need fine-tuning; the ones that do tend to be obvious.

We support three fine-tuning shapes. LoRA on open-weights bases (Llama 4, Mistral, Qwen) for domain-vocab and output-format problems, usually 1,500–10,000 examples, single A100 80GB, 8–24 hours of training time, ~$300–$1,200 per run. QLoRA for the same shapes when memory is constrained, with a small quality trade-off versus full LoRA on 70B models. Hosted fine-tunes on OpenAI or Anthropic when the base model has to stay frontier-class and the dataset is small enough to be cost-effective at hosted-fine-tune pricing.

The order is always the same. Baseline the prompt-only path against the eval set. If the baseline is good enough, ship it, about a third of fine-tuning engagements end here, and that's a successful outcome. If the baseline is not good enough, build the fine-tuning dataset (this is usually the bottleneck, not the training run), run the fine-tune, score against the same eval set. Deploy only if the fine-tune wins on the eval set with a margin that justifies the iteration cost. We've shipped engagements where the fine-tune lost by 1–2 points and we still recommended sticking with prompt-only because the iteration overhead wasn't worth the marginal quality.

A specific real example: a health-tech client's clinical-note structuring task. GPT-5 baseline scored F1 ~71 on the clinician-graded eval set. We benchmarked a QLoRA on Llama 4 70B trained on 8,400 redacted notes, F1 ~85, plus the data stayed in-cloud and per-call cost dropped 60%. The fine-tune shipped. Same shape, different client, finance domain: baseline scored 89, fine-tune scored 91. We didn't ship, the 2-point lift didn't justify the operational ownership of an adapter. Most fine-tuning decisions are this kind of close call; the eval set has to be the tiebreaker, not the wishful thinking.

011 / PROMPT

Prompt engineering at scale.

Prompt engineering services on production LLM systems are mostly version control, eval correlation, and cost discipline, not clever phrasing. The cleverness happens in the eval set design.

Production prompts live in version control, not in a notebook or a CMS. Every prompt change ships with a commit message, a diff, and an eval-set re-run. If the eval scores drift, we know which prompt edit caused it, usually within a few hours of the change landing. The prompt versioning pattern we ship on most Production Builds keeps the active system prompt as a git artifact with a SHA pinned in the deploy config; rollback is a config change.

For prompt patterns themselves: structured-output enforcement via Pydantic or Zod schemas catches more than chain-of-thought tricks on most production workloads, the model can't drift if the output has to parse. Few-shot examples live in a separate corpus that the prompt builder injects at request time; this lets us A/B example sets without touching the system prompt. Caching prefixes are designed in from day one, anything stable goes at the top of the prompt; anything dynamic goes after, where it doesn't bust the cache. System-prompt isolation prevents user content from being interpreted as instructions; we wrap user input in unambiguous delimiters and tell the model so explicitly.

We ship our prompt engineering services as part of the Build, never as a standalone, prompting without eval is just creative writing. For teams that want the prompts handed off cleanly, every Production LLM Build ends with a documented prompt library plus a small prompt-versus-fine-tune decision memo covering when to escalate which knob next.

012 / PROCESS

How a build runs, eval-first, every time.

The same six-step process runs across a 4-week Pilot and a 16-week Production Build. The gates change in depth, not in shape. Every step has a deliverable, a named owner, and a gate criterion, pass or rework.

WEEK 1

Discovery

Workload shape, eval surface, cost target, residency posture. Models aren't picked yet, that's week 3.

WEEK 2

Eval set

30–80 domain-expert-graded examples covering main paths and edge cases. Lands before any prompting.

WEEK 2–4

Baseline

Three to four models scored against the eval set. Cost-adjusted quality wins, not benchmark theatre.

WEEK 4–8

Iteration

Prompt, routing, retrieval (if RAG), or fine-tune (if the data justifies it). Each change re-scored.

WEEK 8+

Deploy

Auth, rate-limit, Langfuse observability, model fallback, cost guardrails, regression alarms.

ONGOING

Running

Weekly eval, drift alarms, prompt iteration log, model-upgrade regression checks. The eval set grows.

013 / VS

Hosted versus self-hosted, the side-by-side.

A reference table covering the practical trade-offs. The picker above gets you a recommendation; this table gives you the numbers to defend it in a procurement conversation.

	Hosted (provider)	Self-hosted (yours)
Hosted (OpenAI / Anthropic / Vertex)	Fast to ship, top-of-class quality, per-token pricing, no inference infra to run
For most teams below ~500k requests/day, hosted is strictly faster to ship and cheaper to operate. No GPU cluster to provision, no CUDA dependency hell, no on-call rotation for inference infra. The switching cost if you later need self-hosted is a model swap, not a rewrite.
Self-hosted (Llama / Mistral / Qwen on your cloud)		Data stays in your cloud, fixed infra cost, model under your control, custom quantization + adapters
Self-hosting is the only option when a data-residency rule forbids third-party inference, common in EU healthcare (GDPR + national health-data law), defence-adjacent contracts, and financial services with ring-fenced data mandates. It's also the only path to QLoRA/LoRA fine-tunes that you own outright.
Best when	Quality > cost; no residency rule; team has no ops capacity	Regulated data; very high volume; or strategic need to own the model
Latency floor	Provider-dependent; rarely below 600ms TTFT for flagships	Sub-200ms TTFT achievable with vLLM continuous batching
vLLM's continuous batching scheduler saturates GPU memory instead of waiting for fixed batch boundaries, at typical QPS, p50 TTFT on a single A100 running Llama 3 70B sits around 130–180ms. Provider network round-trips alone add 80–150ms before a single token is returned; flagship models (GPT-5, Claude Opus) sit materially above 600ms TTFT at launch load.
Per-1K-tokens cost	$0.30–$15 input/output (flagships); $0.05–$0.50 (small models)	$0.05–$0.40 amortised on dedicated A100s once volume crosses the break-even
The ranges genuinely overlap. At low volume the amortised A100 cost looks great on paper but the fixed reservation cost ($10–25k/mo per A100 on-demand) dominates unless utilisation is high. At very high volume the per-token numbers flip, but the real lever is utilisation rate, not raw token price.
Operational load	Low, provider runs inference, you run the app	Higher, you (or we) run GPU pools, batching, quantization, autoscaling
Break-even point	<100k requests/day, hosted almost always wins	>1M requests/day on a single workload, self-hosted economics flip

Full breakdown, when to pick which

014 / USE CASES

Where teams have shipped.

Six anonymised engagements across recent quarters. Workloads, segments, and outcome metrics are real; brand removed under NDA.

Internal copilot

Fin-tech · 800+ emp

Internal LLM assistant on Claude + private corpus

Slack-deployed advisor that pulls from a redacted Confluence + Snowflake corpus. Refuses cleanly on out-of-corpus. Auth scoped per team; Langfuse traces every call; cost capped per-user per-day. The eval set is graded monthly by two senior analysts.

0 %

internal-knowledge tickets

Fine-tune

Health-tech SaaS · 50–200 emp

QLoRA fine-tune on Llama 4 for clinical note structuring

8,400 redacted clinical notes, QLoRA fine-tune on a single A100 80GB, eval against a clinician-graded reference set. Self-hosted on the client's AWS via vLLM. We benchmarked GPT-5, Sonnet, and the fine-tune; the fine-tune won on F1 and on per-request cost simultaneously.

F1 + points over GPT-5 baseline

Cost engineering

B2C app · 1M+ MAU

Model routing + prompt caching on a GPT-5 stack

A small intent classifier routes ~70% of traffic to GPT-5 mini and ~30% to GPT-5. Anthropic-style prompt caching on the stable system prompt cut the cached portion by 85%. Eval gates guard the routing thresholds, quality on both tiers tracked weekly.

0 %

token spend , p95 unchanged

Voice

Health-tech · enterprise

Voice intake on GPT-5 Realtime, sub-400ms p95

Patient intake voice agent on GPT-5 Realtime API with a custom safety filter pre-generation. Voice-RAG pulled from a HIPAA-compliant Postgres index running BGE embeddings on the same VPC. We sit at p95 ~360ms TTFT; clinical reviewer signed off on faithfulness at 96%.

0 %

p95 360ms TTFT, faithfulness

Doc extraction

Logistics · 200+ emp

Bill-of-lading extraction with GPT-5 Vision + structured output

Scanned BoLs across 14 carriers. GPT-5 Vision extracts to a Pydantic schema with structured output, validated against a rules engine, and posted into the TMS. Carrier-specific layouts get few-shot examples; the rare edge case routes to human review with the model's draft attached.

OCR-to-TMS 14m → 90s per BoL

Multilingual

Ed-tech · APAC

Qwen 3 72B for Chinese-first tutor copilot

Self-hosted Qwen 3 72B on a 4×A100 pool for a Chinese / English bilingual tutoring app. We benchmarked it against Llama 4 70B and GPT-5; Qwen won on Chinese fluency and cost, GPT-5 won on English. The app routes by detected language at the prompt layer.

0 %

Bilingual quality parity, cost cut

015 / CONSULTING

LLM consulting services, advisory engagements.

Sometimes the right answer isn't "build the app." Our llm consulting services cover the strategic decisions that need to land before any code ships, and occasionally the engagement ends "don't build, buy this off-the-shelf tool instead."

Model selection audit

Two-week engagement: workload audit, eval-set design, head-to-head benchmark of 3–5 candidate models, build-vs-buy-vs-fine-tune memo.

LLM cost projection

TCO model over 12 months: hosted vs self-hosted, routing scenarios, prompt-cache assumptions, traffic-growth sensitivities. Lands in 1–2 weeks.

Provider roadmap

OpenAI vs Anthropic vs Vertex vs open-weights, with a procurement + risk lens. Useful when leadership needs the trade-offs on paper.

Build-vs-buy assessment

Custom app vs SaaS vs hybrid. Honest stop-recommendation when an off-the-shelf tool covers your workload, that's about 1 in 5 consulting engagements.

Llm consulting is what we run when the question is "should we build this?" not "how do we build this?", a 1–2 week engagement with a written memo, sometimes a benchmark, sometimes a TCO model. About one in five consulting engagements ends with "don't build, here's the SaaS that does this," which is a successful outcome we sometimes have to talk new clients into believing we mean. For broader strategic AI work that spans LLM + RAG + agents, see our strategy advisory practice.

016 / ENGAGE

Four ways to start.

01 LLM App Pilot Fixed scope

2–4 weeks

Pilot one workload, intake to live.

In scope

One scoped use case
Eval set (30–80 examples)
3–4 model baseline scored
Working prototype
Demo + build-or-stop memo

Out of scope

Production deploy
Fine-tuning
Multi-workload scope

02 Production Build Fixed scope

8–16 weeks

Full LLM app with eval gates.

In scope

All Pilot deliverables
Auth · rate-limit · observability (Langfuse)
Model routing + cost guardrails
Eval gates baked into the deploy pipeline
Adversarial eval + safety classifier
Four weeks of post-launch iteration

03 Fine-tuning Fixed scope

6–10 weeks

LoRA / QLoRA / hosted fine-tune.

In scope

Dataset curation + cleaning
Eval set + prompt-only baseline
Fine-tune run + eval comparison
Deploy if it wins; recommend prompt-only if not
Weights + adapters + runbook transferred

04 LLM App Audit Fixed scope

2–3 weeks

Eval, cost, latency, safety review.

In scope

Coverage audit on current eval
Cost + latency profiling
Adversarial / jailbreak test
Prioritised fix-list + follow-on quote

017 / FAQ

Common LLM development questions.

Hosted (Claude / GPT) or self-hosted (Llama), how do we decide?

Default to hosted unless you have one of three triggers: a residency rule on the data, very high steady-state volume (above ~1M requests/day on a single workload), or a strategic need to own the model weights. Self-hosting is a meaningful operational commitment, GPU pools, autoscaling, quantization tuning, batching, and we'll only recommend it when the math actually works.

The interactive picker above walks through the decision in 2–3 questions. Most clients we see end up on hosted with an Anthropic Sonnet baseline and GPT-5 mini as the cheap-tier in a routed stack; about a quarter end up self-hosted (mostly healthcare and finance with strict residency). Hybrid is rare but real for clients with mixed regulated and non-regulated traffic. Our piece on hosted vs self-hosted LLMs covers the break-even math in detail.

When does fine-tuning beat prompt engineering and RAG?

Fine-tuning wins on three specific shapes. One: when the output format is so consistent that prompting struggles (a custom JSON schema, a regulatory document format, code in a niche DSL). Two: when the domain language has terms the base model fumbles, clinical jargon, legal idioms, internal product vocabulary that the base model never saw enough of. Three: when latency or cost have to drop and you've exhausted the prompting and routing levers.

Prompt + RAG usually wins outside those three shapes. Fine-tunes are slower to iterate on, harder to debug, and they don't compose with citations the way RAG does. We baseline prompt-only and prompt+RAG before we propose a fine-tune; if either of those clears the eval bar, the fine-tune budget gets spent on the eval set instead. Our llm fine tuning services live inside the Fine-tuning engagement shape above, about a third end in "the fine-tune didn't beat the baseline, so we shipped prompt-only," and that's a win, not a loss.

Composing them is common too: RAG for facts, a small LoRA fine-tune for output style and domain vocab. We scope both at week 2 if the use case has both shapes. The picker on our grounded retrieval pipelines page covers the RAG vs fine-tune decision tree in more depth.

How do you measure LLM application quality?

Four metrics, scored separately because they fail differently:

Task score, app-specific grader on a domain-expert-graded eval set (30–80 examples). The eval set is the most important deliverable of the engagement; your domain expert grades, we facilitate.
Hallucination rate, claims with no factual support, scored by LLM-as-judge (Claude Sonnet 4.6) with human spot-check on the disputed cases. Hard gate before production.
P95 TTFT, full time to first token including any pre-generation safety filter. Voice apps target sub-400ms; chat apps sub-800ms.
P50 cost per request, modelled at discovery, tracked weekly post-launch via Langfuse. Surprise bills aren't a surprise.

The default eval stack is Inspect AI as the harness, RAGAS for any retrieval-adjacent metric, custom graders for app-specific scoring, and Langfuse for production trace sampling that feeds the eval set monthly. Regression alarms fire when any metric drops more than 5 points on a model upgrade. Our piece on eval framework comparison covers when to reach for which tool.

How do you handle prompt injection and jailbreaks?

Defence in depth. Input classifier (Llama Guard or a custom classifier) runs pre-generation on user input; structured-output enforcement constrains what the model can return; system-prompt isolation prevents user content from leaking into instructions; output filter classifies pre-response. Every production deploy ships with a documented threat model and an adversarial eval set, 30–80 known jailbreak attempts plus domain-specific abuse patterns.

We don't claim "unbreakable" because that's not a real thing. We claim measured, we know our refusal rate on the adversarial set, we know our false-positive rate on the benign set, and we know the trade-off we're holding. For high-stakes workloads (clinical, financial advice, legal) we add a second classifier post-generation and a human-in-the-loop for refused-but-disputed cases. Our customer-side budget for safety work is usually 10–15% of the build effort; trying to do this cheaper is the cheap way to ship the wrong thing.

What about cost runaway, how do you prevent surprise bills?

Five mechanisms layered together. Per-request cost budgets: hard caps that refuse the request rather than emit it when the prompt is bigger than expected. Rate limits per user, per team, per endpoint, usually configured via Langfuse or a thin proxy. Model routing: easy queries to GPT-5 mini or Haiku, hard queries to flagships, controlled by a small classifier. Prompt caching on stable system prompts and any prefix that repeats, Anthropic's cache hits often run 80–90% on enterprise system prompts. Batch API for non-interactive workloads, which knocks ~50% off list at the price of higher latency.

Cost is modelled in week 3 of any engagement using the actual expected traffic shape, not a marketing average. Once live, weekly cost reviews catch drift before it compounds; the regression alarm trips on a 25% drift over two weeks. We've seen "surprise" bills usually trace to one of three things, a router threshold miscalibration, a cache hit rate that quietly collapsed after a system-prompt edit, or a runaway agent loop that didn't have a step-budget. All three are observability-detectable if you instrument the right things on day one.

Can you migrate us from one model provider to another?

Yes. Provider migration is part of about a quarter of our LLM App Audit engagements. The pattern is dual-write first (calls go to both providers for a sample of traffic), eval parity checks against your real eval set, then read-cutover with a fast rollback path. We've done OpenAI → Anthropic, Anthropic → Vertex, and hosted → self-hosted migrations with zero downtime when the abstraction was clean.

The abstraction matters more than the migration. We bake a thin provider-abstraction layer into every Production Build for exactly this reason, model swaps shouldn't require touching app code. If you're locked into a vendor SDK now, the migration scope grows because we have to abstract first, then migrate; that's a 6–10 week engagement instead of 3–4. Either way, the eval set decides whether the migration ships, not vendor benchmarks.

Who owns the prompts, fine-tuned weights, and the eval set?

You do. All artifacts transfer into your repository under the SOW: system prompts, few-shot examples, eval set, fine-tuned adapter weights, the Langfuse instance configuration, the runbook. We retain no rights to your prompts, weights, or data. Paiteq keeps engineering learnings, patterns, methodologies, anonymised case-study takeaways for our internal playbook, but never your specific artifacts.

This matters more than people realise on a first build. We've onboarded several clients whose previous vendor "owned" the prompts as IP, which made provider migration impossibly expensive. We refuse that pattern. Your business knowledge lives in the eval set and the prompts; treating either as vendor IP would be malpractice.

What's a realistic budget and timeline for production LLM development services?

The four engagement shapes above are fixed-scope and fixed-duration; we hold the price band on the contact call because workload depth, residency posture, and integration count swing the budget meaningfully. Rough order of magnitude:

LLM App Pilot (2–4 weeks): small enough that stopping at the pilot is a real option. About one in three pilots end at the pilot because the eval surface wasn't measurable or the workload turned out to be a generation problem better served by a generative AI engagement.
Production LLM Build (8–16 weeks): the bulk of our llm development services revenue. Includes four weeks of post-launch iteration baked into the SOW.
Fine-tuning engagement (6–10 weeks): includes the head-to-head against the prompt-only baseline. We've shipped engagements that ended "the fine-tune didn't beat the baseline, so we shipped prompt-only", that's a successful outcome.
LLM App Audit (2–3 weeks): outputs a prioritised fix-list and a fixed-scope follow-on quote if you want us to ship the fixes.

For llm consulting services (model selection audits, cost projection, provider roadmap), 1–2 week engagements at a flat fee. The full breakdown lives in our strategy advisory practice for cross-service strategic work.

Where LLM application development connects.

The honest sequencing: LLM app on its own, then RAG development services when the answers need to be grounded in your data, then our AI agent development company practice when the workflow has more than one decision. The cost-engineering layer underneath all three (provider routing, caching, fallback) lives in AI integration services. The serving + observability + drift layer underneath production deployment lives in MLOps services, with the upstream classical ML side in machine learning development services. And the strategic framing of which custom LLM app to build first usually starts in AI consulting services.

For deeper operator context on agent orchestration patterns, the multi-agent orchestration patterns post walks the supervisor / hierarchical / mesh decision. Regulated-industry buyers: AI healthcare software development is the most common entry shape (PHI handling, BAA-backed deployment, citation enforcement on clinical answers); AI for fintech LLM apps are a close second (model-risk-aware deployments, SR 11-7 audit trails), with custom AI insurance development LLM apps a strong third (claims-narrative summarization, regulator-defensible drafting), and logistics software development company LLM apps fourth (exception-narrative drafting, carrier-doc summarization). Broader context: the Paiteq engineering practice and the founder behind it, Navin Sharma. Legacy-estate buyers should sequence through AI migration services before scoping new LLM apps.

018 / Related practices

Adjacent services.

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

019 / Start a project

Let's ship the LLM app.

Pilot in 2–4 weeks. Production build in 8–16. Fine-tune in 6–10. Audit in 2–3.

Talk to engineering Architecture review

Production llm development services on Claude · GPT-5 · Llama 4.

Eight LLM workloads we ship.

LLM development services from an LLM development company, pick where to start.

Models, when each one wins.

Where LLMs deliver, capability × industry.

Hosted, self-hosted, or hybrid, pick the right deployment.

Four LLM application architectures we ship.

Single-prompt app

Prompt chain

Model routing

Fine-tune + adapter

Four gates on every production LLM app.

LLM cost engineering, model routing, caching, batching.

Model routing

Prompt caching

Batch API

Context windowing

Self-host at break-even

LLM observability, what we instrument.

Fine-tuning, when it's worth it.

Prompt engineering at scale.

How a build runs, eval-first, every time.

Discovery

Eval set

Baseline

Iteration

Deploy

Running

Hosted versus self-hosted, the side-by-side.

Where teams have shipped.

Internal LLM assistant on Claude + private corpus

QLoRA fine-tune on Llama 4 for clinical note structuring

Model routing + prompt caching on a GPT-5 stack

Voice intake on GPT-5 Realtime, sub-400ms p95

Bill-of-lading extraction with GPT-5 Vision + structured output

Qwen 3 72B for Chinese-first tutor copilot

LLM consulting services, advisory engagements.

Model selection audit

LLM cost projection

Provider roadmap

Build-vs-buy assessment

Four ways to start.

Pilot one workload, intake to live.

Full LLM app with eval gates.

LoRA / QLoRA / hosted fine-tune.

Eval, cost, latency, safety review.

Common LLM development questions.

Where LLM application development connects.

Adjacent services.

Let's ship the LLM app.