Production llm development services on Claude · GPT-5 · Llama 4.
Paiteq is an llm development company shipping custom language model applications with evaluation built in — hosted or self-hosted, with auth, observability, cost guardrails, and model routing wired before the first user lands. Not a notebook with a chat box.
Eight LLM workloads we ship.
Each surface below is a workload we've shipped to production — with the eval methodology, the model choice, and the cost engineering already worked out.
LLM application development sorts cleanly by workload shape, not by industry. A clinical-note copilot for a 50-person health-tech reuses the same eval methodology, the same model-routing logic, and the same observability stack as a sales-research copilot for an 800-person fin-tech. The integrations differ, the residency posture differs, the prompts differ — but the engineering shape doesn't. Sorting by workload is what lets us reuse the eval harness, the prompt-versioning patterns, and the cost-monitoring playbook across clients instead of reinventing them every engagement.
Heavier surfaces this year: internal copilots, doc extraction, model routing. Lighter: multimodal voice, where the latency budget rules out about half of candidate workloads, and fine-tuned models, where the data prep cost stops many engagements at "the baseline already covers it." We'll talk you out of an LLM-shaped engagement if the workload is actually a retrieval problem — better fit for grounded retrieval pipelines — or an autonomous task-execution problem — better fit for agent loops with state. About 15% of inbound LLM development services inquiries get redirected before the first call.
LLM development services from an LLM development company — pick where to start.
Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done.
Choosing an llm development company is mostly choosing the right starting shape. Buyers who come in with a scoped use case and a way to measure success ship to production around 80% of the time. Buyers who come in with "we want LLM somewhere in the product" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often — pick the one that matches what you have, not what you wish you had.
If the workload is scoped but the model quality is unproven, start with a Pilot. If you know the workload works and you need production discipline (observability, eval gates, routing, deploy), start with a Production LLM Build. If you've already shipped LLM and it's underperforming or surprise-billing, start with an Audit. If the data tells you a fine-tune is justified, start with a Fine-tuning engagement — but the audit usually comes first because the eval set has to exist before fine-tuning can be evaluated. Week-by-week scope on each is further down the page.
Models — when each one wins.
Model choice follows workload, not house preferences. Cost-adjusted quality against your eval set decides — every time, no exceptions.
- Claude
- GPT-5
- GPT-5 mini
- Gemini
- Llama 4
- Mistral
- Qwen
- vLLM
- TGI
- Modal
- Replicate
- Langfuse
- Helicone
- Inspect AI
- Llama Guard
- Presidio
- Claude
- GPT-5
- GPT-5 mini
- Gemini
- Llama 4
- Mistral
- Qwen
- vLLM
- TGI
- Modal
- Replicate
- Langfuse
- Helicone
- Inspect AI
- Llama Guard
- Presidio
For each model: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production llm development on every model below. The "when we don't" lines come from real builds — usually a moment in week 4 where the eval set told us to swap.
Strongest tool-call accuracy in our eval set across 2025–2026. Excellent at structured output. Prompt caching cuts cost ~80% on stable system prompts. Vision is competitive with GPT-5 on OCR-grade extraction.
Default for agentic workloads where the model holds state and calls tools. Default for any app that streams JSON. Default when prompt caching unlocks meaningful spend reduction (long stable system prompts).
Workloads that need a 128k+ context window with reliable recall — Gemini 3.0 Pro wins there. Hyper-cost-sensitive batch workloads where Haiku or GPT-5 mini covers the quality bar.
Our day-one baseline on most production LLM development services. About half the apps we ship in 2026 run on Sonnet plus prompt caching.
Strongest multimodal — vision, voice, real-time bidirectional audio. Latency-tuned for streaming UIs. Robust function-calling. Batch API at ~50% off list for non-interactive workloads.
Multimodal workloads, voice apps via the Realtime API, and any client with an existing OpenAI contract where procurement won't add Anthropic. Batch summarisation pipelines.
Pure tool-call agentic workloads — Sonnet wins our evals there. Apps with strict refusal policies — GPT-5's safety surface is looser than Claude's by default.
Default for voice and vision pipelines. We pair it with GPT-5 mini as the routing fallback on most production deploys.
10–20× cheaper per token than the flagships. Latency 2–3× faster. Good enough for ~70% of routing-tier traffic on most apps once a router classifier is in place.
As the cheap-tier in a routed stack. Easy classification, short summarisation, slot-filling extractors. Anywhere the eval set says the small model holds quality.
Hard reasoning, multi-step tool use, long context with citation. We've seen too many builds try to push these down to mini and watch task scores drop 15–25 points.
Roughly two-thirds of cost engineering work is just plumbing in a router that sends easy queries here. Spend cuts of 40–70% are common, quality holds when the eval set guards the routing thresholds.
Open weights. Data never leaves your perimeter. Fixed infra cost beats per-token billing above ~2M requests/day on dedicated GPUs. Tunable — quantization, LoRA adapters, custom vocab.
Regulated data with hard residency rules (healthcare PHI, finance MNPI, EU AI Act high-risk workloads). Very high-volume batch workloads where the math flips. Anywhere the team has the ops capacity to run an inference service.
Mixed-quality workloads where a flagship's edge matters per call. Teams without Kubernetes / GPU ops capacity — the operational tax shows up around month two.
We run Llama 4 70B on vLLM with A100 80GB nodes for residency-constrained clients. Sub-200ms TTFT achievable with continuous batching. ~$0.08 per 1K output tokens amortised at the workloads we see.
2M-token context window with usable recall. Strong multimodal — video and document understanding at length. Aggressive pricing on long-context workloads. Native Google Cloud integration.
Document-stack workloads where the corpus fits in context — usually under 1.5M tokens. Long-form synthesis. Clients standardised on Vertex AI for procurement reasons.
Tool-call-heavy agents — Sonnet still leads our evals there. Workloads where the long-context advantage doesn't pay back the per-token cost on the actual queries (most are still short).
Default for long-document Q&A where the customer would rather pay the per-token cost than maintain a retrieval pipeline. We'll talk you back to RAG if the corpus grows past ~1.5M tokens.
Strong Chinese / Asian-language performance, competitive coding benchmarks, permissive licence. Sizes from 0.5B to 72B cover edge-to-cloud. Aggressive on multilingual workloads.
Multilingual workloads with heavy Chinese / Japanese / Korean traffic. Code-heavy domains where Qwen-Coder lifts pass-rate over Llama 4. Self-hosted multilingual chat where Llama's English bias hurts.
English-first SaaS. Workloads where the licence diligence isn't worth the lift for marginal gain over Llama 4.
We reach for Qwen on 1 in 8 builds — usually multilingual SaaS expanding into APAC or enterprises with code-search workloads.
Two patterns worth flagging. First, we benchmark three models against the eval set before locking the stack — usually Sonnet, GPT-5, and one open-weights candidate (Llama 4 70B or Qwen 3). The eval set decides, not the leaderboard, not the demo video. Second, we default to a two-tier routing stack on Production Builds — flagship for hard queries, cheap-tier for easy ones, classifier router in the middle. The 40–70% cost cut almost always pays back the routing complexity. Skip routing only when traffic is too low for the savings to matter or when the workload is uniformly hard. Our deeper take on hosted vs self-hosted economics lives in our hosted-vs-self-hosted analysis.
Where LLMs deliver — capability × industry.
Capability rows × industry columns. Cell strength reflects production volume in our work, not theoretical fit. Empty cells mean we either haven't shipped it yet or the workload didn't justify an LLM.
Heaviest columns: fin-tech, health-tech, legal. The pattern isn't surprising — those are the industries where structured doc extraction, regulated workflows, and high-stakes refusal pay back hardest. Lightest column: ed-tech, where workloads tilt toward generation and personalisation more than analysis. The grid isn't a roadmap; it's a record. If your industry's column looks thin and your use case sounds promising, that's often where the most interesting llm consulting engagements come from — fewer prior comparisons, more white space.
Hosted, self-hosted, or hybrid — pick the right deployment.
The most common scoping question on any llm development services engagement. The answer depends on residency rules, steady-state volume, and workload mix. Walk the picker; it'll get you to one of five recommendations in two or three questions.
Most clients land on hosted with a routing layer — Sonnet plus GPT-5 mini in a two-tier stack, or GPT-5 plus mini, depending on procurement constraints. About a quarter end up self-hosted, almost entirely health-tech and finance with hard residency rules. Hybrid is the rarer path but real for clients with mixed regulated and non-regulated traffic. Our piece on hosted-vs-self-hosted economics walks through the break-even math; below ~100k requests/day, hosted almost always wins on TCO.
Four LLM application architectures we ship.
Most production llm application development collapses onto one of these four shapes. The shape decides where the eval gates land, where the cost lives, and where the failure modes are.
Single-prompt app
The simplest production shape. A stable system prompt, prompt caching for cost, structured output for reliability, one model. Most internal copilots that don't need state or tool-use start here and stay here.
- Single-turn or stateless multi-turn
- output is text or a constrained JSON schema
- latency budget allows one round-trip
- cost is bounded by per-request length, not session state.
- Multi-step reasoning that benefits from chained models
- tool use
- long-running sessions where state matters
- workloads where each call shape varies enough that caching doesn't hit.
Prompt chain
Split a hard task into stages. Use a cheap model (GPT-5 mini, Haiku) for extraction and formatting; reserve the flagship for the reasoning step in the middle. Each stage has its own eval target; failure modes are localised.
- Hard reasoning task with cheap pre/post-processing
- structured input or output where the format work doesn't need the flagship's quality
- cost-sensitive workloads where the flagship is overkill at the edges of the chain.
- Tasks where the stages share too much context — the chain re-prompts the context at each call
- that gets expensive fast. Tool-call agentic workloads, where the chain shape forces unnatural rigidity.
Model routing
A small classifier (often the cheap-tier model itself) routes each query to the right tier. Easy queries go to GPT-5 mini or Haiku; hard queries go to Sonnet or GPT-5. Routing thresholds are guarded by eval gates on both tiers. The single highest-ROI cost engineering pattern in our playbook.
- Mixed-difficulty traffic above ~100k requests/day where the flagship-only bill becomes the constraint
- workloads where you have a measurable success metric on both tiers (eval gates can guard the threshold)
- long-running production apps where iteration on the router pays back.
- Pre-launch builds — you don't know the traffic shape yet. Workloads where the cheap-tier quality is consistently below threshold (just use the flagship). Tiny daily volume where the routing complexity costs more than the model spend it saves.
Fine-tune + adapter
A small LoRA adapter trained on 1.5–10k domain examples, stacked on a frozen base (Llama 4, Mistral, Qwen). Adapter weights are under 200MB and load in seconds at inference. Cheap to run, cheap to iterate, easy to roll back. Works for output-style and domain-vocab problems; doesn't replace RAG for facts.
- Domain vocabulary the base fumbles
- consistent output format the prompt can't reliably enforce
- high-volume self-hosted workloads where the per-call cost matters
- latency requirements that rule out long few-shot prompts.
- Small datasets (under ~800 clean examples) — the fine-tune won't move the needle. Workloads where the source of truth changes weekly — adapters go stale and re-training is overhead. Closed-source models where adapters aren't supported (you're stuck with hosted fine-tunes).
In practice, most production stacks compose two of these patterns. A routed two-tier stack with the flagship tier itself being a prompt chain is common. A single-prompt app on the flagship tier plus a fine-tune adapter on the cheap tier shows up on the high-volume self-hosted side. The right composition isn't pre-decided; it falls out of the eval set and the cost model during weeks 3–5 of a Build. We'll talk you out of patterns that look interesting but don't match your workload — premature complexity is the most common llm development mistake we see in audits.
Four gates on every production LLM app.
Eval-first isn't a slogan; it's a build-order decision. The eval set lands in week 2 — before any model is picked. You can't choose a model without a way to measure what "good" means on your workload.
All four gates green before any production wire-up. If one's amber, we rework it in place; if it's red, we re-baseline the model choice or the prompt structure. The gates are the most important part of our llm development services — they're what stops 'feels good in the demo' from shipping to production.
- 01 Task score≥94%
App-specific grader on a 30–80-example domain-expert-graded eval set. Lands in week 2 before any model selection. The eval set is your most important artifact; we facilitate, your domain expert grades.
If <90%, we re-baseline against a different model tier or revisit the prompt. We've never shipped below 90% on the eval set, full stop.
- 02 Hallucination<2%
LLM-as-judge (Claude Sonnet 4.6) checking whether every claim is supported, plus human spot-check on the disputed 5%. For grounded apps with RAG, faithfulness is the equivalent metric.
If ≥5%, we widen the refusal threshold and rerun. Refusal is a feature, not a failure mode — confident wrong answers are the worse outcome.
- 03 P95 TTFT<800ms
Time to first token, measured across the router and any pre-generation safety classifier. Streaming UX target. Voice apps target sub-400ms; voice runs on a leaner stack.
If breached on the production traffic shape for >72h, we tune routing thresholds, swap to a faster model on the slow tier, or move to streaming generation where we weren't already.
- 04 P50 cost per requestModelled at discovery
Per-request cost tracked weekly post-launch via Langfuse. Modelled during the Pilot using expected traffic shape; surprise bills aren't a surprise because the modelling lands in week 3.
If cost drifts >25% over the modelled baseline for two weeks, we audit the router thresholds and the system-prompt cache hit rate — usually one of those two is the culprit.
The four gates above are the floor. For specific workloads we add more: refusal rate (what fraction of queries the model declines, and how that tracks against the adversarial eval set), citation accuracy (for grounded apps, see our grounded retrieval pipelines), format compliance (for structured-output apps, % of outputs that parse against the Pydantic / Zod schema first try). Add metrics only when the workload demands them — gate proliferation slows iteration without lifting quality.
LLM cost engineering — model routing, caching, batching.
Cost engineering is where most of the practical ROI on a production LLM build lives. Five levers cover ~90% of the savings we ship.
Model routing
A classifier (often the cheap-tier model itself) sends easy queries to GPT-5 mini or Haiku; hard queries to Sonnet or GPT-5. Most apps see 60–75% of traffic route to the cheap tier with quality holding. Typical cut: 40–70% of token spend.
Prompt caching
Anthropic prompt cache hits run 80–90% on stable system prompts; OpenAI's automatic cache catches the equivalent on chat completions. We design system prompts to maximise cacheable prefixes from day one. Typical cut: 70–85% of input-token cost on the cached prefix.
Batch API
For non-interactive workloads — overnight summarisation, batch classification, embedding regeneration — the OpenAI Batch API and Anthropic's batch tier knock ~50% off list at the cost of higher latency (up to 24h). Typical cut: 50% on batch-eligible traffic.
Context windowing
Trim the input. Summarise prior turns; retrieve only the chunks that score above a threshold; cap response length aggressively when the format doesn't need more. Boring engineering work that quietly cuts the per-call cost. Typical cut: 15–30% of token spend.
Self-host at break-even
Above ~1M requests/day on a single workload, self-hosted Llama 4 on vLLM with continuous batching usually flips the economics. The trigger isn't a slogan — it's where the dedicated A100 pool plus ops overhead crosses the hosted bill. Typical cut at break-even: 60–80% of hosted spend.
The order matters. Routing and caching together usually deliver 60–80% of the savings before any other lever fires. Batching catches the next chunk for the workloads that fit. Context windowing is the boring engineering work that compounds. Self-hosting is the heavy lever that only fires when volume justifies it — premature self-hosting is the most common cost-engineering mistake we see in audits. Our deeper take lives in our LLM cost engineering deep dive.
LLM observability — what we instrument.
Production LLM apps fail differently than other production systems. The traces have to capture cost, quality, and latency together — and feed back into the eval set every week.
We default to Langfuse as the trace store on most builds; Helicone when the team wants a thinner proxy-only setup; LangSmith when LangChain is already the orchestration layer. The tool matters less than what gets instrumented. Every production call captures: model name, prompt hash, cache hit / miss, input tokens, output tokens, cost, p95 latency contribution, eval-grader score (sampled), and a user / team / tenant tag for downstream cost attribution.
Production traces feed the eval set monthly. We sample ~1% of production calls into a review queue; the domain expert grades a subset; the graded examples land in the eval set as regression cases. This is what stops the slow drift that hits most LLM apps 90 days post-launch — a model upgrade or a prompt edit subtly changes the failure mode, and without sampled feedback you don't notice until users complain. Regression alarms fire on any 5-point drop on any of the four gates.
For high-stakes workloads we add a second layer: per-tenant cost budgets enforced at the proxy, per-user rate limits to catch runaway scripts, and an alert when the cache hit rate drops more than 10 points in a 24h window (usually the canary for a system-prompt edit that broke caching). About a quarter of LLM App Audits we run trace problems back to observability gaps — you can't fix what you can't see.
Fine-tuning — when it's worth it.
Llm fine tuning services live at a specific decision point: when prompting and retrieval don't cover the workload but a small domain dataset does. Most engagements don't need fine-tuning; the ones that do tend to be obvious.
We support three fine-tuning shapes. LoRA on open-weights bases (Llama 4, Mistral, Qwen) for domain-vocab and output-format problems — usually 1,500–10,000 examples, single A100 80GB, 8–24 hours of training time, ~$300–$1,200 per run. QLoRA for the same shapes when memory is constrained, with a small quality trade-off versus full LoRA on 70B models. Hosted fine-tunes on OpenAI or Anthropic when the base model has to stay frontier-class and the dataset is small enough to be cost-effective at hosted-fine-tune pricing.
The order is always the same. Baseline the prompt-only path against the eval set. If the baseline is good enough, ship it — about a third of fine-tuning engagements end here, and that's a successful outcome. If the baseline is not good enough, build the fine-tuning dataset (this is usually the bottleneck, not the training run), run the fine-tune, score against the same eval set. Deploy only if the fine-tune wins on the eval set with a margin that justifies the iteration cost. We've shipped engagements where the fine-tune lost by 1–2 points and we still recommended sticking with prompt-only because the iteration overhead wasn't worth the marginal quality.
A specific real example: a health-tech client's clinical-note structuring task. GPT-5 baseline scored F1 ~71 on the clinician-graded eval set. We benchmarked a QLoRA on Llama 4 70B trained on 8,400 redacted notes — F1 ~85, plus the data stayed in-cloud and per-call cost dropped 60%. The fine-tune shipped. Same shape, different client, finance domain: baseline scored 89, fine-tune scored 91. We didn't ship — the 2-point lift didn't justify the operational ownership of an adapter. Most fine-tuning decisions are this kind of close call; the eval set has to be the tiebreaker, not the wishful thinking.
Prompt engineering at scale.
Prompt engineering services on production LLM systems are mostly version control, eval correlation, and cost discipline — not clever phrasing. The cleverness happens in the eval set design.
Production prompts live in version control, not in a notebook or a CMS. Every prompt change ships with a commit message, a diff, and an eval-set re-run. If the eval scores drift, we know which prompt edit caused it — usually within a few hours of the change landing. The prompt versioning pattern we ship on most Production Builds keeps the active system prompt as a git artifact with a SHA pinned in the deploy config; rollback is a config change.
For prompt patterns themselves: structured-output enforcement via Pydantic or Zod schemas catches more than chain-of-thought tricks on most production workloads — the model can't drift if the output has to parse. Few-shot examples live in a separate corpus that the prompt builder injects at request time; this lets us A/B example sets without touching the system prompt. Caching prefixes are designed in from day one — anything stable goes at the top of the prompt; anything dynamic goes after, where it doesn't bust the cache. System-prompt isolation prevents user content from being interpreted as instructions; we wrap user input in unambiguous delimiters and tell the model so explicitly.
We ship our prompt engineering services as part of the Build, never as a standalone — prompting without eval is just creative writing. For teams that want the prompts handed off cleanly, every Production LLM Build ends with a documented prompt library plus a small prompt-versus-fine-tune decision memo covering when to escalate which knob next.
How a build runs — eval-first, every time.
The same six-step process runs across a 4-week Pilot and a 16-week Production Build. The gates change in depth, not in shape. Every step has a deliverable, a named owner, and a gate criterion — pass or rework.
Discovery
Workload shape, eval surface, cost target, residency posture. Models aren't picked yet — that's week 3.
Eval set
30–80 domain-expert-graded examples covering main paths and edge cases. Lands before any prompting.
Baseline
Three to four models scored against the eval set. Cost-adjusted quality wins, not benchmark theatre.
Iteration
Prompt, routing, retrieval (if RAG), or fine-tune (if the data justifies it). Each change re-scored.
Deploy
Auth, rate-limit, Langfuse observability, model fallback, cost guardrails, regression alarms.
Running
Weekly eval, drift alarms, prompt iteration log, model-upgrade regression checks. The eval set grows.
Hosted versus self-hosted — the side-by-side.
A reference table covering the practical trade-offs. The picker above gets you a recommendation; this table gives you the numbers to defend it in a procurement conversation.
| Hosted (provider) | Self-hosted (yours) | |
|---|---|---|
| Hosted (OpenAI / Anthropic / Vertex) | Fast to ship, top-of-class quality, per-token pricing — no inference infra to run | |
| For most teams below ~500k requests/day, hosted is strictly faster to ship and cheaper to operate. No GPU cluster to provision, no CUDA dependency hell, no on-call rotation for inference infra. The switching cost if you later need self-hosted is a model swap, not a rewrite. | ||
| Self-hosted (Llama / Mistral / Qwen on your cloud) | Data stays in your cloud, fixed infra cost, model under your control, custom quantization + adapters | |
| Self-hosting is the only option when a data-residency rule forbids third-party inference — common in EU healthcare (GDPR + national health-data law), defence-adjacent contracts, and financial services with ring-fenced data mandates. It's also the only path to QLoRA/LoRA fine-tunes that you own outright. | ||
| Best when | Quality > cost; no residency rule; team has no ops capacity | Regulated data; very high volume; or strategic need to own the model |
| Latency floor | Provider-dependent; rarely below 600ms TTFT for flagships | Sub-200ms TTFT achievable with vLLM continuous batching |
| vLLM's continuous batching scheduler saturates GPU memory instead of waiting for fixed batch boundaries — at typical QPS, p50 TTFT on a single A100 running Llama 3 70B sits around 130–180ms. Provider network round-trips alone add 80–150ms before a single token is returned; flagship models (GPT-5, Claude Opus) sit materially above 600ms TTFT at launch load. | ||
| Per-1K-tokens cost | $0.30–$15 input/output (flagships); $0.05–$0.50 (small models) | $0.05–$0.40 amortised on dedicated A100s once volume crosses the break-even |
| The ranges genuinely overlap. At low volume the amortised A100 cost looks great on paper but the fixed reservation cost ($10–25k/mo per A100 on-demand) dominates unless utilisation is high. At very high volume the per-token numbers flip — but the real lever is utilisation rate, not raw token price. | ||
| Operational load | Low — provider runs inference, you run the app | Higher — you (or we) run GPU pools, batching, quantization, autoscaling |
| Break-even point | <100k requests/day — hosted almost always wins | >1M requests/day on a single workload — self-hosted economics flip |
Where teams have shipped.
Six anonymised engagements across recent quarters. Workloads, segments, and outcome metrics are real; brand removed under NDA.
Internal LLM assistant on Claude + private corpus
Slack-deployed advisor that pulls from a redacted Confluence + Snowflake corpus. Refuses cleanly on out-of-corpus. Auth scoped per team; Langfuse traces every call; cost capped per-user per-day. The eval set is graded monthly by two senior analysts.
QLoRA fine-tune on Llama 4 for clinical note structuring
8,400 redacted clinical notes, QLoRA fine-tune on a single A100 80GB, eval against a clinician-graded reference set. Self-hosted on the client's AWS via vLLM. We benchmarked GPT-5, Sonnet, and the fine-tune; the fine-tune won on F1 and on per-request cost simultaneously.
Model routing + prompt caching on a GPT-5 stack
A small intent classifier routes ~70% of traffic to GPT-5 mini and ~30% to GPT-5. Anthropic-style prompt caching on the stable system prompt cut the cached portion by 85%. Eval gates guard the routing thresholds — quality on both tiers tracked weekly.
Voice intake on GPT-5 Realtime, sub-400ms p95
Patient intake voice agent on GPT-5 Realtime API with a custom safety filter pre-generation. Voice-RAG pulled from a HIPAA-compliant Postgres index running BGE embeddings on the same VPC. We sit at p95 ~360ms TTFT; clinical reviewer signed off on faithfulness at 96%.
Bill-of-lading extraction with GPT-5 Vision + structured output
Scanned BoLs across 14 carriers. GPT-5 Vision extracts to a Pydantic schema with structured output, validated against a rules engine, and posted into the TMS. Carrier-specific layouts get few-shot examples; the rare edge case routes to human review with the model's draft attached.
Qwen 3 72B for Chinese-first tutor copilot
Self-hosted Qwen 3 72B on a 4×A100 pool for a Chinese / English bilingual tutoring app. We benchmarked it against Llama 4 70B and GPT-5; Qwen won on Chinese fluency and cost, GPT-5 won on English. The app routes by detected language at the prompt layer.
LLM consulting services — advisory engagements.
Sometimes the right answer isn't "build the app." Our llm consulting services cover the strategic decisions that need to land before any code ships — and occasionally the engagement ends "don't build, buy this off-the-shelf tool instead."
Model selection audit
Two-week engagement: workload audit, eval-set design, head-to-head benchmark of 3–5 candidate models, build-vs-buy-vs-fine-tune memo.
LLM cost projection
TCO model over 12 months: hosted vs self-hosted, routing scenarios, prompt-cache assumptions, traffic-growth sensitivities. Lands in 1–2 weeks.
Provider roadmap
OpenAI vs Anthropic vs Vertex vs open-weights, with a procurement + risk lens. Useful when leadership needs the trade-offs on paper.
Build-vs-buy assessment
Custom app vs SaaS vs hybrid. Honest stop-recommendation when an off-the-shelf tool covers your workload — that's about 1 in 5 consulting engagements.
Llm consulting is what we run when the question is "should we build this?" not "how do we build this?" — a 1–2 week engagement with a written memo, sometimes a benchmark, sometimes a TCO model. About one in five consulting engagements ends with "don't build, here's the SaaS that does this," which is a successful outcome we sometimes have to talk new clients into believing we mean. For broader strategic AI work that spans LLM + RAG + agents, see our strategy advisory practice.
Four ways to start.
Pilot one workload, intake to live.
- One scoped use case
- Eval set (30–80 examples)
- 3–4 model baseline scored
- Working prototype
- Demo + build-or-stop memo
- Production deploy
- Fine-tuning
- Multi-workload scope
Full LLM app with eval gates.
- All Pilot deliverables
- Auth · rate-limit · observability (Langfuse)
- Model routing + cost guardrails
- Eval gates baked into the deploy pipeline
- Adversarial eval + safety classifier
- Four weeks of post-launch iteration
LoRA / QLoRA / hosted fine-tune.
- Dataset curation + cleaning
- Eval set + prompt-only baseline
- Fine-tune run + eval comparison
- Deploy if it wins; recommend prompt-only if not
- Weights + adapters + runbook transferred
Eval, cost, latency, safety review.
- Coverage audit on current eval
- Cost + latency profiling
- Adversarial / jailbreak test
- Prioritised fix-list + follow-on quote
Common LLM development questions.
Hosted (Claude / GPT) or self-hosted (Llama) — how do we decide?
Default to hosted unless you have one of three triggers: a residency rule on the data, very high steady-state volume (above ~1M requests/day on a single workload), or a strategic need to own the model weights. Self-hosting is a meaningful operational commitment — GPU pools, autoscaling, quantization tuning, batching — and we'll only recommend it when the math actually works.
The interactive picker above walks through the decision in 2–3 questions. Most clients we see end up on hosted with an Anthropic Sonnet baseline and GPT-5 mini as the cheap-tier in a routed stack; about a quarter end up self-hosted (mostly healthcare and finance with strict residency). Hybrid is rare but real for clients with mixed regulated and non-regulated traffic. Our piece on hosted vs self-hosted LLMs covers the break-even math in detail.
When does fine-tuning beat prompt engineering and RAG?
Fine-tuning wins on three specific shapes. One: when the output format is so consistent that prompting struggles (a custom JSON schema, a regulatory document format, code in a niche DSL). Two: when the domain language has terms the base model fumbles — clinical jargon, legal idioms, internal product vocabulary that the base model never saw enough of. Three: when latency or cost have to drop and you've exhausted the prompting and routing levers.
Prompt + RAG usually wins outside those three shapes. Fine-tunes are slower to iterate on, harder to debug, and they don't compose with citations the way RAG does. We baseline prompt-only and prompt+RAG before we propose a fine-tune; if either of those clears the eval bar, the fine-tune budget gets spent on the eval set instead. Our llm fine tuning services live inside the Fine-tuning engagement shape above — about a third end in "the fine-tune didn't beat the baseline, so we shipped prompt-only," and that's a win, not a loss.
Composing them is common too: RAG for facts, a small LoRA fine-tune for output style and domain vocab. We scope both at week 2 if the use case has both shapes. The picker on our grounded retrieval pipelines page covers the RAG vs fine-tune decision tree in more depth.
How do you measure LLM application quality?
Four metrics, scored separately because they fail differently:
- Task score — app-specific grader on a domain-expert-graded eval set (30–80 examples). The eval set is the most important deliverable of the engagement; your domain expert grades, we facilitate.
- Hallucination rate — claims with no factual support, scored by LLM-as-judge (Claude Sonnet 4.6) with human spot-check on the disputed cases. Hard gate before production.
- P95 TTFT — full time to first token including any pre-generation safety filter. Voice apps target sub-400ms; chat apps sub-800ms.
- P50 cost per request — modelled at discovery, tracked weekly post-launch via Langfuse. Surprise bills aren't a surprise.
The default eval stack is Inspect AI as the harness, RAGAS for any retrieval-adjacent metric, custom graders for app-specific scoring, and Langfuse for production trace sampling that feeds the eval set monthly. Regression alarms fire when any metric drops more than 5 points on a model upgrade. Our piece on eval framework comparison covers when to reach for which tool.
How do you handle prompt injection and jailbreaks?
Defence in depth. Input classifier (Llama Guard or a custom classifier) runs pre-generation on user input; structured-output enforcement constrains what the model can return; system-prompt isolation prevents user content from leaking into instructions; output filter classifies pre-response. Every production deploy ships with a documented threat model and an adversarial eval set — 30–80 known jailbreak attempts plus domain-specific abuse patterns.
We don't claim "unbreakable" because that's not a real thing. We claim measured — we know our refusal rate on the adversarial set, we know our false-positive rate on the benign set, and we know the trade-off we're holding. For high-stakes workloads (clinical, financial advice, legal) we add a second classifier post-generation and a human-in-the-loop for refused-but-disputed cases. Our customer-side budget for safety work is usually 10–15% of the build effort; trying to do this cheaper is the cheap way to ship the wrong thing.
What about cost runaway — how do you prevent surprise bills?
Five mechanisms layered together. Per-request cost budgets: hard caps that refuse the request rather than emit it when the prompt is bigger than expected. Rate limits per user, per team, per endpoint — usually configured via Langfuse or a thin proxy. Model routing: easy queries to GPT-5 mini or Haiku, hard queries to flagships, controlled by a small classifier. Prompt caching on stable system prompts and any prefix that repeats — Anthropic's cache hits often run 80–90% on enterprise system prompts. Batch API for non-interactive workloads, which knocks ~50% off list at the price of higher latency.
Cost is modelled in week 3 of any engagement using the actual expected traffic shape, not a marketing average. Once live, weekly cost reviews catch drift before it compounds; the regression alarm trips on a 25% drift over two weeks. We've seen "surprise" bills usually trace to one of three things — a router threshold miscalibration, a cache hit rate that quietly collapsed after a system-prompt edit, or a runaway agent loop that didn't have a step-budget. All three are observability-detectable if you instrument the right things on day one.
Can you migrate us from one model provider to another?
Yes. Provider migration is part of about a quarter of our LLM App Audit engagements. The pattern is dual-write first (calls go to both providers for a sample of traffic), eval parity checks against your real eval set, then read-cutover with a fast rollback path. We've done OpenAI → Anthropic, Anthropic → Vertex, and hosted → self-hosted migrations with zero downtime when the abstraction was clean.
The abstraction matters more than the migration. We bake a thin provider-abstraction layer into every Production Build for exactly this reason — model swaps shouldn't require touching app code. If you're locked into a vendor SDK now, the migration scope grows because we have to abstract first, then migrate; that's a 6–10 week engagement instead of 3–4. Either way, the eval set decides whether the migration ships, not vendor benchmarks.
Who owns the prompts, fine-tuned weights, and the eval set?
You do. All artifacts transfer into your repository under the SOW: system prompts, few-shot examples, eval set, fine-tuned adapter weights, the Langfuse instance configuration, the runbook. We retain no rights to your prompts, weights, or data. Paiteq keeps engineering learnings — patterns, methodologies, anonymised case-study takeaways for our internal playbook — but never your specific artifacts.
This matters more than people realise on a first build. We've onboarded several clients whose previous vendor "owned" the prompts as IP, which made provider migration impossibly expensive. We refuse that pattern. Your business knowledge lives in the eval set and the prompts; treating either as vendor IP would be malpractice.
What's a realistic budget and timeline for production LLM development services?
The four engagement shapes above are fixed-scope and fixed-duration; we hold the price band on the contact call because workload depth, residency posture, and integration count swing the budget meaningfully. Rough order of magnitude:
- LLM App Pilot (2–4 weeks): small enough that stopping at the pilot is a real option. About one in three pilots end at the pilot because the eval surface wasn't measurable or the workload turned out to be a generation problem better served by a generative AI engagement.
- Production LLM Build (8–16 weeks): the bulk of our llm development services revenue. Includes four weeks of post-launch iteration baked into the SOW.
- Fine-tuning engagement (6–10 weeks): includes the head-to-head against the prompt-only baseline. We've shipped engagements that ended "the fine-tune didn't beat the baseline, so we shipped prompt-only" — that's a successful outcome.
- LLM App Audit (2–3 weeks): outputs a prioritised fix-list and a fixed-scope follow-on quote if you want us to ship the fixes.
For llm consulting services (model selection audits, cost projection, provider roadmap), 1–2 week engagements at a flat fee. The full breakdown lives in our strategy advisory practice for cross-service strategic work.
Let's ship the LLM app.
Pilot in 2–4 weeks. Production build in 8–16. Fine-tune in 6–10. Audit in 2–3.