Machine learning development services + custom ML development — calibrated, drift-monitored, owned by you
Custom ML development from a machine learning development company that ships gradient boosting on tabular, PyTorch on vision, hierarchical forecasting at SKU scale, and two-stage ranking. Every model graded on calibration, slice-fairness, and drift — not just headline AUC.
Classical ML vs LLM — when each one wins.
The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate, more explainable, and roughly a hundred times cheaper. Custom ML development is still the right answer for most business-prediction problems — fraud, churn, credit, demand, ranking. The two families don't compete head-to-head; they win on different shapes of input and output.
| | LLM-shaped problem | Machine-learning-shaped problem |
|---|---|---|
| What's predicted | Free-form text, tokens, generated content | A number, a class, a rank, a probability |
| Training data | Trillions of internet tokens (pre-trained) | Your labelled rows, 10k to 100M of them |
| Eval signal | Human-graded rubric, RAGAS, LLM-as-judge | AUC, RMSE, NDCG, calibration, lift |
| Classical ML wins on eval rigour. AUC, RMSE, and calibration curves are deterministic — run the same held-out set twice and you get the same number. LLM eval relies on human graders or an LLM-as-judge, both of which introduce variance and cost. For regulated decisions (credit, fraud, medical triage) that measurability is often a compliance requirement, not just a preference. | ||
| Latency floor | 200–2,000ms per call (frontier hosted) | 1–50ms per inference (XGBoost on CPU) |
| Unit economics | $/M tokens — scales with output length | Pennies per million inferences amortised |
| At production volume, the gap is 100×–1,000× per call. A LightGBM model serving 10M inferences/day on a single CPU node costs roughly $30–50/month in compute. The equivalent call volume through a frontier-hosted LLM costs $3,000–$15,000/month depending on output length. Classical ML doesn't win on capability — it wins on the unit-economics of high-volume, low-complexity decisions. | ||
| Failure mode | Hallucinated facts, off-brand drift | Calibration drift, label leak, concept drift |
| Counter-intuitively, LLM failures are easier to catch. A hallucinated output is visible: a customer escalates, a reviewer flags it, a monitor catches the token. Classical ML failure is often silent: the fraud model's reported AUC holds at 0.87 while the population has spent six months drifting away from the data it was trained on. The model is still scoring, the dashboard is still green, and the business outcome is quietly degrading. That's why we instrument PSI and calibration checks from day one. | ||
| When it wins | Unstructured input, generation, reasoning | Tabular signal, ranking, forecasting, risk |
| This row is the one that gets teams in trouble. The categories aren't competing; they describe different input shapes and output contracts. The 2026 mistake is defaulting to an LLM because it's the obvious tool, then discovering in week six that a gradient-boosted model trained on your warehouse data is 3× more accurate and explainable on demand. We do the framing call before scoping so the wrong tool doesn't cost you a quarter. | ||
We won't sell you a classical ML build for a generative problem, and we won't sell you an LLM engagement for a problem a gradient boosting model would solve in week three. The framing call is free.
Four engagement shapes, fixed-scope.
Pilot, Production Build, Ranking-or-Recommendation, or an ML Audit. Every machine learning development services engagement maps to one of the four. Mixed engagements bill as two consecutive shapes, not an open-ended retainer.
Four ML problem families. We pick on data shape, latency, and explainability.
Tabular-classify, forecasting, rank-or-recommend, and deep-vision cover roughly 95% of our custom ML development work. Framework, eval signal, deployment posture, and unit economics differ per family. About 60% of engagements start tabular; a third are forecasting or ranking; the rest deep or hybrid.
TABULAR-CLASSIFY
The biggest revenue line in classical ML in 2026: gradient boosting on tabular features still wins most business-prediction problems. XGBoost, LightGBM, or CatBoost as the model; Platt or isotonic calibration; SHAP for the explanation surface compliance always asks for. About 35% of our custom ML development engagements.
Good fit:
- Tabular features (counts, ratios, categoricals)
- Sub-50ms inference budget on CPU
- Need explanation per prediction (SHAP, regulator-readable)
- Class imbalance manageable with weighting or SMOTE
- 50k+ labelled rows already in your warehouse
Poor fit:
- Unstructured input (text, image, audio): neural nets or LLMs win
- Very low data regimes under 5k rows: classical statistics may beat ML
- Streaming features with sub-1ms latency: feature-engineering overhead kills it
FORECASTING
Forecasting services are a quietly large category: demand planning, supply chain, financial close, capacity. The 2026 stack is mixed: Prophet for the explainable baseline, LightGBM-on-lags for the production workhorse, TimeGPT or Chronos for the foundation-model option when you have enough series. We default to LightGBM-on-lags, which beats Prophet on accuracy and TimeGPT on cost.
Good fit:
- Time-series ML workloads with seasonality and lag structure
- Need explanation of the forecast (finance and ops teams won't sign off on a black box)
- Mixed-frequency or hierarchical (per-SKU per-region) series, which gradient boosting handles natively
- Budget allows offline training with cheap online serving
Poor fit:
- Single short series under 200 observations: classical ARIMA or exponential smoothing wins
- Pure anomaly detection: different family of models
- Cold-start product where there's no history: forecasting is the wrong frame
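To make "LightGBM-on-lags" concrete, here is a minimal sketch of the lag-feature step in plain Python. The function name `make_lag_rows` and the toy demand series are illustrative assumptions, not part of any delivered pipeline; a real build would add calendar, promo, and hierarchy features and pass the rows to a gradient-boosted regressor.

```python
from typing import List, Sequence, Tuple


def make_lag_rows(
    series: Sequence[float], lags: Tuple[int, ...] = (1, 7)
) -> Tuple[List[List[float]], List[float]]:
    """Turn one series into (features, target) rows using past values only.

    Hypothetical helper: each row holds the values observed `lag` steps
    before the target, so nothing from the future leaks into the features.
    """
    max_lag = max(lags)
    features: List[List[float]] = []
    targets: List[float] = []
    for t in range(max_lag, len(series)):
        features.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return features, targets


# Toy daily demand for one SKU (made-up numbers).
demand = [10, 12, 13, 15, 14, 16, 18, 20, 21, 23]
X, y = make_lag_rows(demand, lags=(1, 7))
# Each row of X is [value 1 step back, value 7 steps back]; y is the value to predict.
```

The first seven observations are dropped because they have no complete lag history, which is one reason very short series suit classical methods better.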
RANK & RECOMMEND
Ranking and recommendation systems development share a two-stage architecture — fast candidate generation (embeddings + ANN or co-occurrence pulls 200–2,000 candidates), then a heavyweight cross-encoder or gradient-boosted re-ranker scores the shortlist. Eval signal: NDCG@k offline plus a CTR or conversion A/B test online. We ship this for e-commerce catalogues, content feeds, and internal-search systems.
Good fit:
- Catalogue size over ~10k items where exhaustive scoring is too slow
- Ranking quality matters: NDCG, MAP, MRR are tracked
- You have both interaction logs and item metadata
- Re-ranking budget allows 10–100ms cross-encoder per query
Poor fit:
- Tiny catalogues under 200 items: a rule-based ranker is fine
- Pure cold-start with no logs: content-based heuristics first, ML later
- You can't run an online A/B test: offline NDCG alone is a noisy signal
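For readers who haven't computed the offline signal by hand, a minimal NDCG@k sketch follows. It assumes the linear-gain variant of DCG (some harnesses use an exponential gain instead), and the relevance grades are made up for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the served order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades in the order the ranker served the items (illustrative).
served = [1, 3, 2]
score = ndcg_at_k(served, k=3)  # below 1.0 because the best item is in slot 2
```

A perfectly ordered list scores 1.0, so the metric reads as "fraction of the achievable ranking quality" on that query.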
DEEP / VISION
The narrowest slice — production-grade vision or signal models trained on customer data. Defect detection on a factory line, OCR on a custom document type, sensor classification on industrial telemetry. PyTorch as the framework, timm for vision backbones, an open-weight base (ConvNeXt, EVA, or a vision-LLM as feature extractor) fine-tuned on labelled customer data. Smaller than tabular for us, but engagements run longer and the IP compounds more.
Good fit:
- Image, signal, or sensor input where pixels/samples carry the signal
- 5k–500k labelled examples on the target task
- Latency budget allows 50–500ms on GPU or 10–50ms on a quantised CPU runtime
- Use case has a clear failure cost (defect, fraud, safety) that justifies the build
Poor fit:
- OCR on standard document types: hosted Vision-LLMs (GPT-5 Vision, Claude) are cheaper and beat you on quality
- Cold-start without any labelled data: synthetic data and active learning first, model second
- General-purpose recognition: pretrained zero-shot is good enough
Six model families. We pick per data shape and per latency budget.
No house model. We benchmark gradient boosting, neural nets, statistical baselines, and the vision-LLM-as-encoder pattern against the same held-out set, and ship the model that beats baseline by the agreed margin without breaking calibration. Roughly six in ten predictive analytics services we ship end on LightGBM or XGBoost.
The default winner on tabular data in 2026 — still beats deep learning on most business-prediction tasks. LightGBM for speed at scale, XGBoost for the broader SHAP/ONNX ecosystem, CatBoost for messy categoricals without one-hot encoding. Sub-millisecond inference on CPU. Platt and isotonic calibration mature.
About 60% of predictive analytics services we ship lead with LightGBM or XGBoost. Default for churn, fraud, credit risk, conversion, hierarchical demand forecast, ranking re-rankers. Anywhere SHAP explainability is a compliance requirement.
Unstructured input. Sub-millisecond streaming features where loading the model is the bottleneck. Tasks under 2,000 rows where a logistic regression with strong features is more honest.
We hand off the booster file, an inference shim, a SHAP explainer, and a calibration sidecar — small enough to deploy on the existing service stack, no GPU required.
The 2026 default for neural-net work that isn't an LLM. PyTorch 2.x core, Lightning for training-loop scaffolding, timm for vision backbones. Inference via ONNX Runtime or TensorRT on GPU; quantised CPU via Intel Neural Compressor or AWS Neuron for edge.
Image, audio, signal, or long-sequence input the LLM doesn't handle economically. Custom embeddings for a recommender. Fine-tuning a vision backbone on customer data. Custom encoders for a downstream classical head.
Tabular workloads — gradient boosting beats you on accuracy and cost. Pure LLM workloads — that's <a href="/services/llm-development/">our LLM development services</a> sibling.
We've shipped PyTorch-based defect detection on a factory line, an OCR encoder for a custom document type, and the content-tower of a two-tower recommender. ONNX Runtime for inference — easier to deploy than a Python service.
The honest baseline that beats half the deep-learning press releases. Linear and logistic regression with proper feature engineering, regularised regression (ridge, lasso, elastic). statsmodels for ARIMA, exponential smoothing, Holt-Winters. Cheap, explainable, calibration usually better out of the box than tree ensembles.
Datasets under 5,000 rows, constrained deployment (mobile, on-prem no-GPU), or regulators that care about explanation. Always as the baseline against which we compare LightGBM and PyTorch.
Large datasets where regularisation can't capture non-linear structure. Tasks where feature engineering takes longer than fitting LightGBM.
Every engagement starts with a scikit-learn baseline. About one in five ends up shipping the baseline as the production model — usually risk scoring where the regulator wins.
Three families cover most forecasting services work. Prophet is the explainable baseline with built-in seasonality. LightGBM-on-lags (engineered lag features into gradient boosting) is the workhorse — beats Prophet on accuracy at most scales. Nixtla's TimeGPT and Amazon Chronos are the foundation-model option when you have hundreds of series.
Demand forecasting at SKU scale, financial close, capacity planning, claims volume. LightGBM-on-lags 60% of the time; Prophet 25% for explainability; TimeGPT or Chronos 15% when series count justifies the cost.
Single short series under 200 observations — ARIMA wins. Pure anomaly detection — different family. High-frequency tick data — specialist stack we route out.
We shipped a hierarchical demand-forecast across 12,000 SKU-region pairs on LightGBM-on-lags that cut a retail client's holding-cost overrun by 18% in a quarter. Foundation-model forecasting was tested and lost on cost.
The candidate-generation half of modern ranking and recommendation systems development. Embeddings from a custom two-tower, a pre-trained sentence encoder, or a vision encoder. ANN indexes (Faiss self-hosted, ScaNN on Google, pgvector for Postgres-native) pull 200–2,000 candidates under 10ms.
Catalogue-scale ranking and recommender systems. Semantic search where keyword search misses. User-similarity lookups. Anywhere brute-force scoring blows the latency budget.
Catalogues under a thousand items — exhaustive scoring is faster than building an ANN index. RAG retrieval — see <a href="/services/rag-development/">our RAG development services</a>.
Faiss self-hosted; pgvector when data already lives in Postgres and ops capacity is thin; ScaNN on Google Cloud. Two-tower trained in PyTorch, exported to ONNX, served behind the ANN index.
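The two-stage shape can be sketched without any index library. In the toy example below an exhaustive cosine scan stands in for the ANN index (Faiss, ScaNN, or pgvector in production), and a lookup table of made-up scores stands in for the trained re-ranker; all names and numbers are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_candidates(
    query: Sequence[float], item_vecs: Dict[str, Sequence[float]], n: int
) -> List[str]:
    """Stage 1: cheap similarity search. The full scan here is the part an ANN index replaces."""
    ranked = sorted(item_vecs, key=lambda item: cosine(query, item_vecs[item]), reverse=True)
    return ranked[:n]


def rerank(candidates: List[str], score_fn: Callable[[str], float], k: int) -> List[str]:
    """Stage 2: the expensive scorer only sees the shortlist."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
reranker_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.4}  # hypothetical model output

shortlist = generate_candidates([1.0, 0.0], items, n=3)
top = rerank(shortlist, lambda item: reranker_scores[item], k=2)
```

Item "c" never reaches the re-ranker even though its re-ranker score is decent, which is the trade-off the candidate count (200–2,000 in practice) is tuned to control.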
A 2026 pattern that didn't exist three years ago — use a frontier vision-LLM as feature extractor on image or document input, then put a classical model on the embeddings. GPT-5 Vision and Claude Sonnet 4.6 both expose embedding endpoints; Gemini 3.0 Pro has the cleanest multimodal one. Cuts the labelled-data requirement by an order of magnitude.
Image classification with 500–5,000 labels instead of 50,000. OCR-grade document classification where structure carries the signal. Sentiment plus extraction on screenshots, charts, scanned forms.
Latency-tight on-device workloads — the vision-LLM call kills the budget. Regulated workloads where data can't leave the perimeter — fine-tune a self-hosted vision backbone instead.
We pair this with our <a href="/services/generative-ai/">generative AI practice</a> regularly — vision-LLM upstream as encoder, classical model downstream as predictor. Halves the labelling spend on most image-classification builds.
The ML model development pipeline we ship: four phases, eval-gated.
The model is roughly a fifth of a real ML model development engagement. The other four-fifths is the data audit, feature pipeline, calibration layer, and drift instrumentation. Skipping any of them is how machine learning solutions silently fail at month six: the model that beat baseline on day one is now mis-calibrated on a shifted population.
- 01 Data audit + leakage map
Label quality, sample-selection bias, leakage paths, feature availability at inference time. Roughly half the engagements we audit have a leak we close before training — usually a feature only computable post-event. We document every join and time-of-availability per feature; the production system can only train on features it can compute at decision time.
- 02 Baseline + feature engineering
scikit-learn baseline always — logistic or linear with regularisation, fit on a starter feature set. The candidate has to beat the baseline by the agreed margin or we don't ship it. Feature engineering layered after: aggregations, lags, target encodings, learned embeddings. The pipeline ships with the model as a single deliverable.
- 03 Bench + calibration
LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model bench-raced on the frozen held-out set. Hyperparameters via Optuna or a structured grid — not random; the search log is kept. Calibration fitted as a Platt or isotonic sidecar. The model that ships isn't the one with the highest raw AUC; it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.
- 04 Deploy + drift instrumentation
FastAPI or BentoML serving on the runtime appropriate to the latency budget; ONNX or pickle for the artefact. Population Stability Index (PSI) per feature on a 30-day rolling window; alerting at 0.2, pause-decisioning at 0.5. Weekly calibration check on production data. The runbook names trigger conditions, on-call rota, and retrain procedure in writing; drift detection is not a mental checklist on a single engineer's laptop.
Retrain cadence ships in the runbook: quarterly by default, monthly when drift is consistent. The client's team runs the retrain script after handoff; we operate it on retainer only when asked.
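The time-of-availability rule from the data audit phase lends itself to a mechanical check. The sketch below is one possible shape for it, assuming each feature is documented with the delay between the decision moment and the moment its value first exists; `FeatureSpec`, `find_leaky_features`, and the feature names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class FeatureSpec:
    name: str
    # Delay between the decision moment and when the value is first available.
    # Zero or negative: known at decision time. Positive: only known afterwards.
    available_after_decision: timedelta


def find_leaky_features(specs: List[FeatureSpec]) -> List[str]:
    """Return features that cannot be computed at decision time."""
    return [s.name for s in specs if s.available_after_decision > timedelta(0)]


specs = [
    FeatureSpec("txn_amount", timedelta(0)),
    FeatureSpec("avg_spend_30d", timedelta(hours=-1)),
    FeatureSpec("chargeback_filed", timedelta(days=14)),  # post-event: a leak
]
leaks = find_leaky_features(specs)
```

Running a check like this against the feature list before training is cheaper than discovering after launch that the strongest feature does not exist at scoring time.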
Four gates. Every model. Every week.
Custom machine learning models without these four gates drift silently. Headline AUC stays clean while the business outcome degrades. Every model we hand off carries the gates as a contract — trigger conditions that fire a retrain included.
- 01 Held-out AUC / RMSE / NDCG: beats baseline by ≥3pts
Frozen test set carved out at engagement start. AUC for binary classification, RMSE or MAPE for regression and forecasting, NDCG@k for ranking. Baseline is whichever non-ML rule the client already runs. If the candidate can't beat baseline by three points on the headline metric, we don't ship. Roughly one engagement in eight ends up shipping the baseline because the ML model couldn't clear the bar.
If the candidate doesn't beat baseline, we don't paper over it — we re-frame, harvest more labels, or close at the Pilot gate and bill against the audit memo. Confident-but-wrong ML is worse than no ML.
- 02 Calibration error: Brier < 0.18, ECE < 4%
Calibration matters more than raw AUC in most business-prediction problems — a churn score that's well-ordered but mis-calibrated breaks every downstream business rule. We measure Brier plus ECE on the held-out set, fit a Platt or isotonic sidecar where needed, and re-measure. Re-checked weekly on production data.
If calibration drifts above the threshold for two weeks, we re-fit the sidecar before re-fitting the model. Most drift comes from population shift, not model decay — the calibration layer is the right place to absorb it.
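As a reference for what the calibration gate measures, here is a minimal pure-Python sketch of the Brier score and expected calibration error (ECE) with equal-width bins. The ten-bin choice and the helper names are assumptions for illustration; the thresholds are the ones stated in the gate.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], p_pred: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted average of |mean predicted probability - observed rate| per bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - observed)
    return ece


def passes_calibration_gate(brier: float, ece: float) -> bool:
    """Gate thresholds taken from the text: Brier < 0.18 and ECE < 4%."""
    return brier < 0.18 and ece < 0.04


y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9]
gate_ok = passes_calibration_gate(brier_score(y, p), expected_calibration_error(y, p))
```

On this four-row toy set the Brier score is low but the ECE is not, which is the pattern the gate exists to catch: good ordering with probabilities that are still off.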
- 03 Concept-drift detection: PSI < 0.2 per feature
Population Stability Index (PSI) per feature on a rolling 30-day window. Above 0.2 the feature has drifted enough to inspect; above 0.5 the model is outside the training envelope. Paired with output-distribution monitoring. Tracked in Evidently or a Postgres dashboard, depending on the client's MLOps capacity.
Any feature breaching 0.5 PSI fires a Slack alert and pauses automated decisioning if the use case warrants it (fraud, credit, healthcare). Retraining usually quarterly, monthly when drift is consistent. Trigger conditions documented in the runbook.
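A minimal sketch of the PSI check for one numeric feature follows. It assumes bucket edges fixed from the training window and a small floor on empty buckets to keep the logarithm defined; the function names and sample values are illustrative, while the 0.2 and 0.5 thresholds are the ones stated above.

```python
import math
from typing import List, Sequence


def bucket_shares(values: Sequence[float], edges: Sequence[float], eps: float) -> List[float]:
    """Share of values per bucket; edges are upper bounds, the last bucket is open-ended."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(
    expected: Sequence[float], actual: Sequence[float],
    edges: Sequence[float], eps: float = 1e-6,
) -> float:
    """PSI: sum over buckets of (actual - expected) * ln(actual / expected)."""
    e = bucket_shares(expected, edges, eps)
    a = bucket_shares(actual, edges, eps)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_action(value: float, inspect_above: float = 0.2, pause_above: float = 0.5) -> str:
    """Map a PSI value to the action named in the runbook thresholds."""
    if value > pause_above:
        return "pause"
    if value > inspect_above:
        return "inspect"
    return "ok"


training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
production = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]  # shifted upward
edges = [2.5, 5.0, 7.5]
action = drift_action(psi(training, production, edges))
```

Fixing the edges at training time matters: recomputing buckets on production data would hide exactly the shift the metric is meant to expose.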
- 04 Fairness · slice metrics: ≤5pt AUC gap across slices
AUC and calibration measured separately on the slices that matter — region, segment, protected class where legally relevant. Headline AUC can look fine while the worst slice is unusable. We surface the gap during model selection and discuss the tradeoff openly — sometimes the cheaper, fairer model wins.
Above-threshold gaps trigger a re-weighting pass or a slice-specific model. We don't ship where the fairness story is uncomfortable; if the data won't support a fair model, that's a finding we surface in writing.
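The slice gate can be illustrated with a small pure-Python sketch: AUC computed per slice by pairwise comparison, then the spread between the best and worst slice. The pairwise AUC is quadratic and only suitable for illustration; the slice names and scores are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def slice_auc_gap(
    y_true: Sequence[int], scores: Sequence[float], slices: Sequence[str]
) -> Tuple[Dict[str, float], float]:
    """Per-slice AUC plus the gap between the best and worst slice."""
    labels: Dict[str, List[int]] = defaultdict(list)
    preds: Dict[str, List[float]] = defaultdict(list)
    for y, s, g in zip(y_true, scores, slices):
        labels[g].append(y)
        preds[g].append(s)
    per_slice = {g: auc(labels[g], preds[g]) for g in labels}
    return per_slice, max(per_slice.values()) - min(per_slice.values())


y = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.9, 0.3]
region = ["north"] * 4 + ["south"] * 4
per_slice, gap = slice_auc_gap(y, scores, region)
breaches_gate = gap > 0.05  # the 5-point threshold from the gate
```

The pooled AUC on these eight rows would look healthy; only the per-slice view shows that one region is served noticeably worse.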
Six capability families across six industries — where we've shipped.
A capability-by-industry heatgrid. Strength reflects what we've taken to production, not what we've explored. The light cells are honest — we won't claim depth we haven't built.
When ML is the answer — and when it isn't.
The most expensive failure mode here is shipping ML where a 20-line rule would have done the job. The second-most is the inverse — running a hand-tuned heuristic two years past the point a calibrated gradient boosting model would have doubled the outcome. The list below is the screen we run on every inbound.
- 01 You have labels and a baseline
Labelled outcome rows (5k+ for tabular, 500+ for vision-LLM-encoder, 50k+ for deep vision from scratch) and a non-ML rule already running. ML's value is whatever it adds on top of the baseline — without one, the lift number is just a vibes-check.
- 02 The decision is repetitive and high-volume
Pricing one row at a time at hundreds of millions per day. Ranking a catalogue at every page view. Scoring a payment in 40ms. ML amortises the engineering cost across the volume — a single decision a quarter doesn't justify the build.
- 03 The cost of wrong is measurable
Fraud losses, missed revenue, holding-cost overrun, false positives in a regulated process. ML works when you can put a number on the failure mode — that's how we tune the threshold and prioritise the slices.
- 04 You can run an A/B test
Offline NDCG or AUC is a noisy signal without an online check. If your environment doesn't support a controlled rollout, the engagement gets riskier — sometimes the right answer is to fix the experimentation pipeline first.
- 05 It's a generative or reasoning problem: wrong family
Free-form text out, multi-step reasoning, document understanding with no clean label per row. That's LLM territory, not classical ML. We'll route you to our LLM development services sibling and run that engagement instead.
- 06 The data isn't there yet: wrong stage
Cold-start product, no logs, no labels, no baseline. ML is the wrong investment; the right one is a heuristic plus a measurement plan that builds the dataset for an ML build six months out. We'll say so in the audit memo.
If the screen lands clean on four of six, custom machine learning models are usually the right shape. Two or fewer, we'll often recommend something else — sometimes our AI consulting shape, sometimes a heuristic-plus-measurement plan, sometimes nothing.
Eval-first, baseline-anchored, eight weeks to a calibrated model.
Metric and baseline land in week one — locked before training begins. Every model we ship is graded against the same frozen held-out set; nothing slides because the team got attached to a result. That's the difference between machine learning solutions that talk about quality and ones that measure it.
Problem framing
Predict what, for whom, against which baseline. The eval metric is locked in week one — AUC, NDCG, RMSE, calibration band. The baseline is the non-ML rule already running.
Data audit
Label quality, leakage paths, sample-selection bias, class balance, feature availability at inference time. Half the engagements we audit have a leak we have to close before training begins.
Baseline + features
scikit-learn baseline always. Then engineered features — aggregations, lags, target encodings, embeddings. The feature pipeline is part of the deliverable; nothing trains on features the production system can't compute.
Training + eval
Candidate models bench-raced against the baseline on the frozen held-out set. LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model. Hyperparameters via Optuna or a structured grid.
Calibration + slices
Brier, ECE, slice-AUC, calibration sidecar. Fairness review across the slices that matter. The model that ships isn't the one with the highest raw AUC — it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.
Deploy + monitor
Serving layer (FastAPI, BentoML, or the client's existing pattern), ONNX or pickle artefact, feature-pipeline runbook, drift monitoring on PSI, weekly calibration check. Handoff to the client's ML model development team or our MLOps sibling.
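To show what a calibration sidecar does, here is a minimal isotonic-regression sketch using the pool-adjacent-violators idea in plain Python. It is a simplified illustration (tied scores are not pooled, and there is no interpolation between steps), not the fitting routine of any particular library; function names and data are assumptions.

```python
import bisect
from typing import List, Sequence, Tuple


def fit_isotonic(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function from raw score to observed positive rate.

    Returns (upper score bound per block, calibrated value per block).
    """
    blocks: List[List[float]] = []  # each block: [label_sum, count, upper_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1.0, score])
        # Merge backwards while the previous block's mean exceeds the latest one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score: float, uppers: Sequence[float], values: Sequence[float]) -> float:
    """Look up the calibrated probability for a raw model score."""
    idx = bisect.bisect_left(uppers, score)
    return values[min(idx, len(values) - 1)]


raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 1, 1, 1]
uppers, values = fit_isotonic(raw_scores, outcomes)
calibrated = calibrate(0.25, uppers, values)
```

Because the sidecar is a separate mapping on top of the model's scores, it can be re-fitted on fresh production labels without retraining the model underneath.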
What the eight-week Production ML Build looks like.
The standard custom ML development build: a defined slice ships in eight weeks. Ranking-or-Recommendation adds 2 weeks for the two-stage A/B harness; the Pilot is a tighter 3–5 week cut.
| Phase | Deliverables | Exit gate |
|---|---|---|
| Problem framing | Locked metric, locked baseline, eval-set plan, data-access list | Metric + baseline sign-off |
| Data audit | Leakage map, label-quality report, feature-availability matrix | Audit findings reviewed |
| Baseline + features | scikit-learn baseline; engineered feature pipeline v1 | Baseline > non-ML rule |
| Training + eval | LightGBM / PyTorch / linear bench; held-out scores | Best model ≥ baseline + 3pts |
| Calibration + slices | Platt or isotonic sidecar; slice-AUC report | ECE < 4%, slice gap ≤ 5pts |
| Deploy + monitor | Serving layer, drift dashboard, runbook, retrain cadence | First 30d of clean traces |
Frameworks we've shipped on.
Pinned to what we have in production in 2026. The actual integrations under support — not a marketing list.
- XGBoost
- LightGBM
- CatBoost
- scikit-learn
- PyTorch
- Lightning
- timm
- ONNX Runtime
- Faiss
- ScaNN
- pgvector
- Prophet
- TimeGPT
- Evidently
- MLflow
- Weights & Biases
Where teams have shipped.
Three anonymized engagements. Function, segment, and outcome metric are real; brand removed under NDA.
Hierarchical demand forecast across thousands of SKU-region pairs
Typical shape: replace a Prophet-per-SKU pipeline with one LightGBM-on-lags model carrying hierarchical features (category, region, promo cadence). Training cost compresses materially; headline categories typically gain meaningful MAPE points against the prior baseline. Re-trained weekly on Modal. Pairs with the client's ERP for the planning loop.
Calibrated credit risk model with SHAP-led explanation
Typical shape: gradient boosting on application + bureau features, isotonic calibration sidecar, SHAP-based per-application reason codes. Slice-AUC reviewed across regional and demographic cuts before sign-off with the regulator-facing risk lead. Replaces drifted logistic scorecards. Live with quarterly retrain.
Two-stage ranking model on routing recommendations
Typical shape: two-stage, with Faiss candidate-gen on driver-route embeddings and a LightGBM ranker as the re-rank stage. NDCG@10 measured offline on a frozen log; CTR measured online in an A/B test. Replaces hand-tuned heuristics that have been the dispatcher-productivity bottleneck for years.
Why teams pick us as their machine learning development services partner.
- 01 Baseline-anchored or we don't ship
Every candidate model fights a scikit-learn baseline on the frozen held-out set. If it doesn't beat by the agreed margin, the engagement ends at the audit memo, not at a ship-it-anyway compromise. About one engagement in eight closes here, and the client gets a heuristic-plus-measurement plan instead.
- 02 Calibrated, not just AUC-clean
Headline AUC without calibration is a setup for downstream failures. We measure Brier, expected calibration error, and slice-AUC before the deploy — and instrument drift on every feature in production. The model that ships is the one that holds up at month six, not the one that wins on day one.
- 03 You own the artefact
The booster file, the ONNX export, the feature pipeline, the calibration sidecar, the eval harness, the runbook: yours, in your repo, under your license. We don't retain joint IP on customer-trained models. Put the operation on retainer if you want; the model is owned by you regardless.
Four ways to start.
Pilot
One problem, one model, one eval set.
- Problem framing + locked metric
- scikit-learn baseline
- Candidate model bench against baseline
- Held-out eval report + costed go-no-go memo
Not included:
- Production deployment
- Drift instrumentation
- Ongoing retraining (separate Build)
Production Build
Calibrated, drift-monitored, owned by you.
- All Pilot deliverables
- Feature pipeline in your repo
- Calibration sidecar
- Drift instrumentation + runbook
- Six weeks of post-launch iteration
Ranking-or-Recommendation
Two-stage candidate-gen + re-rank.
- Candidate-generation index (Faiss / ScaNN / pgvector)
- Cross-encoder re-ranker training
- Offline NDCG harness
- Online A/B test rig wired
ML Audit
Read of practice + costed roadmap.
- Data audit, eval-rigour audit, deployment-posture read
- Leakage and drift findings in writing
- Costed roadmap memo for the next 6–12 months
What buyers ask before signing.
When do we pick classical ML over an LLM?
The short answer: whenever the inputs are tabular rows and the output is a number, a class, or a ranked list. Gradient boosting on engineered features still beats every LLM-shaped solution we've benchmarked for those problems at a hundredth to a thousandth of the inference cost. The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate and roughly 100× cheaper. LLMs win when input is unstructured and output is generative or reasoning-shaped. The hybrid pattern (vision-LLM as feature extractor, classical model as predictor) is the third option, and it's where this practice meets our LLM development services.
How much labelled data do we actually need for a custom ML build?
Depends on the problem family and the model. For tabular gradient boosting on a business-prediction problem, 5,000–20,000 labelled rows is usually enough to beat a heuristic by a defensible margin; 100,000+ is where the model gets sharp. Deep vision from scratch typically needs 10,000–500,000 examples. The vision-LLM-as-encoder pattern collapses that to 500–5,000, the single biggest unlock in custom ML development since transformers landed. If you don't have labels and can't get them cheaply, we'll say so during the audit. Sometimes the answer is active learning, sometimes synthetic data, sometimes a heuristic-plus-measurement plan that gets you to ML in six months.
Who owns the model after delivery — Paiteq or the client?
The client. Every machine learning development services engagement ends with the model artefact (booster file, ONNX export, PyTorch checkpoint), the training code, the feature pipeline, the eval harness, and a runbook in your repo under your license. We don't retain joint IP on customer-trained models — the trained weights, the engineered features, the calibration sidecar are yours. What we retain is methodology: build templates, eval rubrics, audit playbooks. Operate the model on a retainer if you want; the artefact stays yours regardless.
How do you handle drift, calibration, and retraining cadence?
Three layers. Population Stability Index (PSI) per feature on a rolling 30-day window: above 0.2 we inspect, above 0.5 we pause automated decisioning if the use case requires it. Calibration check on a production sample every week: Brier and expected calibration error against the band agreed at launch. Scheduled retraining cadence, usually quarterly, pulled to monthly when drift is consistent. The runbook names the trigger conditions in writing; alerts route to a named on-call rota; the retrain procedure is a script the client's team can run without us. Drift detection is engineered, not a mental checklist.
How is this different from your MLOps service and your LLM service?
Three siblings, three different jobs. Machine learning development services here is about building the model — framing, data audit, features, training, eval, calibration. MLOps services is the infrastructure around the model after ship — serving, monitoring, retraining, feature stores. LLM development services is the equivalent practice for large language models — different family, different eval, different unit economics. An engagement can span more than one; we scope each shape separately so the deliverables are unambiguous.
Adjacent services.
Ship a calibrated model in 8 weeks.
Pilot in 3–5. Production Build in 8–14. Ranking / Recommendation in 6–10. ML Audit in 2–3.