P6 · Services

Machine learning development services and custom ML development: calibrated, drift-monitored, owned by you

Custom ML development from a machine learning development company that ships gradient boosting on tabular, PyTorch on vision, hierarchical forecasting at SKU scale, and two-stage ranking. Every model graded on calibration, slice-fairness, and drift, not just headline AUC.


Practice Custom ML development

Stack XGBoost · LightGBM · PyTorch · Faiss

Eval AUC · NDCG · Brier · PSI

Engagements 3–14 weeks · fixed scope

001 / WHEN IT WINS

Classical ML vs LLM, when each one wins.

The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate, more explainable, and roughly a hundred times cheaper. Custom ML development is still the right answer for most business-prediction problems: fraud, churn, credit, demand, ranking. The two families don't compete head-to-head; they win on different shapes of input and output.

	LLM-shaped problem	Machine-learning-shaped problem
What's predicted	Free-form text, tokens, generated content	A number, a class, a rank, a probability
Training data	Trillions of internet tokens (pre-trained)	Your labelled rows, 10k to 100M of them
Eval signal	Human-graded rubric, RAGAS, LLM-as-judge	AUC, RMSE, NDCG, calibration, lift
Classical ML wins on eval rigour. AUC, RMSE, and calibration curves are deterministic: run the same held-out set twice and you get the same number. LLM eval relies on human graders or an LLM-as-judge, both of which introduce variance and cost. For regulated decisions (credit, fraud, medical triage) that measurability is often a compliance requirement, not just a preference.
Latency floor	200–2,000ms per call (frontier hosted)	1–50ms per inference (XGBoost on CPU)
Unit economics	$/M tokens, scales with output length	Pennies per million inferences amortised
At production volume, the gap is 100×–1,000× per call. A LightGBM model serving 10M inferences/day on a single CPU node costs roughly $30–50/month in compute. The equivalent call volume through a frontier-hosted LLM costs $3,000–$15,000/month depending on output length. Classical ML doesn't win on capability; it wins on the unit economics of high-volume, low-complexity decisions.
Failure mode	Hallucinated facts, off-brand drift	Calibration drift, label leak, concept drift
Counter-intuitively, LLM failures are easier to catch. A hallucinated output is visible: a customer escalates, a reviewer flags it, or a monitor catches the token. Classical ML failure is often silent: the fraud model's held-out AUC still reads 0.87 while the population it scores has drifted away from the training data over six months. The model is still scoring, the dashboard is still green, and the business outcome is quietly degrading. That's why we instrument PSI and calibration checks from day one.
When it wins	Unstructured input, generation, reasoning	Tabular signal, ranking, forecasting, risk
This row is the one that gets teams in trouble. The categories aren't competing; they describe different input shapes and output contracts. The 2026 mistake is defaulting to an LLM because it's the obvious tool, then discovering in week six that a gradient-boosted model trained on your warehouse data is 3× more accurate and explainable on demand. We do the framing call before scoping so the wrong tool doesn't cost you a quarter.

Hybrid pattern: vision-LLM as feature extractor with a classical head. This is the third option, covered in 004 / MODELS below.

We won't sell you a classical ML build for a generative problem, and we won't sell you an LLM engagement for a problem a gradient boosting model would solve in week three. The framing call is free.

002 / SERVICES

Four engagement shapes, fixed-scope.

Pilot, Production Build, Ranking-or-Recommendation, or an ML Audit. Every machine learning development services engagement maps to one of the four. Mixed engagements bill as two consecutive shapes, not an open-ended retainer.

01 / PILOT ↗

ML Pilot

One problem, one model family, one eval set. Baseline against a non-ML rule, ship a demo in 3–5 weeks. Most clients start a machine learning development services engagement here before committing to a custom ML development build.

3–5 wks

02 / BUILD ↗

Production ML Build

Full pipeline: feature store, training, serving, monitoring. The bulk of our revenue as a machine learning development company. Includes six weeks of post-launch iteration on calibration and drift.

8–14 wks

03 / RANK-OR-RECOMMEND ↗

Ranking / Recommendation

Catalogue-scale ranking models or recommendation systems development with a candidate-generation + re-ranking architecture. Pairs with an offline NDCG harness and an online A/B test rig.

6–10 wks

04 / ADVISORY ↗

ML Audit & Roadmap

A read of your current ML model development practice: the data pipeline, the eval rigour, and the deployment posture. Deliverable is a costed memo, not a model. Often the gate before a full Production Build.

2–3 wks

003 / PATTERNS

Four ML problem families. We pick on data shape, latency, and explainability.

Tabular-classify, forecasting, rank-or-recommend, and deep-vision cover roughly 95% of our custom ML development work. Framework, eval signal, deployment posture, and unit economics differ per family. About 60% of engagements start tabular; a third are forecasting or ranking; the rest are deep or hybrid.

TABULAR-CLASSIFY

The biggest revenue line in classical ML in 2026: gradient boosting on tabular features still wins most business-prediction problems. XGBoost, LightGBM, or CatBoost as the model; Platt or isotonic calibration; SHAP for the explanation surface compliance always asks for. About 35% of our custom ML development engagements.

Pick when

Tabular features (counts, ratios, categoricals)
Sub-50ms inference budget on CPU
Need explanation per prediction (SHAP, regulator-readable)
Class imbalance manageable with weighting or SMOTE
50k+ labelled rows already in your warehouse

Skip when

Unstructured input (text, image, audio): neural nets or LLMs win
Very low data regimes under 5k rows: classical statistics may beat ML
Streaming features with sub-1ms latency: feature-engineering overhead kills it

Stack

XGBoost · LightGBM · CatBoost · SHAP

004 / MODELS

Six model families. We pick per data shape and per latency budget.

No house model. We benchmark gradient boosting, neural nets, statistical baselines, and the vision-LLM-as-encoder pattern against the same held-out set, and ship the model that beats baseline by the agreed margin without breaking calibration. Roughly six in ten predictive analytics services we ship end on LightGBM or XGBoost.

Gradient Boosting (XGBoost · LightGBM · CatBoost)

Strengths

The default winner on tabular data in 2026; it still beats deep learning on most business-prediction tasks. LightGBM for speed at scale, XGBoost for the broader SHAP/ONNX ecosystem, CatBoost for messy categoricals without one-hot encoding. Sub-millisecond inference on CPU. Platt and isotonic calibration are mature.

When We Pick

About 60% of predictive analytics services we ship lead with LightGBM or XGBoost. Default for churn, fraud, credit risk, conversion, hierarchical demand forecast, ranking re-rankers. Anywhere SHAP explainability is a compliance requirement.

When We Don't

Unstructured input. Sub-millisecond streaming features where loading the model is the bottleneck. Tasks under 2,000 rows where a logistic regression with strong features is more honest.

Paiteq Pattern

We hand off the booster file, an inference shim, a SHAP explainer, and a calibration sidecar: small enough to deploy on the existing service stack, with no GPU required.

Tabular · SHAP-ready · Calibrated
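To make the calibration sidecar concrete, here is a minimal, dependency-free sketch of isotonic calibration using the pool-adjacent-violators algorithm. The function names `fit_isotonic` and `calibrate` are hypothetical, and a production build would more likely use a library implementation; the point is only to show that the sidecar is a small monotone lookup fitted on held-out scores.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed rates.

    Returns (uppers, values): block i covers scores up to uppers[i] and
    maps them to the calibrated probability values[i].
    """
    # Pool identical scores first so ties share one calibrated value.
    pooled = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        pooled[s][0] += y
        pooled[s][1] += 1

    blocks = []  # each block: [sum_of_labels, count, upper_score]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score, uppers, values):
    """Look up the calibrated probability for one raw model score."""
    i = bisect_left(uppers, score)
    return values[min(i, len(values) - 1)]


# Illustrative held-out scores and outcomes (made-up numbers).
uppers, values = fit_isotonic(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [0, 1, 0, 1, 1, 1],
)
```

Because the fitted mapping is just two short lists, it can be serialised next to the booster file and re-fitted on fresh production samples without retraining the model itself.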

PyTorch + Lightning

Strengths

The 2026 default for neural-net work that isn't an LLM. PyTorch 2.x core, Lightning for training-loop scaffolding, timm for vision backbones. Inference via ONNX Runtime or TensorRT on GPU; quantised CPU via Intel Neural Compressor or AWS Neuron for edge.

When We Pick

Image, audio, signal, or long-sequence input the LLM doesn't handle economically. Custom embeddings for a recommender. Fine-tuning a vision backbone on customer data. Custom encoders for a downstream classical head.

When We Don't

Tabular workloads: gradient boosting beats you on accuracy and cost. Pure LLM workloads: that's our LLM development services sibling.

Paiteq Pattern

We've shipped PyTorch-based defect detection on a factory line, an OCR encoder for a custom document type, and the content-tower of a two-tower recommender. ONNX Runtime for inference, easier to deploy than a Python service.

Vision · Custom net · ONNX-export

scikit-learn + statsmodels

Strengths

The honest baseline that beats half the deep-learning press releases. Linear and logistic regression with proper feature engineering, regularised regression (ridge, lasso, elastic net). statsmodels for ARIMA, exponential smoothing, Holt-Winters. Cheap, explainable, calibration usually better out of the box than tree ensembles.

When We Pick

Datasets under 5,000 rows, constrained deployment (mobile, on-prem no-GPU), or regulators that care about explanation. Always as the baseline against which we compare LightGBM and PyTorch.

When We Don't

Large datasets where regularisation can't capture non-linear structure. Tasks where feature engineering takes longer than fitting LightGBM.

Paiteq Pattern

Every engagement starts with a scikit-learn baseline. About one in five ends up shipping the baseline as the production model, usually risk scoring where the regulator wins.

Baseline · Linear · Statsmodels

Time-series (Prophet · LightGBM-on-lags · TimeGPT)

Strengths

Three families cover most forecasting services work. Prophet is the explainable baseline with built-in seasonality. LightGBM-on-lags (engineered lag features fed into gradient boosting) is the workhorse and beats Prophet on accuracy at most scales. Nixtla's TimeGPT and Amazon Chronos are the foundation-model option when you have hundreds of series.

When We Pick

Demand forecasting at SKU scale, financial close, capacity planning, claims volume. LightGBM-on-lags 60% of the time; Prophet 25% for explainability; TimeGPT or Chronos 15% when series count justifies the cost.

When We Don't

Single short series under 200 observations: ARIMA wins. Pure anomaly detection: different family. High-frequency tick data: specialist stack we route out.

Paiteq Pattern

We shipped a hierarchical demand-forecast across 12,000 SKU-region pairs on LightGBM-on-lags that cut a retail client's holding-cost overrun by 18% in a quarter. Foundation-model forecasting was tested and lost on cost.

Forecasting · Seasonal · Hierarchical
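As an illustration of what "LightGBM-on-lags" means at the feature level, the sketch below turns one series into supervised rows using only values that precede the forecast point. The helper name `make_lag_rows`, the lag choices, and the toy series are assumptions for the example; a hierarchical build would add identifiers such as category or region as extra columns before handing the rows to a gradient-boosting library.

```python
def make_lag_rows(series, lags=(1, 7), horizon=1):
    """Build (features, target) rows from a single ordered series.

    At decision index t only series[:t] is visible, so every feature is
    computable at inference time. The target sits `horizon` steps ahead.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series) - horizon + 1):
        features = {f"lag_{k}": series[t - k] for k in lags}
        # Trailing mean over the last max_lag observed values.
        features[f"rolling_mean_{max_lag}"] = sum(series[t - max_lag:t]) / max_lag
        target = series[t + horizon - 1]
        rows.append((features, target))
    return rows


# Toy daily demand for one SKU-region pair (made-up numbers).
demand = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = make_lag_rows(demand, lags=(1, 3), horizon=1)
```

Keeping the slice `series[t - max_lag:t]` strictly before index `t` is the design choice that matters: it is what stops the target from leaking into its own features.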

Embeddings + ANN (Faiss · ScaNN · pgvector)

Strengths

The candidate-generation half of modern ranking and recommendation systems development. Embeddings from a custom two-tower, a pre-trained sentence encoder, or a vision encoder. ANN indexes (Faiss self-hosted, ScaNN on Google, pgvector for Postgres-native) pull 200–2,000 candidates under 10ms.

When We Pick

Catalogue-scale ranking and recommender systems. Semantic search where keyword search misses. User-similarity lookups. Anywhere brute-force scoring blows the latency budget.

When We Don't

Catalogues under a thousand items: exhaustive scoring is faster than building an ANN index. RAG retrieval: see our RAG development services.

Paiteq Pattern

Faiss self-hosted; pgvector when data already lives in Postgres and ops capacity is thin; ScaNN on Google Cloud. Two-tower trained in PyTorch, exported to ONNX, served behind the ANN index.

Two-tower · ANN · Candidate-gen
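For the small-catalogue case, where exhaustive scoring is the alternative to an ANN index, candidate generation can be sketched in a few lines of plain Python. The function name `top_k_exhaustive` and the toy embeddings are hypothetical; this is a brute-force cosine-similarity baseline, not a substitute for Faiss, ScaNN, or pgvector at catalogue scale.

```python
import heapq
import math


def normalise(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_exhaustive(query, catalogue, k=3):
    """Score every item by cosine similarity and return the k best ids."""
    q = normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(vec))), item_id)
        for item_id, vec in catalogue.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Toy two-dimensional item embeddings (made-up values).
catalogue = {"a": [1, 0], "b": [0, 1], "c": [1, 1], "d": [-1, 0]}
candidates = top_k_exhaustive([1, 0.1], catalogue, k=2)
```

The same function is a useful correctness check later: an ANN index should return nearly the same top candidates as this exhaustive pass on a sample of queries.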

Vision-LLM-as-encoder (GPT-5 · Claude · Gemini)

Strengths

A 2026 pattern that didn't exist three years ago: use a frontier vision-LLM as feature extractor on image or document input, then put a classical model on the embeddings. GPT-5 Vision and Claude Sonnet 4.6 both expose embedding endpoints; Gemini 3.0 Pro has the cleanest multimodal one. Cuts the labelled-data requirement by an order of magnitude.

When We Pick

Image classification with 500–5,000 labels instead of 50,000. OCR-grade document classification where structure carries the signal. Sentiment plus extraction on screenshots, charts, scanned forms.

When We Don't

Latency-tight on-device workloads: the vision-LLM call kills the budget. Regulated workloads where data can't leave the perimeter: fine-tune a self-hosted vision backbone instead.

Paiteq Pattern

We pair this with our generative AI practice regularly: vision-LLM upstream as encoder, classical model downstream as predictor. Halves the labelling spend on most image-classification builds.

Multimodal · Few-shot · Vision-LLM

005 / PIPELINE

The ML model development pipeline we ship: four phases, eval-gated.

The model is roughly a fifth of a real ML model development engagement. The other four-fifths is the data audit, feature pipeline, calibration layer, and drift instrumentation. Skipping any of them is how machine learning solutions silently fail at month six: the model that beat baseline on day one is now mis-calibrated on a shifted population.

01
Data audit + leakage map

Label quality, sample-selection bias, leakage paths, feature availability at inference time. Roughly half the engagements we audit have a leak we close before training, usually a feature only computable post-event. We document every join and time-of-availability per feature; the production system can only train on features it can compute at decision time.
02
Baseline + feature engineering

scikit-learn baseline always: logistic or linear with regularisation, fit on a starter feature set. The candidate has to beat the baseline by the agreed margin or we don't ship it. Feature engineering is layered on afterwards: aggregations, lags, target encodings, learned embeddings. The pipeline ships with the model as a single deliverable.
03
Bench + calibration

LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model bench-raced on the frozen held-out set. Hyperparameters via Optuna or a structured grid, not random; the search log is kept. Calibration fitted as a Platt or isotonic sidecar. The model that ships isn't the one with the highest raw AUC; it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.
04
Deploy + drift instrumentation

FastAPI or BentoML serving on the runtime appropriate to latency budget; ONNX or pickle for the artefact. Population Stability Index (PSI) per feature on a 30-day rolling window; alerting at 0.2, pause-decisioning at 0.5. Weekly calibration check on production data. The runbook names trigger conditions, on-call rota, and retrain procedure in writing; drift detection is not a mental checklist on a single engineer's laptop.

Retrain cadence ships in the runbook: quarterly by default, monthly when drift is consistent. The client's team runs the retrain script after handoff; we operate it on retainer only when asked.
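The per-feature PSI check in phase 04 can be sketched as follows. This is a minimal stdlib version under stated assumptions: bins are training-sample deciles, empty bins are floored at a small epsilon, and the names `psi` and `drift_action` are hypothetical. The 0.2 and 0.5 thresholds are the ones quoted above.

```python
import math
from bisect import bisect_right


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training and a production sample.

    Bin edges are quantiles of the training sample; each bin contributes
    (actual_share - expected_share) * ln(actual_share / expected_share).
    """
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_action(value, inspect_at=0.2, pause_at=0.5):
    """Map a PSI value to the runbook action (thresholds from the text)."""
    if value >= pause_at:
        return "pause"
    if value >= inspect_at:
        return "inspect"
    return "ok"


# Made-up samples: production is the training feature shifted by +0.5.
train = [i / 100 for i in range(100)]
prod = [x + 0.5 for x in train]
action = drift_action(psi(train, prod))
```

Running the check per feature on a rolling window and logging the returned action is what turns the thresholds into something an on-call engineer can act on.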

006 / EVAL

Four gates. Every model. Every week.

Custom machine learning models without these four gates drift silently. Headline AUC stays clean while the business outcome degrades. Every model we hand off carries the gates as a contract, including the trigger conditions that fire a retrain.

01 Held-out AUC / RMSE / NDCG

Beats baseline by ≥3pts

Frozen test set carved out at engagement start. AUC for binary classification, RMSE or MAPE for regression and forecasting, NDCG@k for ranking. Baseline is whichever non-ML rule the client already runs. If the candidate can't beat baseline by three points on the headline metric, we don't ship. Roughly one engagement in eight ends up shipping the baseline because the ML model couldn't clear the bar.

If the candidate doesn't beat baseline, we don't paper over it: we re-frame, harvest more labels, or close at the Pilot gate and bill against the audit memo. Confident-but-wrong ML is worse than no ML.
02 Calibration error

Brier < 0.18, ECE < 4%

Calibration matters more than raw AUC in most business-prediction problems: a churn score that's well-ordered but mis-calibrated breaks every downstream business rule. We measure Brier plus ECE on the held-out set, fit a Platt or isotonic sidecar where needed, and re-measure. Re-checked weekly on production data.

If calibration drifts above the threshold for two weeks, we re-fit the sidecar before re-fitting the model. Most drift comes from population shift, not model decay, so the calibration layer is the right place to absorb it.
03 Concept-drift detection

PSI < 0.2 per feature

Population Stability Index (PSI) per feature on a rolling 30-day window. Above 0.2 the feature has drifted enough to inspect; above 0.5 the model is outside the training envelope. Paired with output-distribution monitoring. Tracked in Evidently or a Postgres dashboard, depending on the client's MLOps capacity.

Any feature breaching 0.5 PSI fires a Slack alert and pauses automated decisioning if the use case warrants it (fraud, credit, healthcare). Retraining usually quarterly, monthly when drift is consistent. Trigger conditions documented in the runbook.
04 Fairness · slice metrics

≤5pt AUC gap across slices

AUC and calibration measured separately on the slices that matter: region, segment, protected class where legally relevant. Headline AUC can look fine while the worst slice is unusable. We surface the gap during model selection and discuss the tradeoff openly; sometimes the cheaper, fairer model wins.

Above-threshold gaps trigger a re-weighting pass or a slice-specific model. We don't ship where the fairness story is uncomfortable; if the data won't support a fair model, that's a finding we surface in writing.
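Gate 02 can be expressed as a small check. The sketch below computes the Brier score and expected calibration error (ECE) with equal-width probability bins, then compares them with the thresholds quoted above (Brier < 0.18, ECE < 4%). The function names and the ten-bin choice are assumptions for illustration.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean confidence and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - observed)
    return ece


def passes_calibration_gate(probs, labels, max_brier=0.18, max_ece=0.04):
    """True when both calibration metrics sit under the agreed thresholds."""
    return (brier_score(probs, labels) < max_brier
            and expected_calibration_error(probs, labels) < max_ece)


# Made-up example: an over-confident model that says 0.9 for every row.
overconfident_ok = passes_calibration_gate([0.9] * 4, [0, 0, 0, 1])
```

The same function can run on the held-out set at sign-off and on a weekly production sample afterwards, so the gate and the monitor share one definition.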

007 / CAPABILITIES

Six capability families across six industries, where we've shipped.

A capability-by-industry heatgrid. Strength reflects what we've taken to production, not what we've explored. The light cells are honest; we won't claim depth we haven't built.

[Heatgrid. Rows (function): Risk · Fraud · Credit; Churn · LTV · Conversion; Demand · Supply Forecast; Ranking · Search; Recommendation; Vision · Sensor · OCR. Columns (industry): B2B SaaS, Fintech, E-commerce, Healthcare, Manufacturing, Logistics.]

Possible fit Good fit Primary vertical

Dark cells: shipped at production scale. Medium: shipped in pilot. Light: experimented but not yet production. Empty cells are real.

008 / WHEN

When ML is the answer, and when it isn't.

The most expensive failure mode here is shipping ML where a 20-line rule would have done the job. The second-most expensive is the inverse: running a hand-tuned heuristic two years past the point where a calibrated gradient boosting model would have doubled the outcome. The list below is the screen we run on every inbound.

01
You have labels and a baseline

Labelled outcome rows (5k+ for tabular, 500+ for vision-LLM-encoder, 50k+ for deep vision from scratch) and a non-ML rule already running. ML's value is whatever it adds on top of the baseline; without one, the lift number is just a vibes-check.
02
The decision is repetitive and high-volume

Pricing one row at a time at hundreds of millions per day. Ranking a catalogue at every page view. Scoring a payment in 40ms. ML amortises the engineering cost across the volume; a single decision a quarter doesn't justify the build.
03
The cost of wrong is measurable

Fraud losses, missed revenue, holding-cost overrun, false positives in a regulated process. ML works when you can put a number on the failure mode; that's how we tune the threshold and prioritise the slices.
04
You can run an A/B test

Offline NDCG or AUC is a noisy signal without an online check. If your environment doesn't support a controlled rollout, the engagement gets riskier; sometimes the right answer is to fix the experimentation pipeline first.
05
It's a generative or reasoning problem: wrong family

Free-form text out, multi-step reasoning, document understanding with no clean label per row. That's LLM territory, not classical ML. We'll route you to our LLM development services sibling and run that engagement instead.
06
The data isn't there yet: wrong stage

Cold-start product, no logs, no labels, no baseline. ML is the wrong investment; the right one is a heuristic plus a measurement plan that builds the dataset for an ML build six months out. We'll say so in the audit memo.

If the screen lands clean on four of six, custom machine learning models are usually the right shape. With two or fewer, we'll often recommend something else: sometimes our AI consulting shape, sometimes a heuristic-plus-measurement plan, sometimes nothing.

009 / PROCESS

Eval-first, baseline-anchored, eight weeks to a calibrated model.

Metric and baseline land in week one, locked before training begins. Every model we ship is graded against the same frozen held-out set; nothing slides because the team got attached to a result. That's the difference between machine learning solutions that talk about quality and ones that measure it.

WEEK 1

Problem framing

Predict what, for whom, against which baseline. The eval metric is locked in week one: AUC, NDCG, RMSE, calibration band. The baseline is the non-ML rule already running.

WEEK 1–2

Data audit

Label quality, leakage paths, sample-selection bias, class balance, feature availability at inference time. Half the engagements we audit have a leak we have to close before training begins.

WEEK 2–4

Baseline + features

scikit-learn baseline always. Then engineered features: aggregations, lags, target encodings, embeddings. The feature pipeline is part of the deliverable; nothing trains on features the production system can't compute.

WEEK 4–6

Training + eval

Candidate models bench-raced against the baseline on the frozen held-out set. LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model. Hyperparameters via Optuna or a structured grid.

WEEK 6–8

Calibration + slices

Brier, ECE, slice-AUC, calibration sidecar. Fairness review across the slices that matter. The model that ships isn't the one with the highest raw AUC; it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.

WEEK 8+

Deploy + monitor

Serving layer (FastAPI, BentoML, or the client's existing pattern), ONNX or pickle artefact, feature-pipeline runbook, drift monitoring on PSI, weekly calibration check. Handoff to the client's ML model development team or our MLOps sibling.
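The data-audit step's rule that nothing trains on features the production system can't compute at decision time can be sketched as a feature-availability matrix. The feature names, the delays, and the helpers `leakage_map` and `trainable_features` are hypothetical; the idea is simply to record when each feature becomes available relative to the decision and to flag anything that arrives afterwards.

```python
from datetime import timedelta

# Hypothetical availability matrix for a fraud model: delay of each feature
# relative to the moment the decision is made (zero or negative = usable).
FEATURE_AVAILABILITY = {
    "txn_amount": timedelta(0),
    "merchant_txn_count_30d": timedelta(hours=-1),
    "chargeback_filed": timedelta(days=30),  # only known after the event
}


def leakage_map(availability, tolerance=timedelta(0)):
    """Return the features that only exist after the decision point."""
    return {
        name: delay
        for name, delay in availability.items()
        if delay > tolerance
    }


def trainable_features(availability):
    """Features that are safe to train on, in their original order."""
    leaks = leakage_map(availability)
    return [name for name in availability if name not in leaks]


leaks = leakage_map(FEATURE_AVAILABILITY)
safe = trainable_features(FEATURE_AVAILABILITY)
```

Keeping this matrix in the repository next to the feature pipeline means a new feature cannot be added without someone stating when it becomes available.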

010 / TIMELINE

What the eight-week Production ML Build looks like.

The standard custom ML development build: a defined slice ships in eight weeks. Ranking-or-Recommendation adds 2 weeks for the two-stage A/B harness; the Pilot is a tighter 3–5 week cut.

6 phases

WEEK 1 Problem framing

Locked metric, locked baseline, eval-set plan, data-access list

Metric + baseline sign-off

WEEK 2 Data audit

Leakage map, label-quality report, feature-availability matrix

Audit findings reviewed

WEEK 3–4 Baseline + features

scikit-learn baseline; engineered feature pipeline v1

Baseline > non-ML rule

WEEK 4–6 Model bench

LightGBM / PyTorch / linear bench; held-out scores

Best model ≥ baseline + 3pts

WEEK 6–8 Calibration + fairness

Platt or isotonic sidecar; slice-AUC report

ECE < 4%, slice gap ≤ 5pts

WEEK 8+ Deploy + monitor

Serving layer, drift dashboard, runbook, retrain cadence

First 30d of clean traces
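The "slice gap ≤ 5pts" exit criterion for weeks 6–8 can be sketched as below. Two assumptions are made for illustration: the gap is the difference between the best and worst slice AUC, and one point equals 0.01 AUC. The pairwise AUC is quadratic in sample size, which is fine for a sketch but not for large evaluation sets.

```python
def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def slice_auc_gap(scores, labels, slices):
    """Return per-slice AUC and the best-to-worst gap in AUC points."""
    grouped = {}
    for s, y, g in zip(scores, labels, slices):
        slice_scores, slice_labels = grouped.setdefault(g, ([], []))
        slice_scores.append(s)
        slice_labels.append(y)

    aucs = {g: auc(sc, lb) for g, (sc, lb) in grouped.items()}
    gap_points = (max(aucs.values()) - min(aucs.values())) * 100
    return aucs, gap_points


# Made-up scores for two regional slices.
aucs, gap = slice_auc_gap(
    scores=[0.9, 0.8, 0.2, 0.1, 0.9, 0.3, 0.6, 0.1],
    labels=[1, 1, 0, 0, 1, 1, 0, 0],
    slices=["EU"] * 4 + ["US"] * 4,
)
within_gate = gap <= 5
```

Reporting the per-slice dictionary alongside the single gap number keeps the worst slice visible during model selection rather than hidden behind the headline metric.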

011 / STACK

Frameworks we've shipped on.

Pinned to what we have in production in 2026. The actual integrations under support, not a marketing list.

XGBoost
LightGBM
CatBoost
scikit-learn
PyTorch
Lightning
timm
ONNX Runtime
Faiss
ScaNN
pgvector
Prophet
TimeGPT
Evidently
MLflow
Weights & Biases

012 / USE CASES

Where teams have shipped.

Three anonymized engagements. Function, segment, and outcome metric are real; brand removed under NDA.

E-commerce

DTC retail · catalogue-scale

Hierarchical demand forecast across thousands of SKU-region pairs

Typical shape: replace a Prophet-per-SKU pipeline with one LightGBM-on-lags model carrying hierarchical features (category, region, promo cadence). Training cost compresses materially; headline categories typically gain meaningful MAPE points against the prior baseline. Re-trained weekly on Modal. Pairs with the client's ERP for the planning loop.

Deliverable: hierarchical model + weekly retrain + planner-facing API

Fintech

Regulated lending · EU

Calibrated credit risk model with SHAP-led explanation

Typical shape: gradient boosting on application + bureau features, isotonic calibration sidecar, SHAP-based per-application reason codes. Slice-AUC reviewed across regional and demographic cuts before sign-off with the regulator-facing risk lead. Replaces drifted logistic scorecards. Live with quarterly retrain.

Deliverable: calibrated model + reason-code service + slice-fairness register

Logistics

Last-mile · enterprise routing

Cross-encoder ranking model on routing recommendations

Typical shape: two-stage. Faiss candidate-gen on driver-route embeddings, LightGBM ranker as cross-encoder. NDCG@10 measured offline on a frozen log; CTR measured online in an A/B test. Replaces hand-tuned heuristics that have been the dispatcher-productivity bottleneck for years.

Deliverable: ranker + retrieval index + offline + online eval harness
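The offline half of the ranking eval harness rests on NDCG@k. A minimal sketch follows, assuming linear gain (the relevance grade itself rather than 2^rel − 1) and an ideal ordering computed from the same list of logged relevances; both are choices an actual harness would need to fix explicitly. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank 0 is discounted by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items logged for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Relevance grades in the order the ranker returned them (made-up values).
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1], k=2)
```

Averaging `ndcg_at_k` over every query in the frozen log gives the single offline number that the online A/B test is then expected to corroborate.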

013 / WHY PAITEQ

Why teams pick us as their machine learning development services partner.

01
Baseline-anchored or we don't ship

Every candidate model fights a scikit-learn baseline on the frozen held-out set. If it doesn't beat by the agreed margin, the engagement ends at the audit memo, not at a ship-it-anyway compromise. About one engagement in eight closes here, and the client gets a heuristic-plus-measurement plan instead.
02
Calibrated, not just AUC-clean

Headline AUC without calibration is a setup for downstream failures. We measure Brier, expected calibration error, and slice-AUC before the deploy, and instrument drift on every feature in production. The model that ships is the one that holds up at month six, not the one that wins on day one.
03
You own the artefact

The booster file, the ONNX export, the feature pipeline, the calibration sidecar, the eval harness, the runbook: all yours, in your repo, under your license. We don't retain joint IP on customer-trained models. Put the operation on retainer if you want; the model is owned by you regardless.

014 / ENGAGE

Four ways to start.

01 ML Pilot Fixed scope

3–5 weeks

One problem, one model, one eval set.

In scope

Problem framing + locked metric
scikit-learn baseline
Candidate model bench against baseline
Held-out eval report + costed go-no-go memo

Out of scope

Production deployment
Drift instrumentation
Ongoing retraining (separate Build)

02 Production ML Build Fixed scope

8–14 weeks

Calibrated, drift-monitored, owned by you.

In scope

All Pilot deliverables
Feature pipeline in your repo
Calibration sidecar
Drift instrumentation + runbook
Six weeks of post-launch iteration

03 Ranking / Recommendation Fixed scope

6–10 weeks

Two-stage candidate-gen + re-rank.

In scope

Candidate-generation index (Faiss / ScaNN / pgvector)
Cross-encoder re-ranker training
Offline NDCG harness
Online A/B test rig wired

04 ML Audit & Roadmap Fixed scope

2–3 weeks

Read of practice + costed roadmap.

In scope

Data audit, eval-rigour audit, deployment-posture read
Leakage and drift findings in writing
Costed roadmap memo for the next 6–12 months

015 / FAQ

What buyers ask before signing.

When do we pick classical ML over an LLM?

The short answer: whenever the inputs are tabular rows and the output is a number, a class, or a ranked list. Gradient boosting on engineered features still beats every LLM-shaped solution we've benchmarked for those problems at roughly a hundredth to a thousandth of the inference cost. The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate and roughly 100× cheaper. LLMs win when input is unstructured and output is generative or reasoning-shaped. The hybrid pattern (vision-LLM as feature extractor, classical model as predictor) is the third option, and it's where this practice meets our LLM development services.

How much labelled data do we actually need for a custom ML build?

Depends on the problem family and the model. For tabular gradient boosting on a business-prediction problem, 5,000–20,000 labelled rows is usually enough to beat a heuristic by a defensible margin; 100,000+ is where the model gets sharp. Deep vision from scratch typically needs 10,000–500,000 examples. The vision-LLM-as-encoder pattern collapses that to 500–5,000, which is the single biggest unlock in custom ML development since transformers landed. If you don't have labels and can't get them cheaply, we'll say so during the audit. Sometimes the answer is active learning, sometimes synthetic data, sometimes a heuristic-plus-measurement plan that gets you to ML in six months.

Who owns the model after delivery, Paiteq or the client?

The client. Every machine learning development services engagement ends with the model artefact (booster file, ONNX export, PyTorch checkpoint), the training code, the feature pipeline, the eval harness, and a runbook in your repo under your license. We don't retain joint IP on customer-trained models: the trained weights, the engineered features, and the calibration sidecar are yours. What we retain is methodology: build templates, eval rubrics, audit playbooks. Operate the model on a retainer if you want; the artefact stays yours regardless.

How do you handle drift, calibration, and retraining cadence?

Three layers. Population Stability Index (PSI) per feature on a rolling 30-day window: at 0.2 we inspect; at 0.5 we pause automated decisioning if the use case requires it. Calibration check on a production sample every week: Brier and expected calibration error against the band agreed at launch. Scheduled retraining cadence, usually quarterly, pulled to monthly when drift is consistent. The runbook names the trigger conditions in writing; alerts route to a named on-call rota; the retrain procedure is a script the client's team can run without us. Drift detection is engineered, not a mental checklist.

How is this different from your MLOps service and your LLM service?

Three siblings, three different jobs. Machine learning development services here is about building the model: framing, data audit, features, training, eval, calibration. MLOps services is the infrastructure around the model after ship: serving, monitoring, retraining, feature stores. LLM development services is the equivalent practice for large language models: different family, different eval, different unit economics. An engagement can span more than one; we scope each shape separately so the deliverables are unambiguous.

Where custom ML connects.

The two industries where classical ML still beats LLM-only solutions for production-grade decisions are custom AI insurance development (underwriting, calibrated probabilities, regulator-defensible feature importance) and AI for fintech fraud and credit ML (latency budgets that rule out 8-billion-parameter inference). Both are heavy-ML, light-LLM territories. The wider context of where this practice sits in the Paiteq engineering bench covers the cross-discipline team that delivers it; the full AI development services menu shows adjacent practices.

When the workload sits next to an LLM surface, the right routes are custom LLM application development, hybrid retrieval and reranking pipelines (where the reranker is often a fine-tuned cross-encoder, not an LLM), and tool-calling autonomous agent builds that wrap the ML model as one tool among many. The operations layer for all of this lives in MLOps and LLMOps, which share the same monitoring spine.

016 / Related practices