
Machine learning development services + custom ML development — calibrated, drift-monitored, owned by you

Custom ML development from a machine learning development company that ships gradient boosting on tabular, PyTorch on vision, hierarchical forecasting at SKU scale, and two-stage ranking. Every model graded on calibration, slice-fairness, and drift — not just headline AUC.

Practice: Custom ML development
Stack: XGBoost · LightGBM · PyTorch · Faiss
Eval: AUC · NDCG · Brier · PSI
Engagements: 3–14 weeks · fixed scope
001 / WHEN IT WINS

Classical ML vs LLM — when each one wins.

The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate, more explainable, and roughly a hundred times cheaper. Custom ML development is still the right answer for most business-prediction problems — fraud, churn, credit, demand, ranking. The two families don't compete head-to-head; they win on different shapes of input and output.

LLM-shaped problem
  • What's predicted: Free-form text, tokens, generated content
  • Training data: Trillions of internet tokens (pre-trained)
  • Latency floor: 200–2,000ms per call (frontier hosted)
Machine-learning-shaped problem
  • What's predicted: A number, a class, a rank, a probability
  • Training data: Your labelled rows, 10k to 100M of them
  • Latency floor: 1–50ms per inference (XGBoost on CPU)
Hybrid pattern: vision-LLM as feature extractor with a classical head — the third option, covered in §4 below.

We won't sell you a classical ML build for a generative problem, and we won't sell you an LLM engagement for a problem a gradient boosting model would solve in week three. The framing call is free.

002 / SERVICES

Four engagement shapes, fixed-scope.

Pilot, Production Build, Ranking-or-Recommendation, or an ML Audit. Every machine learning development services engagement maps to one of the four. Mixed engagements bill as two consecutive shapes, not an open-ended retainer.

003 / PATTERNS

Four ML problem families. We pick on data shape, latency, and explainability.

Tabular-classify, forecasting, rank-or-recommend, and deep-vision cover roughly 95% of our custom ML development work. Framework, eval signal, deployment posture, and unit economics differ per family. About 60% of engagements start tabular; a third are forecasting or ranking; the rest are deep or hybrid.

01

TABULAR-CLASSIFY

The biggest revenue line in classical ML in 2026: gradient boosting on tabular features still wins most business-prediction problems. XGBoost, LightGBM, or CatBoost as the model; Platt or isotonic calibration; SHAP for the explanation surface compliance always asks for. About 35% of our custom ML development engagements.

Pick when
  • Tabular features (counts, ratios, categoricals)
  • Sub-50ms inference budget on CPU
  • Need explanation per prediction (SHAP, regulator-readable)
  • Class imbalance manageable with weighting or SMOTE
  • 50k+ labelled rows already in your warehouse
Skip when
  • Unstructured input (text, image, audio) — neural nets or LLMs win
  • Very low data regimes under 5k rows — classical statistics may beat ML
  • Streaming features with sub-1ms latency — feature-engineering overhead kills it
Stack
XGBoost · LightGBM · CatBoost · SHAP
004 / MODELS

Six model families. We pick per data shape and per latency budget.

No house model. We benchmark gradient boosting, neural nets, statistical baselines, and the vision-LLM-as-encoder pattern against the same held-out set, and ship the model that beats baseline by the agreed margin without breaking calibration. Roughly six in ten predictive analytics services we ship end on LightGBM or XGBoost.

Gradient Boosting (XGBoost · LightGBM · CatBoost)

The default winner on tabular data in 2026 — still beats deep learning on most business-prediction tasks. LightGBM for speed at scale, XGBoost for the broader SHAP/ONNX ecosystem, CatBoost for messy categoricals without one-hot encoding. Sub-millisecond inference on CPU. Platt and isotonic calibration mature.

Pick when: churn, fraud, credit risk, conversion, hierarchical demand forecasting, or ranking re-rankers, and anywhere SHAP explainability is a compliance requirement. About 60% of the predictive analytics services we ship lead with LightGBM or XGBoost.

Skip when: the input is unstructured; streaming features need sub-millisecond responses and loading the model is the bottleneck; or the task has under 2,000 rows, where a logistic regression with strong features is more honest.

We hand off the booster file, an inference shim, a SHAP explainer, and a calibration sidecar — small enough to deploy on the existing service stack, no GPU required.

Tabular · SHAP-ready · Calibrated
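The calibration sidecar mentioned above can be pictured as a small monotone lookup fitted on held-out scores. What follows is a minimal standard-library sketch of isotonic calibration via pool-adjacent-violators; the names `fit_isotonic` and `calibrate` and the toy data are assumptions for illustration, not the shipped sidecar, and a production build would more likely use a library implementation fitted on a dedicated calibration fold.

```python
def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: non-decreasing map from raw score to probability."""
    # Each block holds [sum of labels, row count, highest score in the block].
    blocks = []
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while an earlier block has a higher mean than a later one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(score, steps):
    """Step lookup: the first block whose upper score bound covers `score`."""
    for upper, prob in steps:
        if score <= upper:
            return prob
    return steps[-1][1]  # scores above every block take the top probability


# Toy held-out data (hypothetical): raw model scores and observed outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
steps = fit_isotonic(raw_scores, outcomes)
print(steps)
print(calibrate(0.35, steps))
```

The fitted steps are small enough to serialise next to the booster file, which is the sense in which the calibration layer can be re-fitted without touching the model.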
PyTorch + Lightning

The 2026 default for neural-net work that isn't an LLM. PyTorch 2.x core, Lightning for training-loop scaffolding, timm for vision backbones. Inference via ONNX Runtime or TensorRT on GPU; quantised CPU inference via Intel Neural Compressor; AWS Neuron when the target is Inferentia.

Pick when: image, audio, signal, or long-sequence input that an LLM doesn't handle economically; custom embeddings for a recommender; fine-tuning a vision backbone on customer data; custom encoders for a downstream classical head.

Skip when: the workload is tabular (gradient boosting wins on accuracy and cost) or purely LLM-shaped (that belongs to our sibling practice, our LLM development services at /services/llm-development/).

We've shipped PyTorch-based defect detection on a factory line, an OCR encoder for a custom document type, and the content-tower of a two-tower recommender. ONNX Runtime for inference — easier to deploy than a Python service.

Vision · Custom net · ONNX-export
scikit-learn + statsmodels

The honest baseline that beats half the deep-learning press releases. Linear and logistic regression with proper feature engineering, regularised regression (ridge, lasso, elastic net). statsmodels for ARIMA, exponential smoothing, Holt-Winters. Cheap, explainable, calibration usually better out of the box than tree ensembles.

Pick when: datasets under 5,000 rows, constrained deployment (mobile, on-prem without GPU), or regulators that care about explanation. Always run as the baseline against which we compare LightGBM and PyTorch.

Skip when: the dataset is large and a regularised linear model can't capture the non-linear structure, or feature engineering would take longer than fitting LightGBM.

Every engagement starts with a scikit-learn baseline. About one in five ends up shipping the baseline as the production model — usually risk scoring where the regulator wins.

Baseline · Linear · Statsmodels
Time-series — Prophet · LightGBM-on-lags · TimeGPT

Three families cover most forecasting services work. Prophet is the explainable baseline with built-in seasonality. LightGBM-on-lags (engineered lag features into gradient boosting) is the workhorse — beats Prophet on accuracy at most scales. Nixtla's TimeGPT and Amazon Chronos are the foundation-model option when you have hundreds of series.

Pick when: demand forecasting at SKU scale, financial close, capacity planning, claims volume. We ship LightGBM-on-lags about 60% of the time, Prophet 25% for explainability, and TimeGPT or Chronos 15% when the series count justifies the cost.

Skip when: a single short series has under 200 observations (ARIMA wins), the task is pure anomaly detection (a different family), or the data is high-frequency tick data (a specialist stack we route out).

We shipped a hierarchical demand-forecast across 12,000 SKU-region pairs on LightGBM-on-lags that cut a retail client's holding-cost overrun by 18% in a quarter. Foundation-model forecasting was tested and lost on cost.

Forecasting · Seasonal · Hierarchical
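To make "LightGBM-on-lags" concrete: the idea is to turn each series into supervised rows whose features are past values, then hand those rows to a gradient-boosting regressor. A minimal sketch of the feature step only, using the standard library; the function name `lag_features`, the chosen lags, and the toy demand series are assumptions for illustration.

```python
def lag_features(series, lags=(1, 7), window=7):
    """Turn one demand series into (features, target) rows for a tree model."""
    rows = []
    start = max(max(lags), window)  # first index with a full history behind it
    for t in range(start, len(series)):
        feats = {f"lag_{k}": series[t - k] for k in lags}
        # The rolling mean uses only values strictly before t, so the target
        # never leaks into its own features.
        feats[f"roll_mean_{window}"] = sum(series[t - window:t]) / window
        rows.append((feats, series[t]))
    return rows


# Two weeks of hypothetical daily demand for one SKU-region pair.
demand = [12, 15, 14, 10, 18, 25, 30, 13, 16, 15, 11, 19, 27, 31]
rows = lag_features(demand)
print(len(rows), rows[0])
```

In a hierarchical setup, the same rows would typically also carry category, region, and promo columns so that one model can serve every series.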
Embeddings + ANN (Faiss · ScaNN · pgvector)

The candidate-generation half of modern ranking and recommendation systems development. Embeddings from a custom two-tower, a pre-trained sentence encoder, or a vision encoder. ANN indexes (Faiss self-hosted, ScaNN on Google, pgvector for Postgres-native) pull 200–2,000 candidates under 10ms.

Pick when: catalogue-scale ranking and recommender systems; semantic search where keyword search misses; user-similarity lookups; anywhere brute-force scoring blows the latency budget.

Skip when: the catalogue has under a thousand items (exhaustive scoring is faster than building an ANN index) or the job is RAG retrieval (see our RAG development services at /services/rag-development/).

Faiss self-hosted; pgvector when data already lives in Postgres and ops capacity is thin; ScaNN on Google Cloud. Two-tower trained in PyTorch, exported to ONNX, served behind the ANN index.

Two-tower · ANN · Candidate-gen
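The two-stage shape described above (cheap candidate generation, then a more expensive re-rank over a short list) can be sketched without any index library. This sketch swaps in exhaustive cosine scoring for the ANN index, which is only reasonable for small catalogues; the item vectors, the `ranker_scores` table, and the function names are hypothetical stand-ins for a real embedding model and ranker.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidates(query_vec, item_vecs, k):
    """Stage 1: top-k item ids by cosine similarity (exhaustive, not ANN)."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: cosine(query_vec, item_vecs[item_id])
    )


def rerank(candidate_ids, score_fn):
    """Stage 2: order the short list with a more expensive scorer."""
    return sorted(candidate_ids, key=score_fn, reverse=True)


# Hypothetical 2-d embeddings and second-stage scores.
items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
ranker_scores = {"a": 0.2, "b": 0.9, "d": 0.5}

shortlist = candidates([1.0, 0.0], items, k=3)
final = rerank(shortlist, ranker_scores.get)
print(shortlist, final)
```

Swapping stage 1 for an ANN index changes the latency profile but not the contract: it still returns a short list of ids for stage 2 to order.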
Vision-LLM-as-encoder (GPT-5 · Claude · Gemini)

A 2026 pattern that didn't exist three years ago — use a frontier vision-LLM as feature extractor on image or document input, then put a classical model on the embeddings. GPT-5 Vision and Claude Sonnet 4.6 both expose embedding endpoints; Gemini 3.0 Pro has the cleanest multimodal one. Cuts the labelled-data requirement by an order of magnitude.

Pick when: image classification with 500–5,000 labels instead of 50,000; OCR-grade document classification where structure carries the signal; sentiment plus extraction on screenshots, charts, and scanned forms.

Skip when: the workload is latency-tight and on-device (the vision-LLM call kills the budget) or regulated so that data can't leave the perimeter (fine-tune a self-hosted vision backbone instead).

We pair this with our generative AI practice (/services/generative-ai/) regularly: vision-LLM upstream as encoder, classical model downstream as predictor. It halves the labelling spend on most image-classification builds.

Multimodal · Few-shot · Vision-LLM
005 / PIPELINE

The ML model development pipeline we ship: four phases, eval-gated.

The model is roughly a fifth of a real ML model development engagement. The other four-fifths is the data audit, feature pipeline, calibration layer, and drift instrumentation. Skipping any of them is how machine learning solutions silently fail at month six: the model that beat baseline on day one is now mis-calibrated on a shifted population.

  01 Data audit + leakage map

    Label quality, sample-selection bias, leakage paths, feature availability at inference time. Roughly half the engagements we audit have a leak we close before training — usually a feature only computable post-event. We document every join and time-of-availability per feature; the production system can only train on features it can compute at decision time.

  02 Baseline + feature engineering

    scikit-learn baseline always — logistic or linear with regularisation, fit on a starter feature set. The candidate has to beat the baseline by the agreed margin or we don't ship it. Feature engineering layered after: aggregations, lags, target encodings, learned embeddings. The pipeline ships with the model as a single deliverable.

  03 Bench + calibration

    LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model bench-raced on the frozen held-out set. Hyperparameters via Optuna or a structured grid — not random; the search log is kept. Calibration fitted as a Platt or isotonic sidecar. The model that ships isn't the one with the highest raw AUC; it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.

  04 Deploy + drift instrumentation

    FastAPI or BentoML serving on the runtime appropriate to the latency budget; ONNX or pickle for the artefact. Population Stability Index (PSI) per feature on a 30-day rolling window; alerting at 0.2, pause-decisioning at 0.5. Weekly calibration check on production data. The runbook names trigger conditions, on-call rota, and retrain procedure in writing; drift detection is not a mental checklist on a single engineer's laptop.
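The time-of-availability rule from the data-audit phase (a feature is only usable if production can compute it at decision time) lends itself to a mechanical check. This is a minimal sketch, assuming each audited feature is recorded with the time of its underlying event and a publication lag; the `FeatureSpec` record, its field names, and the example features are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical audit record for one feature on one example row."""
    name: str
    event_time: datetime          # when the thing the feature describes happened
    availability_lag: timedelta   # delay before the value is queryable in production


def leaking_features(features, decision_time):
    """Names of features that production could not compute at decision time."""
    return [
        f.name
        for f in features
        if f.event_time + f.availability_lag > decision_time
    ]


decision = datetime(2026, 3, 1, 12, 0)
audit = [
    FeatureSpec("txn_count_30d", datetime(2026, 3, 1, 9, 0), timedelta(hours=1)),
    # Only known weeks after the decision: a classic post-event leak.
    FeatureSpec("chargeback_flag", datetime(2026, 3, 20, 0, 0), timedelta(0)),
    FeatureSpec("bureau_score", datetime(2026, 2, 27, 0, 0), timedelta(days=1)),
]
flagged = leaking_features(audit, decision)
print(flagged)
```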

Retrain cadence ships in the runbook: quarterly by default, monthly when drift is consistent. The client's team runs the retrain script after handoff; we operate it on retainer only when asked.

006 / EVAL

Four gates. Every model. Every week.

Custom machine learning models without these four gates drift silently. Headline AUC stays clean while the business outcome degrades. Every model we hand off carries the gates as a contract — trigger conditions that fire a retrain included.

  01 Held-out AUC / RMSE / NDCG
    Beats baseline by ≥3pts

    Frozen test set carved out at engagement start. AUC for binary classification, RMSE or MAPE for regression and forecasting, NDCG@k for ranking. Baseline is whichever non-ML rule the client already runs. If the candidate can't beat baseline by three points on the headline metric, we don't ship. Roughly one engagement in eight ends up shipping the baseline because the ML model couldn't clear the bar.

    If the candidate doesn't beat baseline, we don't paper over it — we re-frame, harvest more labels, or close at the Pilot gate and bill against the audit memo. Confident-but-wrong ML is worse than no ML.

  02 Calibration error
    Brier < 0.18, ECE < 4%

    Calibration matters more than raw AUC in most business-prediction problems — a churn score that's well-ordered but mis-calibrated breaks every downstream business rule. We measure Brier plus ECE on the held-out set, fit a Platt or isotonic sidecar where needed, and re-measure. Re-checked weekly on production data.

    If calibration drifts above the threshold for two weeks, we re-fit the sidecar before re-fitting the model. Most drift comes from population shift, not model decay — the calibration layer is the right place to absorb it.

  03 Feature-drift detection
    PSI < 0.2 per feature

    Population Stability Index (PSI) per feature on a rolling 30-day window. Above 0.2 the feature has drifted enough to inspect; above 0.5 the model is outside the training envelope. Paired with output-distribution monitoring. Tracked in Evidently or a Postgres dashboard, depending on the client's MLOps capacity.

    Any feature breaching 0.5 PSI fires a Slack alert and pauses automated decisioning if the use case warrants it (fraud, credit, healthcare). Retraining usually quarterly, monthly when drift is consistent. Trigger conditions documented in the runbook.

  04 Fairness · slice metrics
    ≤5pt AUC gap across slices

    AUC and calibration measured separately on the slices that matter — region, segment, protected class where legally relevant. Headline AUC can look fine while the worst slice is unusable. We surface the gap during model selection and discuss the tradeoff openly — sometimes the cheaper, fairer model wins.

    Above-threshold gaps trigger a re-weighting pass or a slice-specific model. We don't ship where the fairness story is uncomfortable; if the data won't support a fair model, that's a finding we surface in writing.
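The slice gate in the list above reduces to computing AUC separately per slice and comparing the spread with the agreed limit. A minimal sketch, assuming the gap is measured as best slice minus worst slice in AUC points (the page does not pin down the exact definition); the function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def slice_auc_gap(rows):
    """rows: (slice_name, label, score). Returns per-slice AUC and gap in points."""
    by_slice = defaultdict(lambda: ([], []))
    for name, label, score in rows:
        by_slice[name][0].append(label)
        by_slice[name][1].append(score)
    per_slice = {name: roc_auc(ys, ss) for name, (ys, ss) in by_slice.items()}
    gap_pts = (max(per_slice.values()) - min(per_slice.values())) * 100
    return per_slice, gap_pts


rows = [
    ("north", 1, 0.9), ("north", 1, 0.7), ("north", 0, 0.4), ("north", 0, 0.2),
    ("south", 1, 0.8), ("south", 1, 0.3), ("south", 0, 0.5), ("south", 0, 0.1),
]
per_slice, gap_pts = slice_auc_gap(rows)
passes = gap_pts <= 5.0  # the 5-point limit quoted in the gate
print(per_slice, gap_pts, passes)
```

Here the pooled ranking can look acceptable while the "south" slice lags badly, which is the situation the gate exists to surface.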

007 / CAPABILITIES

Six capability families across six industries — where we've shipped.

A capability-by-industry heatgrid. Strength reflects what we've taken to production, not what we've explored. The light cells are honest — we won't claim depth we haven't built.

Function × Industry
Industries: B2B SaaS · Fintech · E-commerce · Healthcare · Manufacturing · Logistics
  • Risk · Fraud · Credit: B2B SaaS, Fintech, E-commerce, Healthcare, Logistics; Manufacturing
  • Churn · LTV · Conversion: B2B SaaS, Fintech, E-commerce, Logistics; Healthcare, Manufacturing
  • Demand · Supply Forecast: B2B SaaS, Fintech, E-commerce, Healthcare, Manufacturing, Logistics
  • Ranking · Search: B2B SaaS, E-commerce, Healthcare; Fintech, Manufacturing, Logistics
  • Recommendation: B2B SaaS, E-commerce; Fintech, Healthcare, Manufacturing, Logistics
  • Vision · Sensor · OCR: Fintech, E-commerce, Healthcare, Manufacturing, Logistics; B2B SaaS
Legend: Possible fit · Good fit · Primary vertical

Dark cells: shipped at production scale. Medium: shipped in pilot. Light: experimented but not yet production. Empty cells are real.

008 / WHEN

When ML is the answer — and when it isn't.

The most expensive failure mode here is shipping ML where a 20-line rule would have done the job. The second-most is the inverse — running a hand-tuned heuristic two years past the point a calibrated gradient boosting model would have doubled the outcome. The list below is the screen we run on every inbound.

If the screen lands clean on four of six, custom machine learning models are usually the right shape. Two or fewer, we'll often recommend something else — sometimes our AI consulting shape, sometimes a heuristic-plus-measurement plan, sometimes nothing.

009 / PROCESS

Eval-first, baseline-anchored, eight weeks to a calibrated model.

Metric and baseline land in week one — locked before training begins. Every model we ship is graded against the same frozen held-out set; nothing slides because the team got attached to a result. That's the difference between machine learning solutions that talk about quality and ones that measure it.

WEEK 1

Problem framing

Predict what, for whom, against which baseline. The eval metric is locked in week one — AUC, NDCG, RMSE, calibration band. The baseline is the non-ML rule already running.

WEEK 1–2

Data audit

Label quality, leakage paths, sample-selection bias, class balance, feature availability at inference time. Half the engagements we audit have a leak we have to close before training begins.

WEEK 2–4

Baseline + features

scikit-learn baseline always. Then engineered features — aggregations, lags, target encodings, embeddings. The feature pipeline is part of the deliverable; nothing trains on features the production system can't compute.

WEEK 4–6

Training + eval

Candidate models bench-raced against the baseline on the frozen held-out set. LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model. Hyperparameters via Optuna or a structured grid.
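The bench race ends in a simple comparison: candidate metric against baseline metric on the same frozen set, with a margin. A minimal sketch for the AUC case, assuming "three points" means 0.03 of AUC; the function names and the toy held-out data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def clears_gate(candidate_auc, baseline_auc, margin_pts=3.0):
    """True when the candidate beats baseline by the agreed margin in AUC points."""
    # Small tolerance so a gap of exactly 3 points is not lost to float rounding.
    return (candidate_auc - baseline_auc) * 100 >= margin_pts - 1e-9


# Frozen held-out labels with two sets of scores (hypothetical).
held_out = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_scores = [0.9, 0.6, 0.5, 0.3, 0.7, 0.4, 0.35, 0.1]
candidate_scores = [0.9, 0.8, 0.6, 0.45, 0.7, 0.4, 0.3, 0.1]

baseline_auc = roc_auc(held_out, baseline_scores)
candidate_auc = roc_auc(held_out, candidate_scores)
print(baseline_auc, candidate_auc, clears_gate(candidate_auc, baseline_auc))
```

The pairwise AUC here is quadratic in the number of rows, which is fine for an illustration but not for a large held-out set.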

WEEK 6–8

Calibration + slices

Brier, ECE, slice-AUC, calibration sidecar. Fairness review across the slices that matter. The model that ships isn't the one with the highest raw AUC — it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.
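For readers who want the two calibration numbers spelled out: Brier is the mean squared error of the predicted probability, and ECE is the bin-weighted gap between predicted and observed rates. A minimal sketch with equal-width bins, assuming the thresholds quoted elsewhere on this page (Brier below 0.18, ECE below 4%); the binning scheme and the toy data are assumptions.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def expected_calibration_error(labels, probs, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        observed = sum(y for y, _ in bucket) / len(bucket)
        predicted = sum(p for _, p in bucket) / len(bucket)
        ece += (len(bucket) / len(labels)) * abs(observed - predicted)
    return ece


# Toy sample: ten rows scored 0.2 (two positives), ten scored 0.8 (eight positives).
probs = [0.2] * 10 + [0.8] * 10
labels = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]

brier = brier_score(labels, probs)
ece = expected_calibration_error(labels, probs)
passes = brier < 0.18 and ece < 0.04
print(round(brier, 4), round(ece, 4), passes)
```

This sample is perfectly calibrated by construction, so ECE is effectively zero while Brier still reflects how far the scores sit from 0 and 1.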

WEEK 8+

Deploy + monitor

Serving layer (FastAPI, BentoML, or the client's existing pattern), ONNX or pickle artifact, feature-pipeline runbook, drift monitoring on PSI, weekly calibration check. Handoff to the client's ML model development team or our MLOps sibling.

010 / TIMELINE

What the eight-week Production ML Build looks like.

The standard custom ML development build: a defined slice ships in eight weeks. Ranking-or-Recommendation adds 2 weeks for the two-stage A/B harness; the Pilot is a tighter 3–5 week cut.

6 phases
WEEK 1 Problem framing

Locked metric, locked baseline, eval-set plan, data-access list

Metric + baseline sign-off

WEEK 2 Data audit

Leakage map, label-quality report, feature-availability matrix

Audit findings reviewed

WEEK 3–4 Baseline + features

scikit-learn baseline; engineered feature pipeline v1

Baseline > non-ML rule

WEEK 4–6 Model bench

LightGBM / PyTorch / linear bench; held-out scores

Best model ≥ baseline + 3pts

WEEK 6–8 Calibration + fairness

Platt or isotonic sidecar; slice-AUC report

ECE < 4%, slice gap ≤ 5pts

WEEK 8+ Deploy + monitor

Serving layer, drift dashboard, runbook, retrain cadence

First 30d of clean traces

011 / STACK

Frameworks we've shipped on.

Pinned to what we have in production in 2026. The actual integrations under support — not a marketing list.

  • XGBoost
  • LightGBM
  • CatBoost
  • scikit-learn
  • PyTorch
  • Lightning
  • timm
  • ONNX Runtime
  • Faiss
  • ScaNN
  • pgvector
  • Prophet
  • TimeGPT
  • Evidently
  • MLflow
  • Weights & Biases
012 / USE CASES

Where teams have shipped.

Three anonymized engagements. Function, segment, and outcome metric are real; brand removed under NDA.

E-commerce
DTC retail · catalogue-scale

Hierarchical demand forecast across thousands of SKU-region pairs

Typical shape: replace a Prophet-per-SKU pipeline with one LightGBM-on-lags model carrying hierarchical features (category, region, promo cadence). Training cost compresses materially; headline categories typically gain meaningful MAPE points against the prior baseline. Re-trained weekly on Modal. Pairs with the client's ERP for the planning loop.

Deliverable: hierarchical model + weekly retrain + planner-facing API
Fintech
Regulated lending · EU

Calibrated credit risk model with SHAP-led explanation

Typical shape: gradient boosting on application + bureau features, isotonic calibration sidecar, SHAP-based per-application reason codes. Slice-AUC reviewed across regional and demographic cuts before sign-off with the regulator-facing risk lead. Replaces drifted logistic scorecards. Live with quarterly retrain.

Deliverable: calibrated model + reason-code service + slice-fairness register
Logistics
Last-mile · enterprise routing

Two-stage ranking model for routing recommendations

Typical shape: two-stage, with Faiss candidate-gen on driver-route embeddings and a LightGBM ranker as the second-stage re-ranker. NDCG@10 measured offline on a frozen log; CTR measured online in an A/B test. Replaces hand-tuned heuristics that have been the dispatcher-productivity bottleneck for years.

Deliverable: ranker + retrieval index + offline + online eval harness
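The offline half of that harness comes down to NDCG@k over logged relevance labels. A minimal sketch, assuming linear gain and an ideal ordering built from the same returned list (some teams use exponential gain or the full judged pool instead); the function names and the relevance grades are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the ranking divided by the DCG of the best possible reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of the routes in the order the ranker returned them.
logged = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(logged, k=10)
print(round(score, 4))
```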
013 / WHY PAITEQ

Why teams pick us as their machine learning development services partner.

014 / ENGAGE

Four ways to start.

01 ML Pilot Fixed scope
3–5 weeks

One problem, one model, one eval set.

In scope
  • Problem framing + locked metric
  • scikit-learn baseline
  • Candidate model bench against baseline
  • Held-out eval report + costed go-no-go memo
Out of scope
  • Production deployment
  • Drift instrumentation
  • Ongoing retraining (separate Build)
02 Production ML Build Fixed scope
8–14 weeks

Calibrated, drift-monitored, owned by you.

In scope
  • All Pilot deliverables
  • Feature pipeline in your repo
  • Calibration sidecar
  • Drift instrumentation + runbook
  • Six weeks of post-launch iteration
03 Ranking / Recommendation Fixed scope
6–10 weeks

Two-stage candidate-gen + re-rank.

In scope
  • Candidate-generation index (Faiss / ScaNN / pgvector)
  • Cross-encoder re-ranker training
  • Offline NDCG harness
  • Online A/B test rig wired
04 ML Audit & Roadmap Fixed scope
2–3 weeks

Read of practice + costed roadmap.

In scope
  • Data audit, eval-rigour audit, deployment-posture read
  • Leakage and drift findings in writing
  • Costed roadmap memo for the next 6–12 months
015 / FAQ

What buyers ask before signing.

When do we pick classical ML over an LLM?

The short answer: whenever the inputs are tabular rows and the output is a number, a class, or a ranked list. Gradient boosting on engineered features still beats every LLM-shaped solution we've benchmarked for those problems, at roughly a hundredth of the inference cost. The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate and more explainable. LLMs win when input is unstructured and output is generative or reasoning-shaped. The hybrid pattern (vision-LLM as feature extractor, classical model as predictor) is the third option, and it's where this practice meets our LLM development services.

How much labelled data do we actually need for a custom ML build?

Depends on the problem family and the model. For tabular gradient boosting on a business-prediction problem, 5,000–20,000 labelled rows is usually enough to beat a heuristic by a defensible margin; 100,000+ is where the model gets sharp. Deep vision from scratch typically needs 10,000–500,000 examples. The vision-LLM-as-encoder pattern collapses that to 500–5,000, the single biggest unlock in custom ML development since transformers landed. If you don't have labels and can't get them cheaply, we'll say so during the audit. Sometimes the answer is active learning, sometimes synthetic data, sometimes a heuristic-plus-measurement plan that gets you to ML in six months.

Who owns the model after delivery — Paiteq or the client?

The client. Every machine learning development services engagement ends with the model artefact (booster file, ONNX export, PyTorch checkpoint), the training code, the feature pipeline, the eval harness, and a runbook in your repo under your license. We don't retain joint IP on customer-trained models: the trained weights, the engineered features, and the calibration sidecar are yours. What we retain is methodology: build templates, eval rubrics, audit playbooks. We can operate the model on a retainer if you want; the artefact stays yours regardless.

How do you handle drift, calibration, and retraining cadence?

Three layers. First, Population Stability Index (PSI) per feature on a rolling 30-day window: at 0.2 we inspect, and 0.5 pauses automated decisioning if the use case requires it. Second, a calibration check on a production sample every week: Brier and expected calibration error against the band agreed at launch. Third, a scheduled retraining cadence, usually quarterly, pulled to monthly when drift is consistent. The runbook names the trigger conditions in writing; alerts route to a named on-call rota; the retrain procedure is a script the client's team can run without us. Drift detection is engineered, not a mental checklist.
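For reference, the per-feature drift number in that answer is computed as PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bins, where eᵢ and aᵢ are the training and production shares in bin i. A minimal sketch, assuming quantile bins taken from the training sample and a small floor to avoid empty bins; the function names, bin count, and synthetic samples are assumptions, and the 30-day windowing is left to the caller.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI of `actual` (production) against the `expected` (training) sample."""
    ref = sorted(expected)
    # Interior bin edges at the reference sample's quantiles.
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index = edges at or below v
        # Floor each share so an empty bin does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e_prop, a_prop = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_prop, a_prop))


def drift_status(value, inspect_at=0.2, pause_at=0.5):
    """Map a PSI value onto the runbook actions quoted above."""
    if value > pause_at:
        return "pause"
    if value > inspect_at:
        return "inspect"
    return "ok"


training = list(range(1000))                # synthetic reference feature values
mild_shift = [x + 20 for x in training]     # small population shift
hard_shift = [x + 300 for x in training]    # large population shift

print(drift_status(psi(training, mild_shift)), drift_status(psi(training, hard_shift)))
```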

How is this different from your MLOps service and your LLM service?

Three siblings, three different jobs. Machine learning development services here is about building the model — framing, data audit, features, training, eval, calibration. MLOps services is the infrastructure around the model after ship — serving, monitoring, retraining, feature stores. LLM development services is the equivalent practice for large language models — different family, different eval, different unit economics. An engagement can span more than one; we scope each shape separately so the deliverables are unambiguous.

017 / Start a project

Ship a calibrated model in 8 weeks.

Pilot in 3–5. Production Build in 8–14. Ranking / Recommendation in 6–10. ML Audit in 2–3.