# AI readiness assessment: a vendor-neutral scoring rubric

> A vendor-neutral AI readiness assessment: a five-dimension scoring rubric (data, infrastructure, model, team, economics) with weights and honest go/no-go thresholds.

**HTML version:** https://www.paiteq.com/blog/ai-readiness-assessment/
**Published:** 2026-05-29T06:34:36.431Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~17 min


---

Most teams who run an AI readiness assessment score themselves a point too high on every dimension, and the reason is structural: the people filling in the scorecard are the people whose budget depends on the answer being yes. The vendor tools that dominate this search are worse, because they're built to return a green light that leads to a product. So this is the artifact the rest of the page-one results withhold: a vendor-neutral AI readiness assessment you can run in an afternoon, scored across five weighted dimensions on a 0-to-4 maturity scale, with honest go/no-go thresholds that are allowed to tell you to wait. The rubric is the page. Everything else is the reasoning behind the weights.
We've written this for the VP of Strategy, the CDO, and the CTO who own the AI budget and have to defend whether to start, pilot, or hold in front of a board. The register is engineering-led on purpose, because the dimensions that actually predict whether an AI project survives production are technical ones the strategy decks skip. We're makers, not a slide-deck shop, so the bias throughout is toward cost-per-task economics and measurable capability over transformation narrative. Each of the five dimensions gets a concrete 0-to-4 descriptor set, a weight, and a gate. Where a number anchors a decision you'll see the number; where a tool earns a mention it gets named. The scoring rubric below is the same one we walk a steering committee through, minus the engagement pricing that doesn't belong on a public page.

## AI readiness assessment in one paragraph, and why most teams score themselves too high

Working definition. An AI readiness assessment is a scored measurement of whether an organisation can build and operate an AI system that survives contact with production, taken across the five dimensions that actually predict survival: data readiness above all, then model and evaluation capability, then infrastructure, team, and economics. It is not a maturity badge that puts you on a 1-to-5 curve and stops, and it is not a vendor questionnaire that grades your appetite to buy. The output that matters is a weighted score with a go/no-go verdict an engineering leader can defend, and that's what this guide ships.
Here's the first opinionated take, the one that explains nearly every assessment that scores a team ready and then watches the project stall. Most AI readiness assessments overweight infrastructure and underweight data, which is exactly backwards. You can rent infrastructure in an afternoon: AWS Bedrock, Vertex AI, and Azure OpenAI all stand up a usable inference endpoint before lunch, and a GPU is a credit-card transaction. What you cannot rent is eighteen months of clean and labelled event data in the queryable shape a model can read. So an honest rubric puts the heaviest weight on data readiness and lets a low data score cap the overall verdict, no matter how good the infrastructure column looks. The tools at the top of this search invert that, because infrastructure is the column they sell.
The second reason teams over-score is that nobody runs the assessment against a falsifiable test. "We have good data" becomes a 4 in the self-assessment and a 1 the moment an engineer tries to pull it with a query. The fix runs through every dimension below: score against what an engineer can do today, not what the org believes it could do with a quarter of cleanup. For the procurement context that sits underneath this, our [generative AI services buyer's guide](/blog/generative-ai-services-buyers-guide/) covers how to buy once the assessment says you're ready.

## What an AI readiness assessment measures (and the dimensions vendor tools skip)

The category boundary that costs buyers the most is the one between an assessment that measures capability and a questionnaire that measures intent. The interactive wizards from the big platforms grade how ready you are to adopt their stack; the consultancy definitions grade where you sit on a generic maturity curve. Neither tells an engineering team whether the data exists, whether anyone can write a golden-set eval, or whether the unit economics survive at real volume. Here's how we draw the line, because the distinction determines whether you've measured readiness or measured enthusiasm.
The structural difference is the cap rule. A real assessment has dimensions that can veto the overall verdict regardless of the weighted total, because some weaknesses aren't averaged away by strengths elsewhere. The Gartner AI maturity model and the McKinsey AI maturity framework are useful inputs to the team-and-process dimension, and the Cisco AI Readiness Index is a reasonable lens on the infrastructure column, but they're diagnostics that report a position, not rubrics that return a go/no-go. That's the gap this AI readiness assessment closes: it converts a position on a curve into a defensible decision with thresholds attached.
One more boundary worth stating plainly. A readiness assessment is not a tool-selection exercise. Whether you eventually run Claude Sonnet 4.6 or GPT-5 mini, store vectors in pgvector or Pinecone, or evaluate with Ragas or LangSmith, those are downstream implementation choices, not readiness signals. A team that's ready can pick tools; a team that isn't ready can't be made ready by the right tool. Keep that arrow pointing the right way and the scoring stays honest: capability first, vendors later.

## The five dimensions of AI readiness: data, infrastructure, model capability, team, economics

Every assessment we run scores the same five dimensions, weighted by how strongly each one predicts production survival rather than how easy it is to measure. Data readiness carries the most weight because it's the slowest to fix and the most often faked. Model and evaluation capability comes next, because a team that can't measure quality ships drift. Infrastructure is third and deliberately light, because it's the dimension you can buy. Team and process, then economics, round it out. The weights aren't arbitrary; they encode where programs actually die.
Read the scorecard as a shape, not a single number. The team in the figure has strong data and economics columns but a model-and-eval score of 1, which is the most dangerous profile there is: it ships a convincing demo and a product that drifts, because nobody can measure quality before release. A weighted average would round that profile up to a comfortable-looking total. The cap rule won't: with eval capability below 2, the verdict is a no-go until that column is fixed, regardless of how green everything else looks. The weights tell you where to invest; the caps tell you what you're not allowed to paper over.
A note on scoring discipline before the dimension walk-through. Score each dimension against a falsifiable test an engineer can run in an afternoon, not against a stakeholder's confidence. For data, that's a real query. For eval, that's a golden set with a number. For economics, that's a cost-per-task estimate at projected volume, modelled at roughly $0.003 per 1k tokens for a mid-tier model. If a dimension can only be scored by asking how people feel about it, you're measuring intent, and intent is the thing every team over-rates.

## Dimension 1, data readiness: the score that gates everything else

Data readiness is the heaviest-weighted dimension and the one with veto power, because it's the slowest constraint to relax and the most expensive to discover late. The score isn't "do we have data"; everyone has data. It's whether the specific data a model needs exists in a labelled, queryable, lineage-traced, reasonably fresh form an engineer can pull today. The honest test is a single query: can someone on the team retrieve a representative training or retrieval set with a SQL statement against Snowflake, or a connector into the system of record, right now? If the answer is "after a cleanup project," the real score is a 1, and the right action is a data-foundation sprint, not a model deployment.
Two practical moves keep the data score honest. First, run the query test for real on the actual project's data, not on a tidy demo table; tools like dbt make the gap between "the data exists" and "the data is modelled and queryable" concrete, and that gap is usually the real finding. Second, score freshness and labelling separately from existence, because a warehouse full of stale or unlabelled rows is a 2, not a 4, no matter how many terabytes it holds. The deeper mechanics of scoring readiness honestly sit inside our pillar on [AI consulting services](/services/ai-consulting/).
The cap rule on this dimension is non-negotiable, and it's the single most useful output of the whole assessment. A data score below 2 forces the overall verdict to no-go, even if infrastructure, team, and economics all score 4. The logic is simple: an AI system trained or grounded on data you can't reliably pull is a system you can't reliably ship, and no amount of GPU or talent fixes a missing foundation. When this cap fires, the correct response isn't disappointment, it's relief that you found out before spending the model budget. The next step is a scoped data-foundation sprint, then a re-score.
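
The query test and the separate freshness and labelling checks described in this section can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it uses an in-memory SQLite database as a stand-in for the warehouse, and the table name `support_events`, the columns `label` and `created_at`, and the 90-day freshness window are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta


def data_readiness_probe(conn, table, as_of, fresh_days=90):
    """Run one query and report existence, labelling, and freshness separately."""
    cutoff = (as_of - timedelta(days=fresh_days)).isoformat()
    # Identifiers cannot be bound as SQL parameters, so the table name is
    # interpolated; pass only trusted, hard-coded table names here.
    row = conn.execute(
        f"""
        SELECT COUNT(*),
               SUM(CASE WHEN label IS NOT NULL THEN 1 ELSE 0 END),
               SUM(CASE WHEN created_at >= ? THEN 1 ELSE 0 END)
        FROM {table}
        """,
        (cutoff,),
    ).fetchone()
    total, labelled, fresh = row[0], row[1] or 0, row[2] or 0
    return {
        "rows": total,
        "labelled_fraction": labelled / total if total else 0.0,
        "fresh_fraction": fresh / total if total else 0.0,
    }


# Hypothetical demo data: two labelled rows, two recent rows, four rows total.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_events (label TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO support_events VALUES (?, ?)",
    [
        ("refund", "2026-04-20T00:00:00"),
        (None, "2026-04-25T00:00:00"),
        ("billing", "2025-01-10T00:00:00"),
        (None, "2024-11-02T00:00:00"),
    ],
)
report = data_readiness_probe(conn, "support_events", as_of=datetime(2026, 5, 1))
print(report)
```

The point of returning three separate numbers is the one made above: a large row count with a low labelled or fresh fraction should not be scored as if existence were enough. How those fractions map onto the 0-to-4 scale is left to the assessor.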

## Dimension 2, infrastructure and integration readiness

Infrastructure is the dimension most readiness tools overweight, and we deliberately weight it light, because it's the one you can buy. The question we ask isn't whether you own GPUs; it's whether you can stand up an inference path, wire it into the systems where the work actually happens, and observe it in production. AWS Bedrock, Vertex AI, and Azure OpenAI give you a managed model endpoint in an afternoon, so the score here is rarely about raw compute. It's about integration surface and observability: can the AI output reach the CRM, the ticketing system, the workflow engine, and can you see what it did when it gets there?
Score this dimension on three things: an inference path (managed API or self-hosted on vLLM), an integration path into the systems of record, and an observability stack that can trace a request end to end. A team scoring 4 here has Langfuse or OpenTelemetry already wired into its services, durable orchestration through something like Temporal or Inngest, and a clear story for secrets, rate limits, and retries. A team scoring 1 can call a model from a notebook but has no path from that call to a production system anyone trusts. The integration work, not the model, is where infrastructure readiness actually lives, and it's why a high infrastructure score never rescues a low data score.
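
As a rough picture of what "a clear story for retries" plus end-to-end traceability looks like at the smallest scale, here is a standard-library sketch. It is not the Langfuse, OpenTelemetry, Temporal, or Inngest API; the function `call_with_retries`, the in-memory log, and the flaky endpoint are hypothetical stand-ins, and catching every exception is a simplification a production path would narrow.

```python
import time
import uuid


def call_with_retries(fn, *, max_attempts=3, base_delay=0.5, sleep=time.sleep, log=None):
    """Call fn(trace_id) with exponential backoff, logging every attempt under one trace id."""
    trace_id = uuid.uuid4().hex
    log = log if log is not None else []
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn(trace_id)
            log.append({"trace_id": trace_id, "attempt": attempt, "ok": True})
            return result, log
        except Exception as exc:  # simplification: real code would catch specific errors
            log.append({"trace_id": trace_id, "attempt": attempt, "ok": False, "error": repr(exc)})
            if attempt == max_attempts:
                raise
            # Back off 0.5s, 1s, 2s, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical inference endpoint that times out twice, then succeeds.
calls = {"n": 0}


def flaky_inference(trace_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated upstream timeout")
    return f"ok:{trace_id}"


delays = []  # record requested sleeps instead of actually waiting
result, trace = call_with_retries(flaky_inference, sleep=delays.append)
print(result, delays, len(trace))
```

A team that can show the equivalent of `trace` for a real request, with one identifier following the call from the model into the system of record, is demonstrating the observability half of this dimension rather than asserting it.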

Because infrastructure is rentable, it almost never caps the verdict, but it does change the engagement shape. A team strong everywhere except infrastructure can usually start with managed services and harden later; the integration work runs in parallel with the pilot rather than blocking it. The deeper integration patterns are worth reading alongside this section. Our deep-dive on [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/) covers the production shapes the infrastructure dimension has to support, and our [AI integration services](/services/ai-integration/) sit underneath this column.

## Dimension 3, model and evaluation capability (the dimension most checklists omit)

Here's the second opinionated take, and it's the one that separates pilots that scale from pilots that quietly die. Model-evaluation capability is the readiness dimension everyone forgets, and it's the one that predicts production survival better than any other. A team that can't write a golden-set eval will ship a demo that works and a product that drifts, because it has no way to know quality dropped until a customer tells it. So this dimension scores one thing above all: can the team measure quality as a number, on a fixed dataset, before it ships? If the honest answer is no, the eval cap fires and the verdict is no-go, just as it is when the data cap fires.
The honest test for this dimension is whether the team can produce the artifact below, not whether it says it could. A golden-set eval is a versioned set of representative cases, a metric, and a loop that scores a candidate model against it on every change. For retrieval-augmented tasks we use Ragas; for general task evals we log runs through LangSmith or Langfuse and watch the same number every release. The snippet is the shape of the capability we're scoring: a team that can write and run this is at least a 2, a team that can't is a 1, and a team that has it wired into CI with a blocking threshold is a 3 or 4.
The reason this dimension predicts survival is that eval capability is what makes everything downstream measurable. With it, model selection is an experiment: route cheap cases to Claude Haiku 4.5 or Gemini 3.0 Flash, hard cases to Claude Opus 4.7, and let the golden set tell you the smallest model that clears the bar. Without it, model selection is a guess defended by a demo. In a 2026-Q1 Ragas run on a 1,840-document corpus, a retrieval assistant scored 88% answer correctness for a single-digit-dollar eval spend, which is exactly the kind of cheap, repeatable measurement a ready team can produce on demand and an unready one cannot. The eval-capability dimension is the one that turns the rest of the rubric from opinion into evidence.
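
To make the golden-set idea and the "smallest model that clears the bar" experiment concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the four-case golden set, the two keyword-based stubs standing in for real model calls, the per-task costs, and exact-match accuracy as the metric. It does not use the Ragas, LangSmith, or Langfuse APIs, and a real retrieval eval would use richer metrics than exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


@dataclass(frozen=True)
class Candidate:
    name: str                      # hypothetical label, not a real model ID
    cost_per_task: float           # assumed dollars per task
    predict: Callable[[str], str]  # stand-in for a real model call


# A versioned, fixed set of representative cases (hypothetical examples).
GOLDEN_SET = [
    Case("I was charged twice this month", "billing"),
    Case("The app crashes when I log in", "technical"),
    Case("How do I change my invoice address?", "billing"),
    Case("My payment page shows an error 500", "technical"),
]


def small_stub(prompt: str) -> str:
    text = prompt.lower()
    return "billing" if any(w in text for w in ("charged", "invoice", "payment")) else "technical"


def large_stub(prompt: str) -> str:
    text = prompt.lower()
    if "error" in text or "crashes" in text:
        return "technical"
    return small_stub(prompt)


def exact_match_score(candidate: Candidate, golden_set) -> float:
    """Fraction of golden cases the candidate answers exactly right."""
    hits = sum(candidate.predict(c.prompt).strip().lower() == c.expected for c in golden_set)
    return hits / len(golden_set)


def cheapest_passing(candidates, golden_set, threshold):
    """Return (name, score) for the cheapest candidate meeting the threshold, else (None, None)."""
    for cand in sorted(candidates, key=lambda c: c.cost_per_task):
        score = exact_match_score(cand, golden_set)
        if score >= threshold:
            return cand.name, score
    return None, None


candidates = [
    Candidate("large-stub", 0.020, large_stub),
    Candidate("small-stub", 0.002, small_stub),
]
small_score = exact_match_score(candidates[1], GOLDEN_SET)
chosen = cheapest_passing(candidates, GOLDEN_SET, threshold=0.9)
print(small_score, chosen)
```

Wired into CI, the same `threshold` becomes the blocking gate: a change that drops the score below it fails the build. Lowering the threshold to 0.75 in this toy setup would select the cheaper stub instead, which is the model-selection experiment in miniature.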

## Dimension 4, team, process, and governance readiness

Team and process readiness scores whether the organisation can own an AI system after the demo applause fades. The dimension has three parts. Skills: can someone on the team write the eval, build the integration, and reason about retrieval quality, or is every capability rented from a vendor with no internal counterpart? Process: is there a path from idea to production that includes review, rollback, and an on-call owner, or does AI live outside the engineering process as a science project? Governance: are there clear rules on what data can train a model, where outputs can go, and who signs off, before something ships rather than after an incident?
Score this dimension against ownership, not headcount. A team of three engineers who can read a Langfuse trace, write a Ragas eval, and own an on-call rotation scores higher than a department of fifty who depend entirely on a vendor's roadmap. The Gartner AI maturity model and the McKinsey AI maturity framework are useful calibration inputs here, because they describe organisational patterns at each level, but translate their narrative levels into the same falsifiable test as every other dimension: name the person who'd get paged when the AI system misbehaves at 2am. If you can't, the process score is a 1, regardless of how many people are on the org chart.
Team and process rarely caps the verdict on its own, but a low score reshapes the engagement and the timeline. A strong team can run a pilot in parallel with hardening its process; a weak one needs the process built alongside the first project, which is slower but durable. The danger is a team that scores high on skills and low on process, because it can build something impressive that nobody can safely operate. Score ownership and governance as seriously as raw capability, because the most common post-pilot failure isn't that the model couldn't do the task, it's that the organisation couldn't run what the pilot built.

## Dimension 5, economic readiness: can you afford the operate-phase cost curve?

Here's the third opinionated take, the one that sinks more projects than any model failure. Economic readiness isn't whether you can afford to build the AI system; it's whether you can afford to operate it at the volume success creates. The build cost is a one-time number a CFO can sign. The operate cost is a curve that bends with adoption, and successful automations routinely drive two-to-three times the spec'd volume, so it bends faster than anyone budgeted. A team is economically ready when it has modelled steady-state cost-per-task at real volume before committing, and unready when it has budgeted only the build and treated inference as free.
Score this dimension on whether the team can produce a defensible cost-per-task number, not on whether it has budget headroom. The model needs four inputs: task volume per month, the cost-per-task baseline, the cost-per-task after automation, and the build plus ongoing cost. Inference tokens usually dominate the per-task cost at scale, which is why model right-sizing against the golden-set eval is the highest-leverage economic lever a team has. The table breaks down where the operate-phase spend actually lands, and the line that surprises teams is almost always the token line, because it's the one the pilot made look trivial.
Economic readiness rarely caps the verdict outright, but it changes which project you start with. A team that's ready everywhere except economics should sequence a high-volume workflow whose unit savings pay back the build in weeks, not a low-volume showcase that takes a year to pay back on the same per-task math. ROI timing is dominated by the volume term, so the economics dimension and the data-readiness dimension have to be read together: the right first project is the one where the data already exists and the volume justifies the operate-phase curve. Model the curve at roughly $0.003 per 1k tokens times real volume during the assessment, and the month-six finance review is a celebration instead of an inquest.
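
The four-input model and the operate-phase curve can be sketched as below. The $0.003 per 1k tokens figure is the rough number used in this section; every other value (task volumes, baseline cost, tokens per task, build and ongoing costs) is a hypothetical placeholder, and the sketch assumes tokens are the only per-task cost after automation.

```python
from dataclasses import dataclass
from typing import Optional

PRICE_PER_1K_TOKENS = 0.003  # rough mid-tier figure quoted in this section


@dataclass(frozen=True)
class Workflow:
    tasks_per_month: int           # spec'd monthly volume
    baseline_cost_per_task: float  # cost per task before automation
    tokens_per_task: int           # assumed average tokens per automated task
    build_cost: float              # one-time build cost
    ongoing_monthly_cost: float    # monitoring, maintenance, on-call


def automated_cost_per_task(w: Workflow, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    return w.tokens_per_task / 1000 * price_per_1k


def monthly_inference_cost(w: Workflow, volume_multiplier: float = 1.0) -> float:
    return w.tasks_per_month * volume_multiplier * automated_cost_per_task(w)


def payback_months(w: Workflow, volume_multiplier: float = 1.0) -> Optional[float]:
    """Months to recover the build cost; None if net monthly saving is not positive."""
    tasks = w.tasks_per_month * volume_multiplier
    saving = tasks * (w.baseline_cost_per_task - automated_cost_per_task(w)) - w.ongoing_monthly_cost
    return w.build_cost / saving if saving > 0 else None


# Two hypothetical projects with identical per-task math and different volume.
high_volume = Workflow(50_000, 0.50, 8_000, 60_000, 4_000)
showcase = Workflow(2_000, 0.50, 8_000, 60_000, 4_000)

for multiplier in (1.0, 2.0, 3.0):  # success can drive two-to-three times spec'd volume
    print(multiplier, round(monthly_inference_cost(high_volume, multiplier), 2))
print(payback_months(high_volume), payback_months(showcase))
```

With these placeholder inputs the high-volume workflow pays back in about three months, while the low-volume showcase never covers its ongoing cost, which is the volume-term effect described above. Swapping in real numbers is the exercise the assessment asks for.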

## The AI readiness assessment scoring rubric: weights, the 0-4 scale, and go/no-go thresholds

Here's the artifact the rest of the search withholds: the full AI readiness assessment as a scored rubric you can run yourself. Score each dimension 0-to-4 against its falsifiable test, multiply by the weight, and sum for a weighted total out of 4. Check the two cap rules before you trust that total, because the weighted total alone hides the failure modes that kill projects. The weights encode where programs die; the caps encode what no strength elsewhere can rescue. This single table is what turns a self-assessment from a confidence survey into a defensible decision.
The two cap rules are what make this rubric honest, and they're the part vendor tools omit because a cap can return a no. Cap one: if data readiness scores below 2, the overall verdict is no-go regardless of the weighted total, because you can't ship a model on data you can't pull. Cap two: if model and eval capability scores below 2, the verdict is no-go, because a team that can't measure quality will ship drift it can't see. A weighted total can look healthy while one of these columns sits at a 1; the caps stop that healthy-looking total from greenlighting a project that's structurally unready. Apply the caps first, then read the weighted total.
Run that table and you've replaced a confidence survey with a scored decision. The weights make the trade-offs explicit, the caps stop the two structural failures from hiding behind a healthy average, and the thresholds turn the score into one of four honest verdicts: no-go, wait, pilot, or go. A team scoring 2.4 with both caps clear isn't being told it's behind; it's being told to pilot one scoped project with a hard eval gate, which is the most useful thing an assessment can say. The score isn't a grade. It's an instruction set for what to do next.
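
The scoring mechanics can be sketched as code. The two cap rules (data below 2, or model and eval below 2, forces no-go) come from the text above. The numeric weights and the go, pilot, and wait thresholds in this sketch are illustrative assumptions chosen only to respect the stated ordering and the "2.4 with both caps clear means pilot" example; they are not the rubric's published values.

```python
# ASSUMED weights summing to 1.0: data heaviest, model/eval next, the rest lighter.
WEIGHTS = {
    "data": 0.30,
    "model_eval": 0.25,
    "infrastructure": 0.15,
    "team": 0.15,
    "economics": 0.15,
}
CAPPED = ("data", "model_eval")  # from the text: a score below 2 forces no-go
GO_AT = 3.0      # ASSUMED threshold
PILOT_AT = 2.0   # ASSUMED threshold


def assess(scores):
    """Apply the caps first, then read the weighted total (out of 4)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in WEIGHTS:
        if not 0 <= scores[dim] <= 4:
            raise ValueError(f"{dim} must be scored 0-4, got {scores[dim]}")

    fired = [dim for dim in CAPPED if scores[dim] < 2]
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

    if fired:
        verdict = "no-go"
    elif total >= GO_AT:
        verdict = "go"
    elif total >= PILOT_AT:
        verdict = "pilot"
    else:
        verdict = "wait"
    return {"weighted_total": round(total, 2), "fired_caps": fired, "verdict": verdict}


# Strong data and economics, but eval capability at 1: a healthy-looking total, capped.
demo_heavy = assess({"data": 4, "model_eval": 1, "infrastructure": 3, "team": 3, "economics": 4})
# Both caps clear with a mid-range total.
mid_range = assess({"data": 3, "model_eval": 2, "infrastructure": 3, "team": 2, "economics": 2})
print(demo_heavy)
print(mid_range)
```

Under these assumed weights the first profile totals close to 3 and still returns no-go, because the eval cap fires before the total is read. That is the behaviour a plain weighted average cannot reproduce.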

## From score to action: what to do at each readiness level

A score is only worth the action it triggers, so we map each verdict to a concrete next move rather than a feeling. The point of the rubric isn't to rank you; it's to tell you the cheapest thing to do next given where the weakness sits. Because the weights and caps localise the problem, the action is rarely "do everything"; it's almost always "fix this one dimension, then re-score." That's the difference between an assessment that informs a decision and a maturity badge that just describes a state.
Let's walk through the verdicts in order, the way we'd talk a steering committee through them. A no-go from a fired cap means a scoped foundation sprint: a 4-to-6-week data-foundation effort if data is the gap, or an eval-capability sprint to build a golden set and wire a gate if the eval column is the problem, then a re-score. A wait verdict means multiple dimensions are immature and a pilot would mostly debug foundations, so fix the two with the worst weighted impact first. A pilot verdict means you're ready to test one scoped project with a hard eval gate and a walk-away clause on the weakest dimension. A go verdict means build and operate, with the cost curve already modelled. Once you've scored readiness, the next artifact is the sequencing plan; our [AI strategy roadmap template](/blog/ai-roadmap-template/) picks up exactly where a pilot-or-go verdict leaves off, and our deep-dive on [AI automation solutions](/blog/ai-automation-solutions-buyers-guide/) covers the volume-heavy projects that score best on economics.
One closing note on action before the FAQ. The verdicts only work if the scores are taken against falsifiable tests and frozen before the work starts, the same discipline the eval gate enforces inside a pilot. A data score talked up from a 1 to a 3 in a steering meeting is no score; a cost-per-task estimate that ignores the operate-phase curve is no estimate. The value of an engineering-led AI readiness assessment is that it makes the uncomfortable verdicts cheap by reaching them early and numerically, and our [MLOps services](/services/mlops/) are where the eval-and-monitoring capability the rubric scores actually gets built.

## FAQ: AI readiness assessment, in the buyer's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/