AI readiness assessment: a vendor-neutral scoring rubric
A vendor-neutral AI readiness assessment: a five-dimension scoring rubric (data, infrastructure, model, team, economics) with weights and honest go/no-go thresholds.
Most teams who run an AI readiness assessment score themselves a point too high on every dimension, and the reason is structural: the people filling in the scorecard are the people whose budget depends on the answer being yes. The vendor tools that dominate this search are worse, because they're built to return a green light that leads to a product. So this is the artifact the rest of the page-one results withhold: a vendor-neutral AI readiness assessment you can run in an afternoon, scored across five weighted dimensions on a 0-to-4 maturity scale, with honest go/no-go thresholds that are allowed to tell you to wait. The rubric is the page. Everything else is the reasoning behind the weights.
We've written this for the VP of Strategy, the CDO, and the CTO who own the AI budget and have to defend whether to start, pilot, or hold in front of a board. The register is engineering-led on purpose, because the dimensions that actually predict whether an AI project survives production are technical ones the strategy decks skip. We're makers, not a slide-deck shop, so the bias throughout is toward cost-per-task economics and measurable capability over transformation narrative. Each of the five dimensions gets a concrete 0-to-4 descriptor set, a weight, and a gate. Where a number anchors a decision you'll see the number; where a tool earns a mention it gets named. The scoring rubric below is the same one we walk a steering committee through, minus the engagement pricing that doesn't belong on a public page.
AI readiness assessment in one paragraph, and why most teams score themselves too high
Working definition. An AI readiness assessment is a scored measurement of whether an organisation can build and operate an AI system that survives contact with production, taken across the five dimensions that actually predict survival: data readiness above all, then model and evaluation capability, then infrastructure, team, and economics. It is not a maturity badge that puts you on a 1-to-5 curve and stops, and it is not a vendor questionnaire that grades your appetite to buy. The output that matters is a weighted score with a go/no-go verdict an engineering leader can defend, and that's what this guide ships.
Here's the first opinionated take, the one that explains nearly every assessment that scores a team ready and then watches the project stall. Most AI readiness assessments overweight infrastructure and underweight data, which is exactly backwards. You can rent infrastructure in an afternoon: AWS Bedrock, Vertex AI, and Azure OpenAI all stand up a usable inference endpoint before lunch, and a GPU is a credit-card transaction. What you cannot rent is eighteen months of clean and labelled event data in the queryable shape a model can read. So an honest rubric puts the heaviest weight on data readiness and lets a low data score cap the overall verdict, no matter how good the infrastructure column looks. The tools at the top of this search invert that, because infrastructure is the column they sell.
The second reason teams over-score is that nobody runs the assessment against a falsifiable test. "We have good data" becomes a 4 in the self-assessment and a 1 the moment an engineer tries to pull it with a query. The fix runs through every dimension below: score against what an engineer can do today, not what the org believes it could do with a quarter of cleanup. For the procurement context that sits underneath this, our generative AI services buyer's guide covers how to buy once the assessment says you're ready.
What an AI readiness assessment measures (and the dimensions vendor tools skip)
The category boundary that costs buyers the most is the one between an assessment that measures capability and a questionnaire that measures intent. The interactive wizards from the big platforms grade how ready you are to adopt their stack; the consultancy definitions grade where you sit on a generic maturity curve. Neither tells an engineering team whether the data exists, whether anyone can write a golden-set eval, or whether the unit economics survive at real volume. Here's how we draw the line, because the distinction determines whether you've measured readiness or measured enthusiasm.
| Aspect | Vendor readiness tool (what the SERP sells) | Engineering-led AI readiness assessment |
|---|---|---|
| What it scores | Appetite and stated ambition to buy | Measurable capability an engineer can demonstrate today |
| Data dimension | A checkbox: do you have data, yes or no | 0-to-4 on queryability, labels, lineage, and freshness |
| Model and eval capability | Rarely measured at all | Scored on whether the team can write a golden-set eval |
| Economics | ROI narrative, hours saved | Cost-per-task at real volume, per-token math |
| Vendor stance | Tilted toward one cloud or model provider | Vendor-neutral; the rubric outlives any single tool |
| Possible verdict | Always some flavour of ready (next step is the product) | Start, pilot, or wait, with the wait case fully respected |
The structural difference is the cap rule. A real assessment has dimensions that can veto the overall verdict regardless of the weighted total, because some weaknesses aren't averaged away by strengths elsewhere. The Gartner AI maturity model and the McKinsey AI maturity framework are useful inputs to the team-and-process dimension, and the Cisco AI Readiness Index is a reasonable lens on the infrastructure column, but they're diagnostics that report a position, not rubrics that return a go/no-go. That's the gap this AI readiness assessment closes: it converts a position on a curve into a defensible decision with thresholds attached.
One more boundary worth stating plainly. A readiness assessment is not a tool-selection exercise. Whether you eventually run Claude Sonnet 4.6 or GPT-5 mini, store vectors in pgvector or Pinecone, or evaluate with Ragas or LangSmith, those are downstream implementation choices, not readiness signals. A team that's ready can pick tools; a team that isn't ready can't be made ready by the right tool. Keep that arrow pointing the right way and the scoring stays honest: capability first, vendors later.
The five dimensions of AI readiness: data, infrastructure, model capability, team, economics
Every assessment we run scores the same five dimensions, weighted by how strongly each one predicts production survival rather than how easy it is to measure. Data readiness carries the most weight because it's the slowest to fix and the most often faked. Model and evaluation capability comes next, because a team that can't measure quality ships drift. Infrastructure is third and deliberately light, because it's the dimension you can buy. Team and process, then economics, round it out. The weights aren't arbitrary; they encode where programs actually die.
Read the scorecard as a shape, not a single number. The team in the figure has strong data and economics columns but a model-and-eval score of 1, which is the most dangerous profile there is: it ships a convincing demo and a product that drifts, because nobody can measure quality before release. A weighted average would round that profile up to a comfortable-looking total. The cap rule won't: with eval capability below 2, the verdict is a no-go until that column is fixed, regardless of how green everything else looks. The weights tell you where to invest; the caps tell you what you're not allowed to paper over.
A note on scoring discipline before the dimension walk-through. Score each dimension against a falsifiable test an engineer can run in an afternoon, not against a stakeholder's confidence. For data, that's a real query. For eval, that's a golden set with a number. For economics, that's a cost-per-task estimate at projected volume, modelled at roughly $0.003 per 1k tokens for a mid-tier model. If a dimension can only be scored by asking how people feel about it, you're measuring intent, and intent is the thing every team over-rates.
Dimension 1, data readiness: the score that gates everything else
Data readiness is the heaviest-weighted dimension and the one with veto power, because it's the slowest constraint to relax and the most expensive to discover late. The score isn't "do we have data"; everyone has data. It's whether the specific data a model needs exists in a labelled, queryable, lineage-traced, reasonably fresh form an engineer can pull today. The honest test is a single query: can someone on the team retrieve a representative training or retrieval set with a SQL statement against Snowflake, or a connector into the system of record, right now? If the answer is "after a cleanup project," the real score is a 1, and the right action is a data-foundation sprint, not a model deployment.
| Score | What it looks like in practice | The query test |
|---|---|---|
| 0 — Absent | Data is in PDFs, spreadsheets, and people's heads; no system of record | No query is possible at all |
| 1 — Raw | Data exists in source systems but is unmodelled, unlabelled, and inconsistent | A query returns rows, but joining and labelling is a multi-month project |
| 2 — Queryable | Modelled in a warehouse with dbt, queryable, but labels and freshness are uneven | An engineer can pull a usable set in a day with caveats |
| 3 — Governed | Labelled, lineage-traced, access-controlled, refreshed on a known cadence | A representative set is one well-understood query away |
| 4 — Production-grade | Feature/retrieval pipelines exist; data is versioned and monitored for drift | Pulling training or retrieval data is a routine, automated operation |
Two practical moves keep the data score honest. First, run the query test for real on the actual project's data, not on a tidy demo table; tools like dbt make the gap between "the data exists" and "the data is modelled and queryable" concrete, and that gap is usually the real finding. Second, score freshness and labelling separately from existence, because a warehouse full of stale or unlabelled rows is a 2, not a 4, no matter how many terabytes it holds. The deeper mechanics of scoring readiness honestly sit inside our pillar on AI consulting services.
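As a rough illustration of scoring freshness and labelling separately from existence, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the warehouse. The `events` table, its `label` and `updated_at` columns, the 30-day freshness window, and the sample rows are all hypothetical; a real run would point the same two measurements at the project's actual tables.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def data_signals(conn, table, label_col, ts_col, max_age_days, now):
    """Measure label coverage and freshness separately for one table."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    # Identifiers are interpolated, so pass only trusted, hard-coded names.
    total, labelled, fresh = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {label_col} IS NOT NULL THEN 1 ELSE 0 END), "
        f"SUM(CASE WHEN {ts_col} >= ? THEN 1 ELSE 0 END) "
        f"FROM {table}",
        (cutoff,),
    ).fetchone()
    if not total:
        return {"rows": 0, "label_coverage": 0.0, "fresh_share": 0.0}
    return {
        "rows": total,
        "label_coverage": labelled / total,
        "fresh_share": fresh / total,
    }


# Hypothetical demo table standing in for the project's real data.
# Timestamps are ISO-8601 strings in UTC, so text comparison orders them correctly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (label TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("churn", "2026-01-20T00:00:00+00:00"),
        ("renewal", "2025-06-01T00:00:00+00:00"),
        ("churn", "2025-03-01T00:00:00+00:00"),
        (None, "2025-02-01T00:00:00+00:00"),
    ],
)
now = datetime(2026, 1, 31, tzinfo=timezone.utc)
signals = data_signals(conn, "events", "label", "updated_at", 30, now)
print(signals)
```

In this toy table most rows are labelled but only one is fresh, which is the shape of finding that a row count alone would hide.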
The cap rule on this dimension is non-negotiable, and it's the single most useful output of the whole assessment. A data score below 2 forces the overall verdict to no-go, even if infrastructure, team, and economics all score 4. The logic is simple: an AI system trained or grounded on data you can't reliably pull is a system you can't reliably ship, and no amount of GPU or talent fixes a missing foundation. When this cap fires, the correct response isn't disappointment, it's relief that you found out before spending the model budget. The next step is a scoped data-foundation sprint, then a re-score.
Dimension 2, infrastructure and integration readiness
Infrastructure is the dimension most readiness tools overweight, and we deliberately weight it light, because it's the one you can buy. The question we ask isn't whether you own GPUs; it's whether you can stand up an inference path, wire it into the systems where the work actually happens, and observe it in production. AWS Bedrock, Vertex AI, and Azure OpenAI give you a managed model endpoint in an afternoon, so the score here is rarely about raw compute. It's about integration surface and observability: can the AI output reach the CRM, the ticketing system, the workflow engine, and can you see what it did when it gets there?
Score this dimension on three things: an inference path (managed API or self-hosted on vLLM), an integration path into the systems of record, and an observability stack that can trace a request end to end. A team scoring 4 here has Langfuse or OpenTelemetry already wired into its services, durable orchestration through something like Temporal or Inngest, and a clear story for secrets, rate limits, and retries. A team scoring 1 can call a model from a notebook but has no path from that call to a production system anyone trusts. The integration work, not the model, is where infrastructure readiness actually lives, and it's why a high infrastructure score never rescues a low data score.
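One way to picture the "clear story for retries" part of that score is a bounded retry loop that parks failures instead of retrying forever. The sketch below is a standard-library stand-in, not how Temporal or Inngest implement it; the function names, the two-retry default, and the simulated failures are illustrative assumptions.

```python
import time


def call_with_retries(task, call, dead_letter, max_retries=2, base_delay=0.0):
    """Run call(task) once plus up to max_retries retries; park what still fails."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return call(task)
        except Exception as exc:  # a real system would catch narrower error types
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff between attempts; zero delay in this demo.
                time.sleep(base_delay * (2 ** attempt))
    dead_letter.append(
        {"task": task, "error": repr(last_error), "attempts": max_retries + 1}
    )
    return None


# Hypothetical model calls: one recovers on the third attempt, one never does.
calls = {"n": 0}


def flaky(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("model endpoint timed out")
    return f"done:{task}"


def always_down(task):
    raise ConnectionError("endpoint unavailable")


parked = []
ok = call_with_retries("ticket-1", flaky, parked)
failed = call_with_retries("ticket-2", always_down, parked)
print(ok, failed, parked)
```

The parked list is the piece an assessor would ask to see: a team that can show where failed calls go, and who reads them, has an integration path rather than a notebook.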
Because infrastructure is rentable, it almost never caps the verdict, but it does change the engagement shape. A team strong everywhere except infrastructure can usually start with managed services and harden later; the integration work runs in parallel with the pilot rather than blocking it. The deeper integration patterns are worth reading alongside this: our deep-dive on multi-agent orchestration patterns covers the production shapes the infrastructure dimension has to support, and our AI integration services sit underneath this column.
Dimension 3, model and evaluation capability (the dimension most checklists omit)
Here's the second opinionated take, and it's the one that separates pilots that scale from pilots that quietly die. Model-evaluation capability is the readiness dimension everyone forgets, and it's the one that predicts production survival better than any other. A team that can't write a golden-set eval will ship a demo that works and a product that drifts, because it has no way to know quality dropped until a customer tells it. So this dimension scores one thing above all: can the team measure quality as a number, on a fixed dataset, before it ships? If the honest answer is no, the eval cap fires and the verdict is no-go, the same way a low data score does.
The honest test for this dimension is whether the team can produce the artifact below, not whether it says it could. A golden-set eval is a versioned set of representative cases, a metric, and a loop that scores a candidate model against it on every change. For retrieval-augmented tasks we use Ragas; for general task evals we log runs through LangSmith or Langfuse and watch the same number every release. The snippet is the shape of the capability we're scoring: a team that can write and run this is at least a 2, a team that can't is a 1, and a team that has it wired into CI with a blocking threshold is a 3 or 4.
# Golden-set eval: the artifact that decides Dimension 3.
# A team that can write, version, and run this scores >= 2 on model/eval capability.
# Assumes the Ragas 0.1-style API, where evaluate() takes a Hugging Face Dataset
# and the result can be indexed by metric name for the aggregate score.
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

GOLDEN_SET = "golden_v1.jsonl"  # ~500 representative cases, versioned in git
GATE_THRESHOLD = 0.85           # agreed before the pilot, frozen while it runs


def score_model(model_fn) -> dict:
    # One JSON object per line, each with an "input" and an "expected" field.
    with open(GOLDEN_SET) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    preds = [model_fn(c["input"]) for c in cases]
    result = evaluate(
        dataset=Dataset.from_dict({
            "question": [c["input"] for c in cases],
            "answer": preds,
            "ground_truth": [c["expected"] for c in cases],
        }),
        metrics=[answer_correctness],
    )["answer_correctness"]
    verdict = "GO" if result >= GATE_THRESHOLD else "NO-GO (not ready)"
    return {"score": round(result, 3), "threshold": GATE_THRESHOLD,
            "n_cases": len(cases), "verdict": verdict}


if __name__ == "__main__":
    from candidate import answer  # the model under assessment
    print(json.dumps(score_model(answer), indent=2))

The reason this dimension predicts survival is that eval capability is what makes everything downstream measurable. With it, model selection is an experiment: route cheap cases to Claude Haiku 4.5 or Gemini 3.0 Flash, hard cases to Claude Opus 4.7, and let the golden set tell you the smallest model that clears the bar. Without it, model selection is a guess defended by a demo. In a 2026-Q1 Ragas run on a 1,840-document corpus, a retrieval assistant scored 88% answer correctness for a single-digit-dollar eval spend, which is exactly the kind of cheap, repeatable measurement a ready team can produce on demand and an unready one cannot. The eval-capability dimension is the one that turns the rest of the rubric from opinion into evidence.
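To make "the smallest model that clears the bar" concrete, here is a minimal sketch of model right-sizing as a selection over golden-set results. The tier names, per-token prices, and scores are placeholders rather than measurements of any real model; only the 0.85 gate comes from the eval above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    usd_per_1k_tokens: float  # placeholder price, not a vendor quote
    golden_set_score: float   # placeholder score from a golden-set run


def smallest_model_that_clears(candidates, threshold):
    """Return the cheapest candidate whose golden-set score meets the gate."""
    passing = [c for c in candidates if c.golden_set_score >= threshold]
    if not passing:
        return None  # nothing clears the bar: fix the task or the data, not the tier
    return min(passing, key=lambda c: c.usd_per_1k_tokens)


# Hypothetical results for three tiers on the same frozen golden set.
candidates = [
    Candidate("small-tier", 0.001, 0.81),
    Candidate("mid-tier", 0.003, 0.88),
    Candidate("large-tier", 0.015, 0.93),
]
choice = smallest_model_that_clears(candidates, threshold=0.85)
none_pass = smallest_model_that_clears(candidates, threshold=0.95)
print(choice, none_pass)
```

With these placeholder numbers the mid tier is the cheapest model that clears the gate, and the larger tier's extra accuracy is spend the eval says the task does not need.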
Dimension 4, team, process, and governance readiness
Team and process readiness scores whether the organisation can own an AI system after the demo applause fades. The dimension has three parts. Skills: can someone on the team write the eval, build the integration, and reason about retrieval quality, or is every capability rented from a vendor with no internal counterpart? Process: is there a path from idea to production that includes review, rollback, and an on-call owner, or does AI live outside the engineering process as a science project? Governance: are there clear rules on what data can train a model, where outputs can go, and who signs off, before something ships rather than after an incident?
Score this dimension against ownership, not headcount. A team of three engineers who can read a Langfuse trace, write a Ragas eval, and own an on-call rotation scores higher than a department of fifty who depend entirely on a vendor's roadmap. The Gartner AI maturity model and the McKinsey AI maturity framework are useful calibration inputs here, because they describe organisational patterns at each level, but translate their narrative levels into the same falsifiable test as every other dimension: name the person who'd get paged when the AI system misbehaves at 2am. If you can't, the process score is a 1, regardless of how many people are on the org chart.
Team and process rarely caps the verdict on its own, but a low score reshapes the engagement and the timeline. A strong team can run a pilot in parallel with hardening its process; a weak one needs the process built alongside the first project, which is slower but durable. The danger is a team that scores high on skills and low on process, because it can build something impressive that nobody can safely operate. Score ownership and governance as seriously as raw capability, because the most common post-pilot failure isn't that the model couldn't do the task, it's that the organisation couldn't run what the pilot built.
Dimension 5, economic readiness: can you afford the operate-phase cost curve?
Here's the third opinionated take, the one that sinks more projects than any model failure. Economic readiness isn't whether you can afford to build the AI system; it's whether you can afford to operate it at the volume success creates. The build cost is a one-time number a CFO can sign. The operate cost is a curve that bends with adoption, and successful automations routinely drive two-to-three times the spec'd volume, so it bends faster than anyone budgeted. A team is economically ready when it has modelled steady-state cost-per-task at real volume before committing, and unready when it has budgeted only the build and treated inference as free.
Score this dimension on whether the team can produce a defensible cost-per-task number, not on whether it has budget headroom. The model needs four inputs: task volume per month, the cost-per-task baseline, the cost-per-task after automation, and the build plus ongoing cost. Inference tokens usually dominate the per-task cost at scale, which is why model right-sizing against the golden-set eval is the highest-leverage economic lever a team has. The table breaks down where the operate-phase spend actually lands, and the line that surprises teams is almost always the token line, because it's the one the pilot made look trivial.
| Cost component | Driver | Typical shape per task | Lever to pull |
|---|---|---|---|
| Inference tokens | Model tier x token count x volume | Dominant line at scale | Right-size the model against the eval bar |
| Retrieval query | Vector store reads, reranking | Small but volume-sensitive | pgvector over a managed tier when Postgres already exists |
| Orchestration compute | Workflow runs, retries | Usually rounding error | Cap LLM-call retries at 2-3, dead-letter the rest |
| Observability + on-call | Traces, eval re-runs, drift watch | Amortised across all tasks | Budget it during the assessment, not as an afterthought |
| Index maintenance | Reindexing cadence | Periodic compute + engineer time | Weekly for active corpora, monthly for archives |
Economic readiness rarely caps the verdict outright, but it changes which project you start with. A team that's ready everywhere except economics should sequence a high-volume workflow whose unit savings pay back the build in weeks, not a low-volume showcase that takes a year on the same per-task math. ROI timing is dominated by the volume term, so the economics dimension and the data-readiness dimension have to be read together: the right first project is the one where the data already exists and the volume justifies the operate-phase curve. Model the curve at roughly $0.003 per 1k tokens times real volume during the assessment, and the month-six finance review is a celebration instead of an inquest.
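The payback arithmetic is short enough to sketch. The version below uses the roughly $0.003 per 1k tokens figure from this guide; the tokens per task, the baseline cost, the build cost, and both volumes are hypothetical inputs, and treating ongoing cost as a monthly deduction from the saving is our assumption about how to net it out.

```python
def cost_per_task(tokens_per_task, usd_per_1k_tokens=0.003):
    """Inference cost of one task at a given per-1k-token price."""
    return tokens_per_task / 1000 * usd_per_1k_tokens


def payback_months(monthly_volume, baseline_cost, automated_cost,
                   build_cost, monthly_ongoing=0.0):
    """Months to recover the build cost from per-task savings."""
    monthly_saving = monthly_volume * (baseline_cost - automated_cost) - monthly_ongoing
    if monthly_saving <= 0:
        return None  # the automation never pays back on these inputs
    return build_cost / monthly_saving


# Hypothetical inputs: 4,000 tokens per task, $2.00 baseline, $60,000 build.
unit = cost_per_task(4000)
high_volume = payback_months(20_000, 2.00, unit, 60_000)
low_volume = payback_months(2_500, 2.00, unit, 60_000)

# Operate-phase token bill if adoption drives 1x, 2x, or 3x the spec'd volume.
token_bill = {m: round(20_000 * m * unit, 2) for m in (1, 2, 3)}
print(round(high_volume, 2), round(low_volume, 2), token_bill)
```

On these inputs the same unit saving pays back in about a month and a half at 20,000 tasks a month and in about a year at 2,500, which is the volume effect described above.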
The AI readiness assessment scoring rubric: weights, the 0-4 scale, and go/no-go thresholds
Here's the artifact the rest of the search withholds: the full AI readiness assessment as a scored rubric you can run yourself. Score each dimension 0-to-4 against its falsifiable test, multiply by the weight, and sum for a weighted total out of 4. Then apply the two cap rules, because the weighted total alone hides the failure modes that kill projects. The weights encode where programs die; the caps encode what no strength elsewhere can rescue. This single table is what turns a self-assessment from a confidence survey into a defensible decision.
| Dimension | Weight | Score 0-1 (not ready) | Score 2 (borderline) | Score 3-4 (ready) |
|---|---|---|---|---|
| Data readiness (CAP) | 0.30 | Unmodelled, unlabelled, or not queryable without a cleanup project | Queryable in a warehouse; labels and freshness uneven | Labelled, lineage-traced, refreshed; a representative set is one query away |
| Model/eval capability (CAP) | 0.25 | No golden set; quality judged by demo; drift invisible | A versioned golden set and a number per release | Eval gate in CI; production drift watched via Langfuse |
| Infrastructure + integration | 0.15 | Model callable from a notebook only; no path to production | Managed endpoint wired to one system; partial observability | Durable orchestration, full tracing, secrets and retries handled |
| Team, process, governance | 0.15 | AI is a science project; no owner, rollback, or data rules | Some skills internal; process and governance ad hoc | Named on-call owner, deploy pipeline, clear data-use rules |
| Economic readiness | 0.15 | Only the build is budgeted; inference treated as free | Rough cost-per-task estimate, volume assumptions soft | Defensible cost-per-task at real volume, modelled before commit |
The two cap rules are what make this rubric honest, and they're the part vendor tools omit, because a cap can return a no. Cap one: if data readiness scores below 2, the overall verdict is no-go regardless of the weighted total, because you can't ship a model on data you can't pull. Cap two: if model and eval capability scores below 2, the verdict is no-go, because a team that can't measure quality will ship drift it can't see. A weighted total can look healthy while one of these columns sits at a 1; the caps stop that healthy-looking total from greenlighting a project that's structurally unready. Apply the caps first, then read the weighted total.
# AI readiness assessment scorer. Weights encode where programs die;
# the two cap rules encode what no other strength can rescue.
WEIGHTS = {
    "data": 0.30,   # heaviest: you can't rent clean, queryable data
    "eval": 0.25,   # measure quality before shipping, or ship drift
    "infra": 0.15,  # rentable, so weighted light
    "team": 0.15,   # ownership and rollback, not headcount
    "econ": 0.15,   # operate-phase cost-per-task, not build cost
}
CAPS = ("data", "eval")  # below 2 on either forces a no-go


def assess(scores: dict) -> dict:
    for dim in CAPS:
        if scores[dim] < 2:
            return {"verdict": "NO-GO", "reason": f"{dim} cap fired"}
    # Round before comparing so float noise can't demote an exact 3.0 to PILOT.
    total = round(sum(scores[d] * w for d, w in WEIGHTS.items()), 2)
    if total >= 3.0:
        verdict = "GO"
    elif total >= 2.0:
        verdict = "PILOT"
    else:
        verdict = "WAIT"
    return {"verdict": verdict, "weighted_total": total}


if __name__ == "__main__":
    # strong data + econ, but eval capability missing -> cap fires
    print(assess({"data": 3, "eval": 1, "infra": 3, "team": 3, "econ": 4}))

# The rubric as data, so the weights are auditable and the caps explicit.
weights:
  data: 0.30   # cap dimension
  eval: 0.25   # cap dimension
  infra: 0.15
  team: 0.15
  econ: 0.15
caps:
  - data   # score < 2 -> NO-GO regardless of weighted total
  - eval   # score < 2 -> NO-GO regardless of weighted total
thresholds:
  go: 3.0     # build and operate
  pilot: 2.0  # one scoped project with a hard eval gate
  wait: 0.0   # fix the two worst dimensions, then re-score
| Weighted total (after caps) | Verdict | What it means | The next move |
|---|---|---|---|
| Either cap fired (data or eval < 2) | NO-GO | A structural gap no other strength can offset | A 4-6 week data-foundation or eval sprint, then re-score |
| Below 2.0 | WAIT | Multiple dimensions are immature; a pilot would mostly debug foundations | Fix the two dimensions with the worst weighted impact first, then re-score |
| 2.0 to 2.9 | PILOT | Ready to test on one scoped project with a hard eval gate | Run a 4-6 week pilot with a walk-away clause on the lowest dimension |
| 3.0 and above | GO | Ready to build and operate, with the cost curve already modelled | Sequence by data-readiness and volume; ship the strongest project first |
Run that table and you've replaced a confidence survey with a scored decision. The weights make the trade-offs explicit, the caps stop the two structural failures from hiding behind a healthy average, and the thresholds turn the score into one of four honest verdicts: no-go, wait, pilot, or go. A team scoring 2.4 with both caps clear isn't being told it's behind; it's being told to pilot one scoped project with a hard eval gate, which is the most useful thing an assessment can say. The score isn't a grade. It's an instruction set for what to do next.
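To see what the caps prevent, it helps to compute the total a plain weighted average would have reported. This sketch reuses the rubric's weights and thresholds on a profile with strong data and economics but an eval score of 1; the helper names are ours and the profile is illustrative.

```python
WEIGHTS = {"data": 0.30, "eval": 0.25, "infra": 0.15, "team": 0.15, "econ": 0.15}
CAPS = ("data", "eval")


def weighted_total(scores):
    """Weighted total out of 4, rounded so float noise can't shift a verdict."""
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 2)


def verdict_without_caps(total):
    """What a plain weighted average would say if the caps did not exist."""
    if total >= 3.0:
        return "GO"
    if total >= 2.0:
        return "PILOT"
    return "WAIT"


def fired_caps(scores, floor=2):
    """Cap dimensions scoring below the floor."""
    return [d for d in CAPS if scores[d] < floor]


# Illustrative profile: strong data and economics, eval capability missing.
profile = {"data": 3, "eval": 1, "infra": 3, "team": 3, "econ": 4}
total = weighted_total(profile)
naive = verdict_without_caps(total)
caps = fired_caps(profile)
final = "NO-GO" if caps else naive
print(total, naive, caps, final)
```

The average alone lands at 2.65 and would wave this team into a pilot; the eval cap turns the same profile into a no-go, which is the difference the cap rules exist to make.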
From score to action: what to do at each readiness level
A score is only worth the action it triggers, so we map each verdict to a concrete next move rather than a feeling. The point of the rubric isn't to rank you; it's to tell you the cheapest thing to do next given where the weakness sits. Because the weights and caps localise the problem, the action is rarely "do everything"; it's almost always "fix this one dimension, then re-score." That's the difference between an assessment that informs a decision and a maturity badge that just describes a state.
Let's walk the verdicts in order, the way we'd talk a steering committee through them. A no-go from a fired cap means a scoped foundation sprint: a 4-to-6 week data-foundation effort if data is the gap, or an eval-capability sprint to build a golden set and wire a gate if the eval column is the problem, then a re-score. A wait verdict means multiple dimensions are immature and a pilot would mostly debug foundations, so fix the two with the worst weighted impact first. A pilot verdict means you're ready to test one scoped project with a hard eval gate and a walk-away clause on the weakest dimension. A go verdict means build and operate, with the cost curve already modelled. Once you've scored readiness, the next artifact is the sequencing plan; our AI strategy roadmap template picks up exactly where a pilot-or-go verdict leaves off, and our deep-dive on AI automation solutions covers the volume-heavy projects that score best on economics.
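"Worst weighted impact" needs a definition before it can be ranked. One reasonable reading, and it is our assumption rather than part of the rubric, is the weight times the gap to a perfect 4. A minimal sketch with a hypothetical wait-level profile:

```python
WEIGHTS = {"data": 0.30, "eval": 0.25, "infra": 0.15, "team": 0.15, "econ": 0.15}
MAX_SCORE = 4


def weighted_shortfall(scores):
    """Weight times the gap to a perfect score, per dimension (assumed definition)."""
    return {d: WEIGHTS[d] * (MAX_SCORE - scores[d]) for d in WEIGHTS}


def fix_first(scores, n=2):
    """The n dimensions whose improvement would move the weighted total most."""
    gaps = weighted_shortfall(scores)
    return sorted(gaps, key=gaps.get, reverse=True)[:n]


# Hypothetical wait-level profile: both caps clear, weighted total below 2.0.
profile = {"data": 2, "eval": 2, "infra": 1, "team": 1, "econ": 2}
priorities = fix_first(profile)
print(priorities)
```

Under that definition the two dimensions to fix first are data and eval, even though infrastructure and team have the lowest raw scores, because a point gained on a heavily weighted dimension moves the total further.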
| Verdict | Vendor tool's next step | What an honest verdict actually triggers |
|---|---|---|
| No-go (cap fired) | Rarely possible; the tool returns ready | A scoped data or eval foundation sprint, then a re-score |
| Wait (total < 2.0) | Book a sales call | Fix the two worst dimensions, hold the budget, re-score in a quarter |
| Pilot (2.0-2.9) | Buy the platform | One scoped project, a golden-set gate, a walk-away clause |
| Go (3.0+) | Buy the platform, faster | Build and operate; sequence by data-readiness and volume |
One closing note on action before the FAQ. The verdicts only work if the scores are taken against falsifiable tests and frozen before the work starts, the same discipline the eval gate enforces inside a pilot. A data score talked up from a 1 to a 3 in a steering meeting is no score; a cost-per-task estimate that ignores the operate-phase curve is no estimate. The value of an engineering-led AI readiness assessment is that it makes the uncomfortable verdicts cheap by reaching them early and numerically, and our MLOps services are where the eval-and-monitoring capability the rubric scores actually gets built.
FAQ: AI readiness assessment, in the buyer's vocabulary
What is an AI readiness assessment in plain language?
It's a scored measurement of whether your organisation can build, ship, and operate an AI system that survives production, taken across five dimensions: data readiness, infrastructure and integration, model and evaluation capability, team and process, and economics. Each dimension scores 0-to-4 against a falsifiable test an engineer can run, weighted by how strongly it predicts survival, with two cap rules that force a no-go if data or eval capability is too weak. The output is one of four verdicts, go, pilot, wait, or no-go, not a maturity badge.
What does an AI readiness assessment ROI calculation look like?
The AI readiness assessment ROI math runs through the economics dimension and has four inputs: monthly task volume, the cost-per-task baseline, the cost-per-task after automation, and the build plus ongoing cost. Monthly saving is volume times the unit-cost difference; payback is build cost divided by monthly saving. ROI timing is dominated by volume, so a high-volume workflow pays back in weeks while a low-volume one can take a year on the same unit savings. Model the after-automation unit cost honestly at roughly $0.003 per 1k tokens for a mid-tier model times real volume, and remember the operate-phase token bill, not the build, is usually the dominant line.
How is AI readiness assessment cost structured, and what should I expect?
On AI readiness assessment cost, the public answer is engagement shape rather than dollar tiers: a fixed-scope assessment with a walk-away clause, and where a gap is found, a 4-to-6 week data-foundation or eval-capability sprint before any model work. The technical economics that drive the decision are the per-task numbers: budget steady-state cost at roughly $0.003 per 1k tokens for a mid-tier model times your real volume, and weight the operate-phase curve heavily, because successful automations drive two-to-three times the spec'd volume and the inference bill bends with adoption.
What does the AI readiness assessment evaluation dimension measure?
The AI readiness assessment evaluation dimension measures whether the team can score model quality as a number on a fixed, versioned golden set before it ships, rather than judging quality by demo. For retrieval-augmented tasks we use Ragas and log runs through LangSmith or Langfuse; a team that can write and run that loop scores at least a 2, one that has it wired into CI with a blocking threshold scores a 3 or 4, and one that can't scores a 1, which fires the cap rule. This is the dimension most checklists omit, and it predicts production survival better than any other.
Can you give concrete AI readiness assessment examples of high and low scores?
The clearest AI readiness assessment examples come from the cap rules. A team with strong infrastructure on AWS Bedrock and a generous budget but data sitting unlabelled across spreadsheets scores a 1 on data, fires the cap, and gets a no-go despite the green columns; the right move is a data-foundation sprint. By contrast, three engineers who can pull data with one query, write a Ragas golden set, and own an on-call rotation can score a 3 overall and earn a go even with modest infrastructure, because the dimensions that gate survival, data and eval, are strong. Score against what an engineer can demonstrate today, and these examples separate cleanly.
Is this AI readiness assessment guide a substitute for an engagement?
This AI readiness assessment guide ships the artifact, the five-dimension weighted rubric with its 0-to-4 scale, cap rules, and go/no-go thresholds, so you can score yourself in an afternoon. What an engagement adds is the discipline of doing it under accountability: an outside team that scores data-readiness against a real query rather than a stakeholder's confidence, writes the eval as code, and is willing to return a wait or no-go verdict when a vendor tool wouldn't. The rubric is the map; the engagement is having makers run it with you and own the foundation sprint if a cap fires.
Why weight data readiness so heavily over infrastructure?
Because infrastructure is rentable and data isn't. You can stand up an inference endpoint on AWS Bedrock, Vertex AI, or Azure OpenAI in an afternoon, but you can't rent eighteen months of clean and labelled event data in queryable shape. Most readiness tools overweight infrastructure because it's the column they sell; an honest rubric puts the heaviest weight, 0.30 of the total, on data readiness and lets a low data score cap the entire verdict. The same logic applies to model and eval capability, the second-heaviest weight, because a team that can't measure quality ships drift no infrastructure can catch.
Ready to score your AI readiness against a rubric that's allowed to say wait?
We run the same five-dimension assessment, apply the cap rules, and hand back a scored go/no-go verdict with a concrete next move, whether that's a data-foundation sprint or a scoped pilot with an eval gate. Makers, not management consultants.