ai tools for business · the stack we ship

AI tools for business,
frameworks, models, observability, picked per workflow.

The 2026 AI tech stack we actually ship in: across 8 service pillars, model-agnostic, eval-tested. Claude · GPT · Gemini · open-weights for the model layer. LangGraph · OpenAI Agents SDK · CrewAI for orchestration. pgvector · Pinecone · Langfuse · Braintrust for data and observability. We pick. Listicle authors can't.

See the 5-layer stack

Default

Claude Sonnet 4.6

Anthropic

200K context $3 / M in · $15 / M out

Long-context · tool use · production default

GPT-5

OpenAI

128K context Mainline tier · best ratio

Realtime voice · vision · ecosystem maturity

Gemini 2.0 / Open-weights

Google · Llama · Qwen

1M+ · self-hosted Niche · infra-only · per-workload pick

Massive context · sovereign · cost-floor

Definition

What does this AI stack reference cover?

The GetWidget AI tools and frameworks reference is an operator-grade map of the production AI stack in 2026. Foundation models include Claude Sonnet 4.6, Haiku 4.5, GPT-5, GPT-5-mini, and open-weight options like Llama and Mistral. Retrieval uses pgvector, Pinecone, and Algolia for hybrid lexical-plus-vector search. Reranking pairs BAAI bge-reranker-large with Cohere Rerank or Voyage rerank. Orchestration spans LangGraph, CrewAI, AutoGen, n8n, and Temporal. Evaluation and observability cover Langfuse, Helicone, Ragas, and Braintrust. Voice ships on the OpenAI Realtime API (gpt-realtime-2), Whisper-large-v3, and Twilio Voice for telephony. Guardrails include Llama Guard 3, NeMo Guardrails, and Lakera. Deployment runs through AWS Bedrock with PrivateLink for regulated workloads, Cloudflare Workers for edge inference, or Azure OpenAI for enterprise Microsoft stacks. Each tool is named with its version, role on the production critical path, and the workflow it best supports. Maintained as a living reference, not a vendor-list aggregator.

AI service pillars we ship; each tool below maps to one · (2026-Q1)

Daily

we use Claude Code + OpenAI Codex internally · (2026-Q1)

Fixed-fee

audit picks the stack per workflow before you commit

30 days

first workflow live on the chosen stack

ai tools · best ai tools · ai model comparison

Every other page on this SERP is a list.
This one tells you which to use.

The top 10 results for ai tools, ai frameworks, and ai tools for business are 10 numbered lists. None of them tell you which tool wins for which job, which LLM behind it, or how much it costs to ship. We've made these picks 8 times for real buyers across our AI service pillars, so this hub is the synthesis, not another listicle.

Decision support, not a top-10 list

Every other page on this SERP is a numbered list of AI tools. This one tells you which framework wins for which job, which LLM behind it, and which of our 8 service pillars to engage if you want it shipped.

Model-agnostic, openly

We don't sell a tool. We pick across Claude, GPT, Gemini, and open-weights per workload, on your eval data, not a partner badge. Sibling pillars at /services/claude-development/ and /services/openai-development/ prove it.

Priced on this page, not behind a contact form

discovery audit, fixed-bid pilot, monthlynth continuous. Listicles avoid pricing because they don't sell the build. We do — so the number is on the page.

modern ai stack · ai infrastructure · ai tech stack

The AI stack we ship in,
five layers. Pick any to open.

Every enterprise ai platform on the SERP is one of these five layers in a single product wrapper. We treat the layers separately because that's how decisions actually get made in production: pick a model, pick an orchestration framework, pick a data layer, pick observability, pick infra. The best ai tools in 2026 don't live in one product; they live across these five layers, picked per workflow. Each layer below opens to the tools we name, the failure modes we've hit in production, and our default unless there's a reason not to.

tools we name
- Claude Sonnet 4.6
- Claude Haiku 4.5
- GPT-5
- GPT-5-mini
- Gemini 2.0
- Llama 3.3 (self-hosted)
- Qwen 2.5
- Mistral
- XGBoost · scikit (when an LLM is the wrong tool)
production failure modes
- Single-vendor lock-in: an outage takes the whole product down with no failover path.
- Picked the flagship for everything: classifier workflow costs 12× what it should.
- Default to LLM where a rule engine or classical ML would beat it on cost and quality.
our default · unless reason not to
Sonnet 4.6 for long-context tool use and production reasoning; Haiku 4.5 or GPT-5-mini for high-volume narrow tasks; GPT-5 where the Realtime API, vision pipeline, or Codex matters; open-weights for sovereign data and cost-floor at scale. The pick is a per-workflow decision based on the eval set, not a vendor badge.
tools we name
- LangGraph
- OpenAI Agents SDK
- CrewAI
- AutoGen
- Anthropic computer use
- Hand-rolled Python tool loops
- Temporal / Inngest (for the durable workflow shell)
production failure modes
- Reach for LangGraph on a 3-step workflow: the framework outweighs the problem.
- Agent loops on the same tool 12 times before hitting a max-step cutoff.
- Tool-use schema drifts from the model's actual output, causing silent JSON parse failures.
our default · unless reason not to
Plain Python tool loops for anything under ~6 steps. LangGraph when the graph branches or you need durable replay. OpenAI Agents SDK when the workflow is GPT-only and the team wants vendor-native ergonomics. CrewAI for fast prototyping of role-based multi-agent scenes. AutoGen rarely — we've found the others ship cleaner.
tools we name
- pgvector (Postgres)
- Pinecone
- Qdrant
- Weaviate
- LlamaIndex
- LangChain index
- Unstructured.io
- Cohere Rerank
- BM25 hybrid retrieval
production failure modes
- Embeddings re-index on every deploy: the vector-DB bill quietly balloons.
- Chunking ignores document structure, so answer quality plateaus and won't improve.
- No reranker — top-k retrieval returns plausible-but-wrong context.
our default · unless reason not to
pgvector on your existing Postgres for ≤2M chunks (operationally simpler than another vendor). Pinecone or Qdrant past that scale. LlamaIndex for the loader / chunker layer; LangChain only where its index abstractions actually save time. Hybrid BM25 + dense retrieval + Cohere Rerank by default — the quality lift is bigger than picking a fancier embedding model.
tools we name
- Langfuse
- Braintrust
- Arize Phoenix
- Helicone
- Inspect AI
- Hand-rolled pytest eval harnesses
- OpenTelemetry · GenAI semantic conventions
production failure modes
- No eval set: "works on my prompt" ships to prod and silently degrades over the next model swap.
- Trace data lives nowhere. The first you hear about a regression is a customer email.
- Cost telemetry isn't wired in; you find out about a runaway prompt loop on the monthly invoice.
our default · unless reason not to
Langfuse for prompt + trace observability across vendors; Braintrust or a hand-rolled pytest harness for the golden-set regression; shadow-mode mirroring before every cutover. AI observability is the highest-CPC keyword cluster on this hub for a reason: buyers are starting to ask. If there's no eval suite, there's no ship, even from us.
tools we name
- Anthropic API direct
- OpenAI API direct
- AWS Bedrock
- Azure OpenAI
- Google Vertex AI
- Cloudflare Workers AI
- PrivateLink · KMS · BAA
- Kubernetes / ECS for self-hosted open-weights
production failure modes
- AI vendor outage takes the whole product down with no multi-vendor failover.
- Prompt logs hit the wrong region, causing an accidental data-residency breach.
- Cold-start latency on serverless adds 800ms to a sub-second voice agent.
our default · unless reason not to
Anthropic + OpenAI direct for fastest model access; Azure OpenAI or AWS Bedrock when compliance posture (HIPAA BAA, SOC 2, FedRAMP) requires it. Multi-vendor failover wired in for any workload above monthly run cost. A single vendor at that scale is a liability we won't sign off on.

Defaults reflect our current 2026 operator playbook, picked per workflow, not per partner badge. The rationale for each is in the per-layer detail above.

claude vs gemini · anthropic vs openai · best llm for coding

Pick a model:
Claude · GPT · Gemini · open-weights.

The first decision on any AI build is the model — every other choice in your ai development tools stack flows from it. We ship all four families and pick per workflow on the eval data, not on the discovery-call demo. Below: when each one wins, when each one loses, and which sibling pillar covers the engagement detail. Claude vs gpt, the honest version, is almost always 'both, picked per workflow.'

Claude: Sonnet 4.6 / Haiku 4.5

Default for long-context tool use, production reasoning, and any workflow where instruction-following stability matters. Wins most coding-agent evals in our internal benchmarks (best llm for coding). Claude Code is the tool we use daily. Read the full Claude development pillar for the engagement detail.

OpenAI: GPT-5 / Realtime / Codex

Picked when the workflow needs the Realtime API (sub-second voice), GPT vision pipelines, image generation, or OpenAI Codex for engineering. ChatGPT brand recognition matters for some buyer-facing use cases. Read the OpenAI development pillar. Same operator team, different vendor.

Gemini: 2.0 / 1M+ context

Niche but real. Wins on massive-context document QA (1M+ token windows beat chunking + retrieval for some workloads) and tight Google Workspace integration. Not our default for general production work, but we'll ship it where the context size or the Workspace surface decides it.

Open-weights: Llama · Qwen · Mistral

Self-hosted Llama 3.3, Qwen 2.5, or Mistral when sovereign data, air-gapped deployment, or cost-floor-at-scale make the per-token bill on hosted vendors untenable. Run cost only, no per-token fee. The trade is operational overhead; if you don't have an SRE team, hosted vendors usually still win.

ai agent frameworks · best ai agent framework · langchain alternatives

Orchestration & agent frameworks —
LangGraph, OpenAI Agents SDK, CrewAI, AutoGen.

The four frameworks the 2026 SERP keeps citing, graded on production weight, vendor lock-in, and where each one breaks. LangGraph is our default for non-trivial agent work; the OpenAI Agents SDK wins when the team is GPT-only and wants vendor-native ergonomics. CrewAI lives in prototype land; AutoGen we rarely reach for. Full agent engagement detail lives on the AI agent development pillar.

Dimension

You're here LangGraph Stateful graph orchestration

OpenAI Agents SDK Vendor-native, GPT-only

CrewAI Role-based multi-agent

AutoGen Microsoft Research framework

Best when Where this framework fits naturally.

LangGraph Branching graphs · durable replay · multi-model routing

OpenAI Agents SDK GPT-only stack · vendor-native handoffs · short ramp

CrewAI Fast prototyping of role-based scenes · demos and PoCs

AutoGen Research-y multi-agent loops · less production tooling

Model lock-in How easy to swap vendors mid-build.

LangGraph Vendor-agnostic across Claude, GPT, Gemini, open-weights

OpenAI Agents SDK GPT-locked · porting cost if you ever need to switch

CrewAI Multi-vendor via LiteLLM under the hood

AutoGen Multi-vendor · OSS · pick your model adapter

Production weight How much framework you carry into prod.

LangGraph Heavy; worth it past ~6 steps, overkill under

OpenAI Agents SDK Light · idiomatic Python · easy to read 6 months later

CrewAI Light at first · gets brittle as scene grows

AutoGen Mid-weight; production patterns less battle-tested

Where it breaks The failure mode we've actually hit.

LangGraph State serialization gotchas when re-hydrating long runs

OpenAI Agents SDK Handoff tracing thin · debugging multi-agent gets murky

CrewAI Role prompts drift; quality regresses as agents accumulate

AutoGen Termination conditions vague; runs loop until budget

Our default pick Unless the workflow says otherwise.

LangGraph Yes, for branching, durable, multi-model agent work

OpenAI Agents SDK Yes, when the team is GPT-only and wants ergonomics

CrewAI Prototypes only · re-platform before going to prod

AutoGen Rarely · the others ship cleaner for the same job

Picks reflect production builds we've shipped in 2026. Frameworks below the row of 'mixed' verdicts are still useful; they just earn their weight less often than we'd hoped from their docs.

Building a multi-step agent? Start with the pillar.

We have a dedicated AI agent development pillar with our AgentRecipePicker, tool-schema patterns, and the agent engagement model. The matrix above is the shortlist. The full playbook is one click away.

Read the AI agent development pillar Book an AI stack audit

vector databases · rag frameworks · best vector database

Data & retrieval —
vector DBs, RAG, and the chunking question.

The data layer is where AI quality lives. The best vector database question matters less than the chunking and reranking strategy on top of it, and that strategy is where most production AI quality plateaus. We default to pgvector on your existing Postgres until scale or hybrid-search needs force a swap. Full data-flow detail lives on the AI integration services pillar.

Vector databases

Where the embeddings live. pgvector on your existing Postgres handles ≤2M chunks operationally, and is usually the right pick for the first build. Pinecone, Qdrant, and Weaviate take over past that scale or when you need hybrid search out of the box. Read the AI integration services pillar for the full data-flow story.

RAG frameworks & loaders

LlamaIndex for the loader / chunker / index layer. Its document parsing is harder to beat than people expect. LangChain only where its abstractions actually save time over plain Python. Unstructured.io for messy real-world inputs (scanned PDFs, slide decks, mixed-format archives). Best vector database is a less interesting question than best chunking strategy.

Reranking & hybrid retrieval

BM25 + dense vector + Cohere Rerank by default. The quality lift from reranking is consistently bigger than picking a fancier embedding model. If you've capped your RAG quality and are reaching for a bigger model to fix it, the answer is usually a reranker at a tenth of the cost.

Need AI plugged into your existing data and systems?

The AI integration services pillar covers the full DataFlowDiagram, PlatformExplorer for Salesforce / Slack / NetSuite, and the audit-log + retry patterns we ship.

Read AI integration services →

ai observability · llm observability · llm monitoring · mlops tools

Observability, eval, and cost —
the layer nobody on the SERP shows.

AI observability is the highest-paying keyword cluster in our research ($47.96 CPC on the head term), and zero of the top-10 listicles give it real treatment. This is where production AI lives or dies. Four pieces every serious stack needs: tracing, eval suites, drift monitoring, and cost telemetry. Read the AI development pillar for the eval and cost-of-ownership engagement detail; read the AI consulting pillar for the audit that picks them per workflow.

Tracing: Langfuse · Braintrust

Every prompt, completion, tool call, and cost recorded with a trace ID. Langfuse for OSS-friendly self-host; Braintrust when the team wants the managed product and the eval harness in one tool. Without trace data, the first you hear about a regression is a customer email.

Eval suites: Inspect AI · pytest

Golden-set regression run on every prompt change and every model swap. Inspect AI from the UK AISI for serious eval harnesses; hand-rolled pytest for narrow workflows. The eval set is the AI's unit test and the layer that survives the next model release. AI observability and llm observability live or die here.

Drift & quality monitoring

Shadow-mode mirroring before cutover; production sampling + LLM-as-judge scoring after. Catches the silent quality regression that token math alone won't. LLM monitoring is what separates a stack that's still shipping value at month 6 from one that quietly broke at month 2.

Cost telemetry: token + dollar

Per-workflow, per-customer cost dashboards. Wired in at the same layer as latency and error-rate, not bolted on by Finance later. We've watched runaway loops cost a client $9K in a weekend — cost telemetry would have alerted at $200. Read the AI development pillar for the eval and cost-of-ownership engagement model.

ai orchestration platform · enterprise ai platform · ai tools for enterprise

Tool to engagement —
pick the job, we'll point at the pillar.

This is the hub's structural payload. Each row is a job-to-be-done buyers actually arrive on this page for; each row maps to the stack we ship, why we picked it, and the sibling service pillar where the full engagement detail lives. We don't try to ship the build from this page. We route you to the pillar that does.

Dimension

Recommended stack What we ship today

Why we pick it The decision, in one line

You're here Read the full pillar Where the engagement detail lives

Build a generative AI app end-to-end Frontend + backend + retrieval + agent + eval.

Recommended stack Claude Sonnet · Next.js or Flutter · pgvector · LangGraph · Langfuse

Why we pick it Default stack for production GenAI: model-agnostic, eval-first

Read the full pillar → /services/ai-development/ (AI software development company)

Automate a back-office workflow with AI Agents doing the work, not just summarizing it.

Recommended stack LangGraph · Temporal · Claude Haiku · Salesforce/NetSuite/Slack tool use

Why we pick it Durable orchestration over your real systems · 6–8 week pilot

Read the full pillar → /services/ai-automation/ (AI automation agency)

Build a multi-step autonomous agent Tool use, memory, planning, recovery.

Recommended stack LangGraph + Claude Sonnet · custom tool schemas · AgentRecipePicker patterns

Why we pick it Best-of-breed agent framework + most stable tool-use model in production

Read the full pillar → /services/ai-agent-development/ (AI agent development)

Plug AI into Salesforce, Slack, NetSuite, etc. Integration into your existing system of record.

Recommended stack Claude or GPT · platform-native SDK · PlatformExplorer pattern · audit logs

Why we pick it Treat AI as one more service in your integration mesh, not a new platform

Read the full pillar → /services/ai-integration-services/ (AI integration services)

Ship a customer-service chatbot Web · WhatsApp · voice · Slack · email.

Recommended stack Claude Sonnet · RAG over docs+tickets · ChannelMatrix · human-in-loop fallback

Why we pick it Tier-1 deflection without the failure modes of pure-LLM bots

Read the full pillar → /services/ai-chatbot-development/ (AI chatbot development)

Build on Anthropic Claude specifically Sonnet 4.6 / Haiku 4.5 / Claude Code agents.

Recommended stack Claude Sonnet 4.6 · prompt caching · Anthropic computer use · LangGraph

Why we pick it Long-context tool use, instruction-following stability, token-cost playbook

Read the full pillar → /services/claude-development/ (Claude developers)

Build on OpenAI GPT specifically GPT-5 · Realtime API · Codex · Assistants.

Recommended stack GPT-5 + Realtime API · Agents SDK or LangGraph · vision pipelines · Codex

Why we pick it Sub-second voice, vision, and the Codex eng workflow we use internally

Read the full pillar → /services/openai-development/ (OpenAI developers)

Get a strategy + roadmap before you build AI consulting · audit · readiness assessment.

Recommended stack Workflow audit · ROI prioritisation · model-pick rationale · 90-day roadmap

Why we pick it Fit-test before commit. Most teams arrive here, then pilot the top workflow

Read the full pillar → /services/ai-consulting/ (AI consulting company)

Eight jobs, eight pillars. If the job-to-be-done isn't in this table, the discovery audit exists to find which pillar (or which non-AI answer) fits. Sometimes the right answer is a non-AI baseline.

ai tools comparison · ai framework comparison · ai tools consulting

Build with a framework, or buy a platform —
ten honest checks.

Every listicle on this SERP gives every tool a glowing paragraph. Real buyer-discipline says some workflows shouldn't be built at all. Lindy or Stack AI or a $50/seat SaaS will outperform a custom build at a tenth the cost. Below: ten green/red flag checks we run before quoting a pilot. If the green column wins, we build. If the red column wins, we'll tell you to buy.

your vendor scorecard

0/10 keep looking

tap pass / fail on each criterion · saved locally in your browser

01
Eval set already exists

You have a golden set of ~50–200 real inputs with expected outputs. You know what "good" looks like and can measure it.

There's no eval set. Quality is judged by whichever prompt the founder demoed last. Build a regex baseline first.
02
The workflow is non-trivial

≥3 steps, branching logic, tool use against real systems, or genuinely novel reasoning. A framework earns its weight.

Single API call to a model with a prompt and a return value. Use the vendor SDK directly. Skip the framework entirely.
03
You need multi-vendor failover

Workload is above ~monthly or business-critical. Worth wiring routing across Claude + GPT + open-weights from day one.

PoC budget under $500/mo, single vendor for now is fine. Add abstraction when revenue says you can afford the overhead.
04
Volume justifies running the infra

Steady-state >10M tokens/month on a workload where open-weights would win on $/token. Self-host pays back in <6 months.

Bursty, narrow, or low-volume. Hosted Claude/GPT is cheaper than the SRE time to run open-weights.
05
Privacy/compliance forces self-host

HIPAA, FedRAMP, sovereign data, air-gapped deployment. The vendor BAA + PrivateLink genuinely doesn't cover it.

"We're worried about data" with no specific regulation. Azure OpenAI BAA + PrivateLink covers most of this. Buy the managed platform.
06
Latency budget is tight

Sub-second voice agent · live-stream UX · sub-200ms classification at scale. You need to control the runtime.

Async batch processing, human-in-the-loop, or anything where a 2-second response is acceptable. Buy the managed API.
07
Team can operate the layer

You have or can hire an ML/AI engineer who can debug a LangGraph state machine or a self-hosted Llama deployment.

Single full-stack dev who's never debugged a vector index. Buy a managed platform; operating cost will eat the savings.
08
Workflow shape will keep changing

Active product evolution, new tool integrations every sprint, prompt-engineering owned in-house. Framework gives you control.

Stable, narrow workflow that hasn't changed in 6 months. Managed SaaS (Lindy, Stack AI, Glean) is cheaper and faster.
09
Token cost is the bottleneck

You've maxed prompt caching, batch API, and model routing, and the bill still hurts. Open-weights or a custom optimization pass earns it.

You haven't tried prompt caching or batch API yet. Do those first; they typically cut effective cost to 8–15% of naive.
10
There's no off-the-shelf product

You looked. Honestly. Lindy, Stack AI, Glean, Zapier AI, vertical SaaS — none fit the workflow. Build is the only path.

There's a product that does 80% of it for $50/seat/mo. Buy that, integrate around the edges, save the build budget for what's actually unique.

Copy this rubric into your next AI vendor discovery call. If a vendor can't honestly score themselves on the red side of any row, that's the data point.

Sometimes the right answer is no AI framework at all.

Three cases where ai tools for business is the wrong frame entirely:

Regex beats an LLM. Form parsing on a single PDF type, structured ETL, narrow classification: a regex plus a small model outperforms LangGraph at a tenth of the cost.
A managed SaaS beats a custom build. If Lindy, Stack AI, Glean, Zapier AI, or a vertical SaaS covers 80% of the workflow for $50/seat/mo, buy it. Spend the build budget on what's actually unique.
An audit beats a pilot. If the eval set can't be built because the business process isn't measured yet, the AI build is premature. Fix the measurement first. The fixed-fee AI consulting audit exists partly to spot this.

We'd rather walk away from a pilot than ship a framework where one didn't belong.

ai tools consulting · engagement model

Three ways to start.
Audit, pilot, or continuous.

Same pricing as every sibling pillar: discovery audit, fixed-bid pilot, monthly continuous. The audit picks the tools for each workflow before any build budget gets committed; the pilot ships one workflow end-to-end on the chosen stack; the continuous team carries the next ones with quarterly re-evaluation. We pick across vendors openly, and we'll tell you when no tool is the right answer.

1–2 weeks

AI stack audit

We pick the tools and frameworks for each workflow before you commit a build budget.

Fixed-fee fixed

Workflow inventory + ranked ROI shortlist
Per-workflow model pick (Claude / GPT / Gemini / open-weights)
Recommended framework (LangGraph / OpenAI Agents SDK / plain Python / SaaS)
Vector DB + retrieval recommendation per data shape
Observability + eval-suite spec for the chosen stack
90-day implementation roadmap with named tools

Most teams start here

4–8 weeks

AI stack pilot

One workflow shipped end-to-end on the chosen stack, with eval, monitoring, and runbook included.

Fixed-bid fixed price

Eval set rebuilt against your real data
Build on the audit-recommended stack (model · framework · DB · obs)
Deploy behind a feature flag with shadow-mode mirroring
Token-optimization pass post-cutover
Walk-away point: if the metric won't move, no phase 2

Monthly

Continuous AI team

Embedded squad shipping the next workflow on the same stack, or migrating it when something better lands.

monthly per month

PM + AI engineer + ops analyst, embedded
Monthly cost-of-ownership + token-spend report
Drift, eval, and retry-rate monitoring
Tool/framework re-evaluation every quarter; we'll tell you when to swap
Cancel any month, no annual contract

Talk to us

Model-agnostic, openly Your repo, your prompts, your keys BAA / DPA available Tools picked per workflow, not per partner badge

frequently asked

Questions AI tool buyers ask most.
Honest answers, including when to walk away.

Which AI framework should we use for our use case?

The honest answer: it depends on the workflow shape, not the framework's marketing page. Three rules of thumb. (1) Under ~6 steps, no branching, narrow tool use → plain Python with the vendor SDK directly. A framework adds weight you don't need. (2) Branching graph, durable replay, multi-vendor routing → LangGraph. It's the most production-tested orchestration layer in the 2026 stack. (3) Multi-agent role-based scenes → CrewAI for the prototype, then re-platform to LangGraph before production. The discovery audit picks the framework per workflow with your eval data. We'd rather you not build on a framework that's wrong for the shape.

What's the difference between LangChain, LangGraph, CrewAI, and the OpenAI Agents SDK?

LangChain is the older general-purpose toolkit. It has useful pieces (loaders, retrievers), but production code on top of it tends to fight the abstractions. LangGraph is the same team's stateful graph framework, what we reach for when the orchestration is non-trivial; vendor-agnostic. CrewAI is role-based multi-agent — great for prototypes of "researcher + writer + reviewer" scenes, brittle as the cast grows. OpenAI Agents SDK is GPT-native with clean ergonomics if your stack is GPT-only and you don't need multi-vendor failover. Most langchain alternatives questions on this SERP collapse to: "use LangGraph for graphs, OpenAI Agents SDK for GPT-only, plain Python for everything under 6 steps." Read the full AI agent development pillar for our agent engagement model.

Should we pick Claude, GPT, or Gemini for our build?

We ship all three. Claude vs GPT vs Gemini, briefly: Claude Sonnet 4.6 wins most of our internal coding-agent and long-context tool-use evals. It's the default for production agent work and the model we'd reach for on best llm for coding. GPT-5 wins when the workflow needs the Realtime API for sub-second voice, vision pipelines, image gen, or the Codex engineering workflow. Gemini 2.0 wins on massive-context document QA (1M+ tokens beats chunking on some workloads) and tight Google Workspace integration. Open-weights win on sovereign data and cost-floor at scale. The right answer to anthropic vs openai is almost always "both" — pick per workflow on the eval, not per partner badge. See the dedicated pillars: Claude development and OpenAI development.

Do we need a vector database, or is the LLM's context window enough?

Both, often. The rule we ship: if your retrievable corpus is under ~200K tokens and changes rarely, paste it into the system prompt and use prompt caching. No vector database needed. If it's larger, changes, or needs precise retrieval, you need vector search. The best vector database question is less interesting than the chunking and reranking question. We default to pgvector on existing Postgres for ≤2M chunks (operationally simpler than another vendor) and Pinecone or Qdrant past that. Hybrid BM25 + dense retrieval + Cohere Rerank consistently beats picking a fancier embedding model. The retrieval engagement detail lives on the AI integration services pillar.

What does an "enterprise AI platform" actually mean — and do we need one?

"Enterprise AI platform" is the analyst-coined umbrella for everything in our 5-layer stack glued together with audit logs, RBAC, BAA paperwork, and a single vendor invoice. Real enterprise ai tools today sit in three buckets. (1) Hyperscaler stacks (AWS Bedrock, Azure OpenAI, Google Vertex) when compliance posture (HIPAA, SOC 2, FedRAMP) forces the buy. (2) Managed agent platforms (Lindy, Stack AI, Glean) when the workflow is narrow and the build doesn't earn its keep. (3) Custom stacks like ours when the workflow is unique enough that platforms can't shape it. Most enterprises we work with end up with all three: a hyperscaler for infra, a SaaS for narrow workflows, and a custom build for the high-leverage one. The audit picks the bucket per workflow.

How do you measure AI quality in production?

AI observability and llm observability is the layer most listicles skip and the layer where production AI lives or dies. Four pieces. (1) Tracing: every prompt, completion, tool call, and cost stamped with a trace ID. Langfuse for OSS-friendly self-host; Braintrust when the team wants the eval harness in the same tool. (2) Eval suites: golden-set regression run on every prompt change and every model swap. Inspect AI for serious harnesses; hand-rolled pytest for narrow workflows. (3) Drift monitoring: shadow-mode mirroring before cutover, production sampling + LLM-as-judge scoring after. (4) Cost telemetry: per-workflow, per-customer dashboards wired in at the same layer as latency. Without the eval set, you don't know when the model swap regressed quality. Without cost telemetry, a runaway prompt loop costs $9K before anyone notices.

How much does it cost to set up an AI stack?

Three engagement tiers, no surprises. A one-week ai infrastructure audit is fixed-fee: workflow inventory, model + framework pick per workflow, token-cost projection, observability spec, 90-day roadmap. A pilot is fixed-bid, 4–8 weeks, covering one workflow shipped end-to-end on the chosen stack with eval, monitoring, and runbook. A continuous AI team is monthly: embedded PM + AI engineer + ops analyst, monthly cost-of-ownership reporting, and quarterly re-evaluation of the tooling. Per-workflow run cost at steady state typically lands at $200–$1,500/month depending on volume and which model tier the workflow uses. Listicles avoid pricing because they don't sell the build. We do, so the numbers are on the page.

When is "no framework" the right answer?

Three honest cases. (1) The workflow is under ~6 steps with no branching. Plain Python and the vendor SDK is cleaner, easier to debug, and faster to ship than LangGraph. (2) An off-the-shelf product covers 80% of it. If Lindy, Stack AI, Glean, Zapier AI, or a vertical SaaS does the workflow for $50/seat/month, buy it and save the build budget for what's actually unique. (3) The problem isn't AI-shaped. Regex, rule engines, classical ML (XGBoost, scikit) outperform LLM calls on cost and quality for plenty of workloads: form parsing on a single PDF type, structured ETL, narrow classification. The AI consulting audit exists partly to tell you when the answer is "don't build with an AI framework at all."

keep exploring · sibling ai pillars

Related pages.
Pick the engagement that fits.

Every section above ends with a sibling-pillar link. The six below are the full set. Each is a depth pillar with its own anchor diagram, capability patterns, and engagement model. This hub is the meta view; the pillars are where the build happens.

01 Service

Claude Development

Anthropic Claude: Sonnet 4.6 + Haiku 4.5 agents, Claude Code, prompt caching, and the token-optimization playbook.

02 Service

OpenAI Development

GPT-5 + Realtime API voice agents, Codex, Assistants, vision pipelines — the other half of the model stack we ship.

03 Service

AI Agent Development

Multi-step autonomous agents: LangGraph, tool use, memory, and the AgentRecipePicker patterns we ship in production.

04 Service

AI Integration Services

Plug Claude or GPT into Salesforce, Slack, NetSuite, and your data warehouse with audit-logged tool use.

05 Service

AI Software Development

End-to-end AI software development company work: eval-first, model-agnostic, with the stack-we-ship-in anchor diagram.

06 Service

AI Consulting

Strategy and roadmap before the build. Pick the right tools for the right workflows before committing a budget.

Ready to ship

Book an AI stack audit
and walk out with the tools picked.

One to two weeks, fixed-fee. We inventory your candidate AI workflows, rank them by ROI, pick the model and framework per workflow, recommend the data + observability + infra layer, project token cost at steady state, and hand you a 90-day implementation roadmap. No deck, no obligation to build with us afterward.

See the related pillars

Fixed-fee · 1–2 weeks · no obligation to pilot Model-agnostic across Claude, GPT, Gemini, open-weights Tools picked per workflow, not per partner badge

Updated May 20, 2026 · By Navin Sharma

AI tools for business, frameworks, models, observability, picked per workflow.