AI Strategy
How to Actually Evaluate an AI Vendor
The first AI demo I sat through that genuinely impressed me was for a document intelligence product. The system pulled structure from unstructured contracts with accuracy that felt like magic. The vendor’s team was sharp, the use case was concrete, and the timing was right — we had a real problem this thing could solve. We signed a pilot within six weeks.
Three months later, the accuracy numbers in production were roughly half what the demo showed. Not because the vendor had lied, exactly. Because the demo used a document corpus that happened to be structurally clean, consistently formatted, and well within the model’s training distribution. Our actual document library was none of those things.
That experience changed how I evaluate AI vendors. The gap between demo performance and production performance is the central fact of AI procurement, and it’s structural — not a sign that anyone is being dishonest. The gap exists because the demo is an optimized environment and production is not. Your job as a buyer is to collapse that gap before you sign, not after.
Ask for adversarial inputs, not showcase inputs
The standard demo flow gives you the vendor’s best cases. What you want is their worst cases — specifically, the inputs that are most representative of your actual data, including the messy, ambiguous, inconsistently formatted material that makes up the long tail of any real-world corpus.
Before any serious evaluation, I now ask vendors for a sandboxed environment where I can run my own test cases without sharing my data. Not their test cases. Mine. This request separates vendors quickly. The ones who push back — who want to curate the evaluation, who insist that their test suite is sufficient — are telling you something important about how they’ll handle production issues.
The categories of adversarial inputs worth testing are specific: edge cases that are semantically valid but structurally unusual, inputs with deliberate ambiguity that a human would resolve through context, cases where the “right” answer is genuinely uncertain, and inputs that are just outside the stated scope. That last category is especially important. AI systems fail gracefully or they fail catastrophically, and you want to know which before you’re in production.
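For concreteness, here is a minimal sketch of the kind of harness I run in that sandbox. It assumes the vendor exposes some way to score a document programmatically (`score_document` below is a hypothetical stand-in for their client) and that each of my own test cases is tagged with one of those categories:

```python
import json
from collections import defaultdict

# The four categories from the paragraph above; every case in my corpus carries one.
CATEGORIES = {"structural_edge_case", "ambiguous", "uncertain_ground_truth", "just_out_of_scope"}

def score_document(doc: str):
    """Hypothetical stand-in for a call to the vendor's sandboxed API.
    Replace with whatever client the vendor actually provides."""
    raise NotImplementedError

def run_evaluation(test_cases: list[dict]) -> dict[str, float]:
    """test_cases: [{"category": ..., "input": ..., "expected": ...}, ...]"""
    results = defaultdict(lambda: {"correct": 0, "total": 0})
    for case in test_cases:
        assert case["category"] in CATEGORIES, f"untagged case: {case}"
        output = score_document(case["input"])
        bucket = results[case["category"]]
        bucket["total"] += 1
        # Exact match is a simplification; use whatever comparison fits your task.
        if output == case["expected"]:
            bucket["correct"] += 1
    # Report accuracy per category rather than one aggregate number:
    # the aggregate is exactly what a curated demo is built to maximize.
    return {cat: r["correct"] / r["total"] for cat, r in results.items() if r["total"]}

if __name__ == "__main__":
    with open("my_test_cases.json") as f:  # my cases, never the vendor's
        print(json.dumps(run_evaluation(json.load(f)), indent=2))
```

The per-category breakdown is the point. A single blended accuracy score will look fine long after the out-of-scope and ambiguous buckets have fallen apart.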
Interrogate the latency and cost curves
AI products have non-linear cost and performance characteristics that traditional software doesn’t. A system that works at 100 queries per day may behave differently at 10,000 queries per day — not just in throughput but in output quality, because many systems make different computational tradeoffs under load. Ask specifically: what changes at 10x your expected volume? What are the rate limits? What are the cost-per-query economics at that scale, and what’s the contractual ceiling?
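A rough cost model makes the 10x question concrete before the negotiation rather than after. The prices and thresholds below are invented; substitute the vendor's actual rate card and contractual tiers:

```python
# Illustrative numbers only; substitute the vendor's actual rate card and tiers.
PRICE_PER_QUERY = 0.03        # dollars per query at the pilot tier (invented)
DISCOUNT_THRESHOLD = 100_000  # queries per month before the lower tier kicks in (invented)
DISCOUNTED_PRICE = 0.018      # dollars per query above the threshold (invented)

def monthly_cost(queries_per_day: int) -> float:
    monthly = queries_per_day * 30
    if monthly <= DISCOUNT_THRESHOLD:
        return monthly * PRICE_PER_QUERY
    return (DISCOUNT_THRESHOLD * PRICE_PER_QUERY
            + (monthly - DISCOUNT_THRESHOLD) * DISCOUNTED_PRICE)

# The 10x question, in numbers: current volume, expected volume, and 10x expected.
for qpd in (100, 1_000, 10_000):
    print(f"{qpd:>6} queries/day -> ${monthly_cost(qpd):,.2f}/month")
```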
The latency question is equally important and equally often glossed over in demos. The vendor will show you p50 latency. You need p95 and p99. You need to know what the latency distribution looks like on a Monday morning when enterprise customers are all doing their heaviest work simultaneously. A product that feels fast in a demo and slow in production is a different product.
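If the vendor won't hand over a latency distribution, collect one yourself by replaying traffic shaped like yours and summarizing the tail. A minimal sketch, assuming you've captured raw per-request latencies from a load test (`collect_latencies` is a placeholder for whatever load-test client you use):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize the tail of a load-test run, not just the median."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    q = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Replay traffic shaped like your own worst hour (the Monday-morning burst),
# then summarize. collect_latencies is a placeholder for your load-test client:
#   print(latency_percentiles(collect_latencies("monday_morning_trace.json")))
```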
The model update problem
This is the thing nobody talks about in sales cycles. Foundation model vendors update their models. Sometimes the updates improve the behavior you care about. Sometimes they don’t. In either case, the behavior changes — and if your product is built on top of a model that just changed, your product changed without you deciding that it would.
Ask every AI vendor two questions: how are model updates communicated before they happen, and do you offer version pinning, meaning the ability to stay on a known model version until you've validated the new one? The answers tell you about their engineering culture and their relationship with their customers. A vendor who treats model updates as features rather than risks to be managed is a vendor who has never had to explain to a customer why their outputs changed on a Tuesday.
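If they do offer pinning, make the pinned version an explicit value in your own configuration rather than trusting a floating "latest" alias. A sketch of what I mean, with entirely hypothetical model names and config keys, since every vendor exposes this differently:

```python
# Illustrative config, not any particular vendor's API; the model name is invented.
VENDOR_CONFIG = {
    "model": "contract-extractor-2024-11-05",  # explicitly pinned version
    # "model": "contract-extractor-latest",    # the floating alias to avoid in production
    "timeout_s": 30,
}

def assert_pinned(config: dict) -> None:
    """Fail fast in CI if someone switches the integration to a floating alias."""
    if config["model"].endswith(("latest", "stable")):
        raise ValueError("Production config must pin an explicit model version.")

assert_pinned(VENDOR_CONFIG)
```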
The deeper issue is that most AI vendor contracts are written like SaaS contracts, with SLAs defined around availability rather than output quality. Availability is easy to measure. Quality is not. That asymmetry means you can have a vendor who is technically meeting their SLA while delivering a product that’s meaningfully degraded from what you evaluated. Build evaluation metrics into the contract, or at minimum into your internal review cadence. Don’t rely on the vendor to tell you when things get worse.
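In practice, that review cadence can be as simple as re-running the same evaluation harness on a schedule and comparing the results against the scores you accepted at signing. A minimal sketch, assuming the per-category output of the harness above and a regression threshold you choose yourself:

```python
import json

BASELINE_PATH = "eval_baseline.json"  # per-category accuracy from the evaluation you signed on
ALERT_THRESHOLD = 0.05                # flag drops of more than five points; set your own tolerance

def check_for_regression(current: dict[str, float]) -> list[str]:
    """Compare a fresh eval run against the baseline accepted at signing.
    `current` comes from the same harness used during vendor evaluation."""
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)
    regressions = []
    for category, base_score in baseline.items():
        now = current.get(category, 0.0)
        if base_score - now > ALERT_THRESHOLD:
            regressions.append(f"{category}: {base_score:.2f} -> {now:.2f}")
    return regressions

# Run on a schedule (weekly, or whenever the vendor announces an update) and
# route any regressions to whoever owns the vendor relationship.
```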