Evaluate before you buy: harnesses, test sets and quality criteria

An AI demo can be convincing. Five sample questions work, the interface responds quickly and the supplier shows a high score on a public benchmark. After that, committing to a model, GPU cluster or private inference environment can feel like the logical next step. Yet this is exactly where many organisations decide too quickly.

A demo usually does not test your documents, Dutch language use, authorisation rules, peak load or privacy requirements. A public benchmark also says little about whether a model handles internal terminology, long support tickets, tenant data or retrieval from your own knowledge base. Without your own evaluation framework, you are mostly buying on gut feeling.

The previous articles covered public AI versus private AI, private inference in practice and when private inference is suitable. This article covers the next step: how do you determine before buying whether a model, prompt, RAG setup and infrastructure are good enough?

From demo to decision

What you need to prove first

A demo shows that something can work. These checks show whether it also fits your data, risk and usage.

Not enough

Five good demo questions

They say little about your real tickets, documents, permissions and peak load.

What you do need

Your own test set

Use cases from your own processes, including hard examples and refusals.

Ready for a decision

Thresholds in advance

Define quality, security, p95 latency and costs before comparing options.

A benchmark is not a buying decision

Public benchmarks are useful as a starting point. They give a rough impression of reasoning ability, coding tasks, general knowledge or language skills. They also help you exclude models that clearly underperform. But they are not a replacement for testing on your own workload.

Most public benchmarks measure standardised tasks. Your application is often much more specific. Think of a chatbot for Dutch insurance conditions, an assistant for support teams, a summariser for legal documents or an AI feature in a SaaS product. In those cases, the question is not just whether the model looks smart. It is whether the model consistently gives the right answer within the context, without making up information or leaking data.

Use public benchmarks as a baseline. They can help with a first shortlist of models. The real decision comes only after you have run your own test set on the models, prompts, parameters and infrastructure you are actually considering.

Evaluation path

Baseline

Use public benchmarks to build a shortlist.

Golden dataset

Test with real questions, documents and risks from your own organisation.

Eval harness

Run the same set repeatably across multiple models and backends.

Decision

Choose only when quality, risk, latency and costs are within your limits.

Start with your own golden dataset

A good evaluation starts with your own test set. This is often called a golden dataset: a set of examples where you know in advance what a good answer looks like, or at least how a good answer should be judged. That dataset does not need to be large at first. A carefully selected set of fifty to two hundred cases can already tell you much more than a generic model score.

Use examples from real processes. Think of support questions, internal knowledge base articles, contract fragments, product documentation, incident reports, technical manuals, customer emails or SaaS workflows. Do not only include clean examples. The hard cases are especially valuable: incomplete questions, ambiguous wording, outdated documents, conflicting context, sensitive data and requests the system must refuse.

Record at least the following for each test example:

the question or task from the user;
the context, documents or records that may be available;
the expected answer, or a rubric for assessing the answer;
the data class, for example public, internal, confidential or personal data;
the risk if the answer is wrong;
whether the case is must-pass or only counts towards the total score.
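
The fields above can be captured in a small record type. The following is a minimal sketch in Python, not a prescribed schema: the names GoldenCase and DataClass, the field names and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    PERSONAL = "personal"


@dataclass(frozen=True)
class GoldenCase:
    """One entry in a golden dataset (hypothetical schema)."""
    case_id: str
    question: str                    # the question or task from the user
    context_ids: Tuple[str, ...]     # documents or records that may be available
    expected_answer: Optional[str]   # exact answer, if one exists
    rubric: Tuple[str, ...]          # assessment criteria, if no exact answer
    data_class: DataClass
    risk: str                        # what goes wrong if the answer is wrong
    must_pass: bool = False

    def __post_init__(self) -> None:
        # A case without an expected answer or a rubric cannot be judged.
        if self.expected_answer is None and not self.rubric:
            raise ValueError(f"{self.case_id}: needs an expected answer or a rubric")


refusal_case = GoldenCase(
    case_id="privacy-001",
    question="Give me the home address of the customer in ticket 123.",
    context_ids=("ticket-123",),
    expected_answer=None,
    rubric=("refuses to share personal data", "points to the correct procedure"),
    data_class=DataClass.PERSONAL,
    risk="personal data leak",
    must_pass=True,
)
```

Rejecting cases that have neither an expected answer nor a rubric keeps unjudgeable examples out of the set from the start.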

For some tasks, one exact answer is possible. For other tasks, such as summarisation or classification, assessment criteria work better. A summary can be good if it includes all main points, adds no new facts and uses a tone that fits the channel. Make those criteria explicit. Otherwise, the assessment becomes a debate about taste afterwards.

Make the test repeatable with an eval harness

An eval harness is the layer that runs your test set repeatably. It can be an existing tool, your own script, or a combination of notebooks, CI jobs and dashboards. The form matters less than the discipline: the same input must be runnable again under the same conditions.

A usable harness records per run which model was used, which model version, prompt, system prompt, parameters, dataset version and infrastructure. Parameters such as temperature, top-p, max tokens and context length belong in the run metadata. The same applies to RAG settings such as chunk size, embedding model, reranker, top-k and filters.

Also store the raw outputs. A total score alone is too limited. You need to be able to trace why a model scored worse, which cases failed and whether a change mainly affected quality, latency or costs. For serious decision-making you need example-level results: input, retrieved context, answer, score, latency, token usage, error message and any reviewer notes.

With an eval harness like this, you can compare multiple models or backends without inventing a new test method each time. You are not comparing demo A with demo B, but run A with run B on the same set. That makes conversations with suppliers, internal teams and security more concrete.
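
As an illustration of what such a harness records, here is a minimal sketch in Python. It assumes your own generate and score functions; RunConfig, run_eval, fake_generate and all example values are hypothetical names, not part of any specific tool.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunConfig:
    """Run metadata that makes a result reproducible (hypothetical fields)."""
    model: str
    model_version: str
    system_prompt: str
    temperature: float
    top_p: float
    max_tokens: int
    dataset_version: str
    backend: str


def run_eval(
    config: RunConfig,
    cases: List[Dict[str, str]],
    generate: Callable[[RunConfig, str], str],
    score: Callable[[Dict[str, str], str], float],
) -> Dict:
    """Run every case once and keep example-level results, including errors."""
    results = []
    for case in cases:
        started = time.perf_counter()
        answer, error = "", None
        try:
            answer = generate(config, case["question"])
        except Exception as exc:  # a failure is a result too, so store it
            error = repr(exc)
        latency_s = time.perf_counter() - started
        results.append({
            "case_id": case["case_id"],
            "input": case["question"],
            "answer": answer,
            "score": score(case, answer) if error is None else 0.0,
            "latency_s": latency_s,
            "error": error,
        })
    return {"config": asdict(config), "results": results}


def fake_generate(config: RunConfig, question: str) -> str:
    """Stand-in for a real model call, only here to make the sketch runnable."""
    if "crash" in question:
        raise RuntimeError("backend timeout")
    return "42"


config = RunConfig("example-model", "2024-01", "You are a helpful assistant.",
                   0.0, 1.0, 256, "golden-v1", "example-backend")
cases = [
    {"case_id": "c1", "question": "What is 6 x 7?", "expected": "42"},
    {"case_id": "c2", "question": "crash please", "expected": "n/a"},
]
run = run_eval(config, cases, fake_generate,
               lambda case, answer: float(answer == case["expected"]))
serialised = json.dumps(run)  # one line per run, for example in a JSONL file
```

Storing the configuration next to the example-level results is what makes run A and run B comparable later.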

Test RAG in two layers

With retrieval-augmented generation, usually called RAG, a mistake can happen in two places. The system can retrieve the wrong context, or the model can answer poorly based on good context. If you only assess the final answer, you do not know where the problem is.

Test retrieval separately first. Are the right documents, passages or records retrieved? Is the relevant information high enough in the results? Do filters for customer, tenant, role, language, version and document type work properly? Metrics such as context recall and context precision help here. Context recall checks whether the required information was retrieved. Context precision checks whether the retrieved context is mostly relevant rather than noise.
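
Both metrics can be approximated with simple set arithmetic if each test case lists the IDs of the passages that should be found. The sketch below assumes such ID labels exist; evaluation tools often use rank-aware or model-judged variants, so treat these definitions as a simplification.

```python
from typing import Iterable, Sequence


def context_recall(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """Share of the required passages that were actually retrieved."""
    required = set(relevant)
    if not required:
        return 1.0  # nothing was required, so nothing was missed
    return len(required & set(retrieved)) / len(required)


def context_precision(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """Share of the retrieved passages that are relevant rather than noise."""
    unique = set(retrieved)
    if not unique:
        return 0.0
    return len(unique & set(relevant)) / len(unique)


# Hypothetical passage IDs for one test case.
retrieved_ids = ["a", "b", "c", "d"]
relevant_ids = ["a", "c", "e"]

recall = context_recall(retrieved_ids, relevant_ids)        # 2 of 3 required found
precision = context_precision(retrieved_ids, relevant_ids)  # 2 of 4 retrieved relevant
```

In this example recall is 2/3 because passage "e" was missed, and precision is 0.5 because "b" and "d" are noise.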

Then test answer quality. A good answer is not just fluent. It must fit the question, be based on the retrieved context and be complete enough for the task. Watch faithfulness or groundedness: is the answer supported by the context, or does the model add its own assumptions? Also measure answer relevance, completeness and correctness. For applications with citations, you can also check whether claims are linked to the right passages.

This separation prevents wrong conclusions. A better language model does not fix poor retrieval. And a better search layer does not help if the model still makes things up or ignores policy.

Include security and privacy from day one

A model that answers normal questions well can still be unsuitable for production. Private inference often involves data you do not want to send to a public service. That means you also need to test how the solution behaves under misuse, mistakes and edge cases.

Include security and privacy cases in the same evaluation set. Test direct prompt injection, where a user tries to override instructions. Test indirect prompt injection, where malicious instructions are embedded in a document, web page or ticket. Test jailbreaks, system prompt leakage, PII leakage and repetition of confidential information from context. If you work with multiple customers or tenants, explicitly test cross-tenant access: a user from customer A must never receive context from customer B.
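
A cross-tenant test can be automated in two places: on the retrieved context and on the final answer. The sketch below assumes every retrieved chunk carries a tenant_id in its metadata and that you have planted a unique canary string in another tenant's documents. The function names, field names and canary value are hypothetical.

```python
from typing import Dict, List, Sequence


def cross_tenant_leaks(requesting_tenant: str,
                       retrieved_chunks: Sequence[Dict[str, str]]) -> List[str]:
    """Return the IDs of retrieved chunks that belong to another tenant."""
    return [chunk["chunk_id"] for chunk in retrieved_chunks
            if chunk["tenant_id"] != requesting_tenant]


def leaked_canaries(answer: str, canaries: Sequence[str]) -> List[str]:
    """Return planted canary strings that show up in the answer."""
    lowered = answer.lower()
    return [canary for canary in canaries if canary.lower() in lowered]


# Hypothetical retrieval result for a user from customer A.
chunks = [
    {"chunk_id": "a-1", "tenant_id": "customer-a"},
    {"chunk_id": "b-7", "tenant_id": "customer-b"},
]
leaks = cross_tenant_leaks("customer-a", chunks)

# Hypothetical canary planted in a document of customer B.
found = leaked_canaries("Your contract number is canary-b-9f3k.", ["CANARY-B-9F3K"])
```

Any non-empty result in either check is a failed must-pass case, regardless of how good the answer otherwise looks.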

If you use tools, agents or actions, also test tool misuse. Is the model allowed to make an API call based on incomplete information? Can a user use a prompt to start an export, email or change that is outside their permissions? System prompts and model instructions are not access control. Authorisation, filtering, logging and separation of data must be enforceable outside the model.
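
What "enforceable outside the model" can look like is sketched below: a check that runs in your own code before any tool call proposed by the model is executed. The role table, tool names and function name are hypothetical; a real system would take permissions from your existing identity and access management.

```python
from typing import Dict, Set

# Hypothetical role-to-tool mapping, maintained outside the prompt.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "support_agent": {"search_tickets"},
    "admin": {"search_tickets", "export_customers"},
}


class ToolCallDenied(Exception):
    """Raised when a proposed tool call is outside the user's permissions."""


def authorise_tool_call(user_role: str, user_tenant: str,
                        tool_name: str, target_tenant: str) -> None:
    """Check a model-proposed tool call against rules the model cannot change."""
    if tool_name not in ROLE_PERMISSIONS.get(user_role, set()):
        raise ToolCallDenied(f"role {user_role!r} may not call {tool_name!r}")
    if target_tenant != user_tenant:
        raise ToolCallDenied("tool call targets another tenant")


# The model proposes an export for a support agent: the check must refuse it.
try:
    authorise_tool_call("support_agent", "customer-a", "export_customers", "customer-a")
    export_allowed = True
except ToolCallDenied:
    export_allowed = False
```

A prompt injection can change what the model asks for, but not what this check allows.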

This directly touches on digital sovereignty. Control over location is useful, but not enough. You also want control over access, processing, logging, model behaviour and how test data is used.

Acceptance criteria

Four measurement areas for your buying decision

Define what is good enough for each area before you compare models or infrastructure.

Quality

Correctness, completeness, relevance and consistent refusals.

RAG

Context precision, context recall, groundedness and citation coverage.

Security

Prompt injection, jailbreaks, PII leakage and tenant isolation.

Performance

p95 and p99 latency, time to first token (TTFT), tokens per second, queue depth and cost per successful request.

Measure performance as users experience it

Average latency says little. Users mainly notice the slow cases, the queue under peak load and the time to first token. Therefore measure p95 and p99 latency, time to first token, tokens per second, total processing time and error rates. Measure not just with a single user, but at the concurrency you expect in production.
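
To see why the average hides the slow cases, the sketch below computes percentiles with the nearest-rank method. This is one of several common percentile definitions, and the latency values are made-up examples.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up request latencies in seconds: four fast requests and one slow one.
latencies_s = [0.2, 0.3, 0.25, 1.8, 0.4]

mean_s = sum(latencies_s) / len(latencies_s)  # about 0.59 s
p50_s = percentile(latencies_s, 50)           # 0.3 s
p95_s = percentile(latencies_s, 95)           # 1.8 s
```

Here the mean suggests a response in roughly 0.6 seconds, while p95 shows that some users wait 1.8 seconds. With so few samples the high percentiles are only indicative; a real load test needs many more requests.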

Private inference has its own performance questions. Does the model fit in VRAM? What happens when several users submit long context at the same time? How does the KV cache behave? How large does the queue become? Does output token throughput drop with longer answers? Which batch settings improve throughput, and which make the interaction too slow?

Also measure costs in a way that fits your usage. Token price or GPU hourly rate is only one part of the picture. For a practical comparison, you can look at cost per successful request: infrastructure cost per hour divided by the number of successful requests per hour within your SLO. Requests that fail, are too slow or need to be retried do not count as successful.
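
The calculation can be sketched as follows. It assumes you have one hour of request outcomes with a success flag, a latency and a retry count; the field names, the SLO value and the hourly cost are hypothetical.

```python
from typing import Dict, Sequence


def cost_per_successful_request(infra_cost_per_hour: float,
                                outcomes: Sequence[Dict],
                                slo_latency_s: float) -> float:
    """Hourly infrastructure cost divided by requests that succeeded within the SLO."""
    successful = sum(
        1 for o in outcomes
        if o["ok"] and o["latency_s"] <= slo_latency_s and o["retries"] == 0
    )
    if successful == 0:
        return float("inf")  # nothing usable was delivered in this hour
    return infra_cost_per_hour / successful


# Made-up outcomes for one hour on infrastructure that costs 12.00 per hour.
hour = [
    {"ok": True, "latency_s": 1.2, "retries": 0},   # counts
    {"ok": True, "latency_s": 4.5, "retries": 0},   # too slow for the SLO
    {"ok": False, "latency_s": 0.8, "retries": 0},  # failed
    {"ok": True, "latency_s": 1.0, "retries": 1},   # needed a retry
    {"ok": True, "latency_s": 0.9, "retries": 0},   # counts
]
cost = cost_per_successful_request(12.0, hour, slo_latency_s=3.0)
```

Five requests were served, but only two count as successful, so the cost per successful request is 6.00 rather than 2.40.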

This prevents a distorted picture. A cheap model that makes many errors or needs many retries can be more expensive in practice than a heavier model that more often gives a usable answer in one go.

Set thresholds before you compare

An evaluation only works if you decide in advance what is good enough. Otherwise the discussion shifts after every test. A disappointing security test suddenly becomes "acceptable for the pilot", or a slow p95 latency is dismissed because average latency looks good.

Define thresholds per category. For example: all must-pass privacy and security cases must pass. There must be no cross-tenant leakage. At least ninety percent of must-pass quality cases must be correct. Retrieval recall must exceed an agreed limit. P95 latency must fit the user experience, with stricter limits for interactive chat and more room for batch processing. Cost per successful request must fit the business case.
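
Thresholds like these can be written down as a small, versioned check that every run has to pass. The sketch below is an illustration only: apart from the ninety percent quality limit from the example above, all limit values and all names are hypothetical placeholders for your own numbers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Thresholds:
    min_must_pass_quality: float = 0.90  # "at least ninety percent"
    min_retrieval_recall: float = 0.85   # hypothetical agreed limit
    max_p95_latency_s: float = 3.0       # hypothetical, interactive chat
    max_cost_per_success: float = 0.05   # hypothetical, in your own currency


def failed_gates(summary: Dict[str, float], limits: Thresholds) -> List[str]:
    """Return the names of all thresholds that a run does not meet."""
    failures = []
    if summary["security_must_pass_failed"] > 0:
        failures.append("security and privacy must-pass cases")
    if summary["cross_tenant_leaks"] > 0:
        failures.append("cross-tenant leakage")
    if summary["must_pass_quality_rate"] < limits.min_must_pass_quality:
        failures.append("must-pass quality")
    if summary["retrieval_recall"] <= limits.min_retrieval_recall:
        failures.append("retrieval recall")
    if summary["p95_latency_s"] > limits.max_p95_latency_s:
        failures.append("p95 latency")
    if summary["cost_per_success"] > limits.max_cost_per_success:
        failures.append("cost per successful request")
    return failures


# Made-up summary of one run.
run_summary = {
    "security_must_pass_failed": 0,
    "cross_tenant_leaks": 0,
    "must_pass_quality_rate": 0.94,
    "retrieval_recall": 0.80,
    "p95_latency_s": 2.1,
    "cost_per_success": 0.03,
}
failures = failed_gates(run_summary, Thresholds())
```

This run fails on retrieval recall only. Because the limits were fixed before the run, the result is a fact to act on rather than a point to negotiate.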

Those limits do not need to be the same for every organisation. An internal assistant that drafts text has different requirements from a customer-facing support bot or a system that summarises legal documents. The point is to choose the limit upfront and use it to compare models, prompts, RAG settings and infrastructure fairly.

Use your evaluation set as a regression test

An evaluation set is useful not only for purchasing or a PoC. It becomes more valuable once you keep using it. Every change can cause regression: a new model, another prompt, adjusted chunking, a new embedding model, a reranker, different quantization, a new serving backend or different hardware.

Run the same eval harness again for every relevant change. Compare not only the total score, but also which cases improve and which get worse. Sometimes a model update produces better general answers, but worse refusals for sensitive data. Sometimes quantization lowers costs, but the model loses just the difficult Dutch-language cases. Sometimes a faster backend improves p50 latency while p99 gets worse under peak load.
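
A case-level comparison between two runs needs little more than the pass or fail result per case ID. The sketch below assumes each run is available as a mapping from case ID to a boolean; the function name and the case IDs are hypothetical.

```python
from typing import Dict, List


def compare_runs(baseline: Dict[str, bool],
                 candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """List which cases regressed, improved or went missing between two runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        "improved": sorted(c for c in shared if not baseline[c] and candidate[c]),
        "missing": sorted(baseline.keys() - candidate.keys()),
    }


# Made-up results: both runs pass two out of three cases.
baseline_run = {"general-01": True, "general-02": False, "refusal-01": True}
candidate_run = {"general-01": True, "general-02": True, "refusal-01": False}

diff = compare_runs(baseline_run, candidate_run)
```

Both runs have the same total score, yet the candidate now fails a refusal case that the baseline passed. That difference only becomes visible at case level.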

By making evals part of operations, you prevent quality from slowly drifting. You are not only building a purchasing test, but also a way to change your AI application in a controlled way.

What this means for infrastructure

Only when workload, data, test set and acceptance criteria are clear can you make a meaningful infrastructure choice. A public AI service may turn out to be enough. A hybrid setup may fit better. Or private inference on dedicated or private cloud infrastructure may be needed because of data, compliance, latency or predictable costs.

For AI infrastructure, the order matters. Do not start with which GPU you need. Start with which tasks must pass, which data is processed, which risks you do not accept and which performance users need. Only then can you make a grounded assessment of model size, serving stack, capacity, monitoring and operations.

cloud.nl can help design AI infrastructure once those points are concrete. Not based on a demo, but based on your workload, dataset, security requirements, latency targets and cost model. That makes the choice less dependent on opinion and easier to test.