My best thoughts on Evals systems

No silver bullets

I’ve now built two evals systems for major clients. Here’s my current best understanding of what these systems should include.

1) Start with Evals

Your Evals are your AI test suite.

They’re a foundational AI engineering skill.

They’re a good place to start.

2) Dataspec

Choose ONE domain expert who isn’t a coder.

They will be responsible for tuning these evals.

Create a list of discrete, focused, specific questions about a specific customer or use case, using real data.

The domain expert creates their best answers to these questions.

Manually run the evals on the production AI system.

The domain expert marks the evals as pass or fail, with an explanation.
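As a rough illustration of what one entry in this dataspec could look like, here is a minimal sketch. The `EvalCase` fields and the `record_review` and `to_jsonl` helpers are hypothetical names chosen for this example, not part of any particular tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvalCase:
    """One question in the dataspec, plus the expert's review of the system."""
    question: str
    expert_answer: str
    system_answer: Optional[str] = None
    expert_pass: Optional[bool] = None  # None until the expert has reviewed it
    expert_explanation: str = ""


def record_review(case: EvalCase, system_answer: str,
                  passed: bool, explanation: str) -> EvalCase:
    """Store the expert's pass/fail verdict; an explanation is mandatory."""
    if not explanation.strip():
        raise ValueError("every verdict needs an explanation")
    case.system_answer = system_answer
    case.expert_pass = passed
    case.expert_explanation = explanation
    return case


def to_jsonl(cases: List[EvalCase]) -> str:
    """Serialize cases as JSON Lines so they can be kept in version control."""
    return "\n".join(json.dumps(asdict(c)) for c in cases)
```

Keeping the explanation mandatory means every pass or fail can later be compared against the judge's reasoning, not just its verdict.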

3) LLM-as-judge

Build a simple LLM-as-judge system.

It should be possible to run the entire suite with one click, and it should be possible for the suite to pass.

The rubrics are prompts that will need tuning. One per Eval.

They should include multiple Good and Bad examples for each expectation.

Each expectation should have a binary pass/fail score (use 0 and 100 as the options if you like, for future flexibility). More complexity than this (e.g. a percentage score) is too fine-grained at the beginning, when you need clear results to work with.
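A minimal sketch of one rubric with Good and Bad examples and a binary 0/100 score might look like this. The prompt wording, the `Expectation` class, and the `call_llm` parameter are assumptions for illustration; `call_llm` stands in for whatever client you use to send a prompt and get text back.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Expectation:
    description: str
    good_examples: List[str] = field(default_factory=list)
    bad_examples: List[str] = field(default_factory=list)


def build_rubric_prompt(expectation: Expectation, answer: str) -> str:
    """Assemble the judge prompt: expectation, examples, then the answer."""
    lines = [
        "You are grading one answer against one expectation.",
        f"Expectation: {expectation.description}",
        "Good examples (these should PASS):",
    ]
    lines += [f"- {ex}" for ex in expectation.good_examples]
    lines.append("Bad examples (these should FAIL):")
    lines += [f"- {ex}" for ex in expectation.bad_examples]
    lines.append(f"Answer to grade: {answer}")
    lines.append("Reply with PASS or FAIL on the first line, "
                 "then a one-sentence explanation.")
    return "\n".join(lines)


def parse_score(reply: str) -> int:
    """Map the judge's reply to 100 (pass) or 0 (fail); reject anything else."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first.startswith("PASS"):
        return 100
    if first.startswith("FAIL"):
        return 0
    raise ValueError(f"judge reply was not PASS or FAIL: {reply!r}")


def judge(expectation: Expectation, answer: str,
          call_llm: Callable[[str], str]) -> int:
    """Score one answer; call_llm is a hypothetical prompt-in, text-out function."""
    return parse_score(call_llm(build_rubric_prompt(expectation, answer)))
```

Raising on an unparseable reply, rather than guessing, keeps the results binary and makes a badly behaved rubric obvious.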

You need a fast feedback loop: run the suite, get results, make a change, and run again in less than 10 seconds.
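One way to keep the loop fast is to run the evals concurrently and report the elapsed time with the results. This is a sketch under the assumption that each eval is a zero-argument callable returning 0 or 100; `run_suite` is a hypothetical name, and whether threads are enough depends on your judge client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_suite(evals: Dict[str, Callable[[], int]],
              max_workers: int = 8) -> Dict[str, Any]:
    """Run every eval in parallel and report scores, overall pass, and timing."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the calls overlap, then collect results.
        futures = {name: pool.submit(fn) for name, fn in evals.items()}
        scores = {name: fut.result() for name, fut in futures.items()}
    return {
        "scores": scores,
        "passed": all(score == 100 for score in scores.values()),
        "seconds": time.perf_counter() - start,
    }
```

Printing `seconds` on every run makes it easy to notice when the suite drifts past the 10-second target.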

4) Build the data loop

The domain expert compares the answers and explanations from the system to their own.

Add good and bad examples to the rubric prompts and rerun until the LLM-as-judge's verdicts match the domain expert's own judgement.
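To see how close the judge is to the domain expert, you can compare their verdicts eval by eval. A minimal sketch, assuming both sets of verdicts are stored as dictionaries from eval name to pass/fail (the `agreement` function is a hypothetical helper):

```python
from typing import Dict, List, Tuple


def agreement(expert: Dict[str, bool],
              judge: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Return the share of matching verdicts and the evals that disagree."""
    shared = sorted(expert.keys() & judge.keys())
    disagreements = [name for name in shared if expert[name] != judge[name]]
    if not shared:
        return 0.0, []
    return 1 - len(disagreements) / len(shared), disagreements
```

The list of disagreements is the useful part: each one points at a rubric that needs another Good or Bad example.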

The domain expert can also tweak their own explanations as they get to know the LLM.

5) Process

Build a daily or weekly process where the Evals are run and the domain expert reviews them and improves the prompts, so the results get closer to what they should be.

Have a Slack channel where people share an AI problem they came across and how to reproduce it.

Each of these problems should become either an improvement to an existing eval, or a new eval.

Evals are a process of continual learning and improvement, not a silver bullet.

Resources