How to fix the 2 biggest Evals mistakes
(From a top Google researcher)
Intro
I’m Alex. I created ruby-openai, and I now build LLM Evals systems for VC-funded startups like DoubleLoop and Tropic. In this post I’ll share two common Evals mistakes and how to fix them, drawn from a podcast interview with AI researcher Shreya Shankar (Google, UC Berkeley, etc.).
(PS - hey existing subscribers to ‘FirstWeSell’ - I’m pivoting this newsletter to focus on making AI more reliable - please do unsubscribe if this is not useful to you!)
1) You Need A Data Flywheel
‘Traditional’ machine learning has been around for a long time, but LLMs becoming cheap and good enough has opened up productionized AI to perhaps 1000x more developers than a few years ago. Amid the flood of hype and money, there is no clearly accepted set of practices among the influx of newbies.
A phrase I liked from the episode was Shankar’s ‘Data Flywheel’, the idea that in production LLM products you need an ongoing feedback loop to continually improve the system.
This tallies with my experience as a startup web engineer: real value is created where the rubber hits the road, where we build systems of humans + machines that work effectively and have a mechanism for ongoing stability and improvement. Software is a living thing that needs to be continually improved, or it dies.
As Shankar put it on the podcast: “In some way, you need to be able to label those efficiently, correlate that against your human judgment, and then feed that back in to improve your base prompt with either a few short examples or some demonstrations. Thinking about the dimensions in which you want to improve your prompt, and coming up with good metrics to evaluate: I think that’s the process we start thinking about when we want to create such a flywheel.”
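To make that concrete, here is a minimal sketch of one turn of the flywheel, assuming an in-memory list and hypothetical helper names (log_trace, label_trace, build_prompt_with_examples). In a real system you would persist traces and labels in a database and rebuild the prompt on a schedule, but the loop is the same: log production outputs, label them against human judgment, and feed the labelled examples back into the base prompt as few-shot demonstrations.

```python
# Minimal data-flywheel sketch (hypothetical names, in-memory storage).
# Flow: log production traces -> collect human labels -> fold the best
# labelled examples back into the base prompt as few-shot demonstrations.

from dataclasses import dataclass


@dataclass
class Trace:
    input: str
    output: str
    human_label: str | None = None  # e.g. "good" / "bad", filled in later


BASE_PROMPT = "Summarise the customer email below in a professional tone."

traces: list[Trace] = []


def log_trace(input_text: str, output_text: str) -> Trace:
    """Record every production input/output pair for later labelling."""
    trace = Trace(input=input_text, output=output_text)
    traces.append(trace)
    return trace


def label_trace(trace: Trace, label: str) -> None:
    """Attach a human judgment ("good" or "bad") to a logged trace."""
    trace.human_label = label


def build_prompt_with_examples(max_examples: int = 3) -> str:
    """Rebuild the base prompt with labelled traces as few-shot demos."""
    good = [t for t in traces if t.human_label == "good"][:max_examples]
    bad = [t for t in traces if t.human_label == "bad"][:max_examples]
    sections = [BASE_PROMPT, "\nExamples of good outputs:"]
    sections += [f"- Input: {t.input}\n  Output: {t.output}" for t in good]
    sections.append("\nExamples of outputs to avoid:")
    sections += [f"- Input: {t.input}\n  Output: {t.output}" for t in bad]
    return "\n".join(sections)
```

Each pass through log, label, and rebuild is one turn of the flywheel; the improved prompt then produces new traces to label, and the cycle continues.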
2) LLM Judges Need Examples
It’s not enough to just ask an LLM to evaluate your prompt inputs/outputs. Since LLMs are trained on such wide-ranging data, and since use cases vary so much even within industries, you must include good and bad examples in every evals prompt. This is a really valuable tactical point and I’ll be using it going forward.
Shankar again: “Something that is extremely successful for LLM-as-judge, which papers are not writing about yet because it’s so new (I think it’s tied to the release of GPT-4o), is providing good few-shot examples. So if I have a metric that evaluates whether the tone is professional, I put in examples that are professional in tone, put in examples that are unprofessional in tone, and ship it. This validator is very aligned with what you might think is professional, much more so than if you didn’t have examples.

I think an anti-pattern is precisely the opposite of that: people will ship off calls to LLMs to be a judge with some prompt, and there’s absolutely no specification of what ‘professional’ means to them or what ‘good’ means to them. It doesn’t even have to be subjective.”
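Here is a small sketch of what that looks like in practice, using the OpenAI Python SDK. The model name and the example texts are placeholders I’ve made up for illustration; in a real eval you would pull the good and bad examples from your own labelled traces, which is exactly where the flywheel above feeds in.

```python
# Sketch of an LLM judge for "professional tone" that includes
# labelled few-shot examples in the judge prompt.
# Assumes the OpenAI Python SDK (v1.x) and OPENAI_API_KEY in the environment;
# the example texts are placeholders - use real labelled traces instead.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating whether a reply is written in a professional tone.

Examples of PROFESSIONAL tone:
- "Thank you for flagging this. We have reproduced the issue and will ship a fix by Friday."
- "I appreciate your patience; here is the updated invoice you requested."

Examples of UNPROFESSIONAL tone:
- "lol yeah that's broken, not our problem tbh"
- "Whatever, just read the docs."

Answer with exactly one word, PASS or FAIL, for the reply below.
"""


def judge_professional_tone(reply: str) -> bool:
    """Return True if the judge considers the reply professional."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever judge model you prefer
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": reply},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")


print(judge_professional_tone("Thanks for reaching out; I'll follow up with details tomorrow."))
```

The few-shot examples in the judge prompt are the whole point: without them, ‘professional’ means whatever the model’s training data says it means, not what it means to you.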
Thanks for reading! Please share this post on your socials if you found it useful.