Start with Evals!
(In a fast-feedback process architecture)
Intro
These are my notes from listening to a podcast interview with Sierra’s Arya Asemanfar.
1) You Need An Architecture
The human processes matter more than the technical implementation of evals.
And so ultimately, I think what ends up happening when you try to build an agent using AI that's going to work at scale is that you have to create an architecture around it. This is probably true even for non-agents. If you're going to stand up a business, technically you can set up a lemonade stand and have a business going. But to have that actually be efficient, scale, and make sense, you've got to create processes, and you've got to figure out how to have this be repeatable and profitable and all that.
2) You need a feedback loop
A good architecture will allow for as rapid iteration as possible.
So I think if you can't experiment, the learning loop ends up being really slow. You can't push to production and then get the data back. It's got to be a fast loop.
It's not just about the speed of delivery; a faster loop actually accelerates the pace of learning. And by learning, I mean learning as a human, not machine learning. If you learn more, you'll just be able to get further.
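One way to make that loop concrete is to keep a small eval suite you can run on every change, so learning doesn't wait on a production deploy. A minimal sketch, where `run_agent` and the eval cases are hypothetical stand-ins for your real agent and checks:

```python
# Minimal sketch of a fast eval loop. `run_agent` is a hypothetical
# stand-in for a real agent call; the cases are illustrative only.

def run_agent(prompt: str) -> str:
    # Replace with your actual agent; this stub always talks refunds.
    return "Your refund of $20 has been issued."

EVAL_CASES = [
    # (input, predicate over the agent's output)
    ("Customer asks for a refund on order #123",
     lambda out: "refund" in out.lower()),
    ("Customer asks for the store's opening hours",
     lambda out: "refund" not in out.lower()),
]

def run_evals() -> float:
    """Run every case and return the pass rate."""
    passed = sum(check(run_agent(inp)) for inp, check in EVAL_CASES)
    return passed / len(EVAL_CASES)
```

Because the suite runs in seconds rather than waiting for production data, every prompt or code change gets immediate feedback.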
3) Skinny agents, fat foundation models
Small focused agents work better than mega-prompts.
So I think figuring out how to decompose what you're trying to do into a reliable set of steps that LLMs are good at today, and then composing those steps so that together they reliably produce the custom interaction or product behaviour you want, ends up being a much more effective process than trying to prompt-engineer one giant prompt into doing what you want.
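The decomposition idea can be sketched as a pipeline of small, focused steps composed into one agent. The step names and routing logic below are purely illustrative; in practice each step would be its own narrow LLM call with its own eval suite:

```python
# Sketch of "skinny agents": small, composable steps instead of one
# giant prompt. All functions here are hypothetical stand-ins.

def classify_intent(message: str) -> str:
    # In practice: a small, focused LLM call, evaluated on its own.
    return "refund" if "refund" in message.lower() else "other"

def draft_refund_reply(message: str) -> str:
    # Another narrow step with a single job.
    return "I've started your refund; you'll see it in 3-5 days."

def draft_generic_reply(message: str) -> str:
    return "Thanks for reaching out! How can I help?"

def agent(message: str) -> str:
    """Compose the focused steps into the full product behaviour."""
    intent = classify_intent(message)
    if intent == "refund":
        return draft_refund_reply(message)
    return draft_generic_reply(message)
```

Each step can be tested, evaluated, and swapped independently, which is much harder when everything lives in one mega-prompt.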
4) Evals should be specific
This is something I’ve seen come up a couple of times now: make your evals focus on as specific a real-world use case as possible. Don’t try to make them generalised.
I think it's too easy to get in your head too much about what the right solution to a particular problem is in the abstract. Keeping it hyper-focused on one customer, or one concrete example of how something should go, helps you build a foundation from which you can develop a more informed, intuitive mental model, and that is super powerful. Because the space is so new, we need to build our intuition through concrete experience.
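A hyper-specific eval can literally be one concrete customer scenario checked end to end, rather than an abstract quality score. A sketch, where the scenario, the policy, and the checks are all hypothetical:

```python
# Sketch of a hyper-specific eval: one concrete scenario, checked
# end to end. Scenario content and policy are illustrative only.

SCENARIO = {
    "input": "Hi, my order #4521 arrived broken. I'd like a replacement.",
    "must_mention": ["replacement", "4521"],
    # Hypothetical policy: offer a replacement before mentioning refunds.
    "must_not_mention": ["refund"],
}

def eval_scenario(output: str, scenario: dict) -> bool:
    """Pass only if the output honours this one concrete case."""
    text = output.lower()
    ok = all(word.lower() in text for word in scenario["must_mention"])
    ok = ok and all(word.lower() not in text
                    for word in scenario["must_not_mention"])
    return ok
```

A handful of these concrete scenarios builds intuition far faster than a single generalised metric.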
5) Start with Evals!
Evals are the TDD of AI. Vibe checks are not good enough for critical production systems.
Yeah, I think evals is actually a good one. We built evals internally, but we built them later than we should have. Going back to what I was saying earlier about how fast you learn, I think we would have learned faster if we had evals earlier. So in hindsight that's probably a mistake, something I would have done differently.
Yeah, the test-driven-development analogy is emerging more and more for people building in this space: start with your evals and build from there.
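The TDD analogy maps directly onto code: write the eval before the agent exists, watch it fail, then iterate until it passes. A minimal sketch with hypothetical agent versions:

```python
# Sketch of "evals as TDD": the eval is written first and defines
# "done". All names here are hypothetical.

def eval_greeting(agent) -> bool:
    """Written before any agent exists; this defines success."""
    out = agent("Hello!")
    return isinstance(out, str) and len(out) > 0 and "hello" in out.lower()

def agent_v0(message: str) -> str:
    # First version: fails the eval, like a failing test in TDD.
    return ""

def agent_v1(message: str) -> str:
    # Iterated version: passes the eval.
    return "Hello! How can I help you today?"
```

As in TDD, the failing eval comes first and each iteration of the agent is judged against it, rather than against a vibe check after the fact.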