Evals for Orchestration, Not Just Code Generation
The absence of an eval is the absence of engineering.
That line stings. But if you're building agentic workflows, it's the line that matters.
The benchmark trap
You run a coding benchmark. The model scores 87%. You ship it. Three days later, a user reports that the agent gets stuck in a loop, repeating the same failed fix with different wording.
The benchmark didn't catch this. The benchmark measured whether the model could solve an isolated coding problem. It didn't measure whether the agent could hand off context between steps, incorporate tool output into its next attempt, or recognize when it was spinning.
Most AI coding evals measure the wrong thing. They ask: "Can this model write code?" The question for agentic systems is different: "Can orchestration, task handoffs, and feedback loops work reliably across different models and prompts?"
What orchestration evals actually measure
Code-generation benchmarks test a single pass: prompt in, code out, check correctness. Orchestration evals test the workflow mechanics:
- Task handoffs: When a parent agent delegates to a subtask, does the context transfer correctly? Does the subtask know what it's supposed to do?
- Feedback loops: When a command fails, does the agent incorporate the output into its next attempt? Or does it suggest the same fix with higher confidence?
- Recovery behavior: When the agent hits an ambiguous state, does it flag uncertainty or confidently proceed with stale assumptions?
These are the failure modes that matter in production. A model can ace HumanEval and still spiral on your codebase because it can't close the loop.
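Here's a minimal sketch of what checks for those three failure modes might look like. The field names (subtask prompt, proposed fix, tool output) are assumptions about what your agent framework logs, not any specific tool's API:

```python
# Minimal sketch of checks for the three failure modes above.
# The inputs are hypothetical fields -- use whatever your agent framework actually logs.

def handoff_preserved(subtask_prompt: str, required_facts: list[str]) -> bool:
    """Task handoff: did the delegated subtask receive the context it needs?"""
    return all(fact in subtask_prompt for fact in required_facts)

def feedback_incorporated(prev_tool_output: str, next_prompt: str) -> bool:
    """Feedback loop: does the next attempt reference the failure it just saw?"""
    stripped = prev_tool_output.strip()
    last_line = stripped.splitlines()[-1] if stripped else ""
    return bool(last_line) and last_line in next_prompt

def is_spiraling(proposals: list[str]) -> bool:
    """Recovery: the same fix repeated (ignoring whitespace) counts as spinning."""
    normalized = [" ".join(p.split()) for p in proposals]
    return len(set(normalized)) < len(normalized)
```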
"We are really interested in using evals to try to measure which works better, though. That is: we're trying to apply evals to these sort of unique tools instead of just, 'Can it solve a coding problem?'"
Harris
The hard part
Orchestration evals are harder to build than code-generation benchmarks. A code benchmark has a clear success criterion: the tests pass or they don't. An orchestration eval has to measure process, not just output.
You need to instrument the workflow: how many loops before convergence? Did the agent use tool output or ignore it? Did the handoff preserve the relevant context or drop it?
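One way to get that instrumentation is to wrap the agent loop so every iteration is recorded before you try to score anything. The sketch below assumes hypothetical `propose_fix` and `run_tests` callables standing in for your own agent and tool calls:

```python
# Hypothetical instrumentation: record every iteration of a propose -> run -> read-output loop.
# `propose_fix` and `run_tests` are placeholders for your own agent and tool calls.
import json
import os
import time

def instrumented_loop(task, propose_fix, run_tests, max_iterations=10):
    trace = []
    for i in range(max_iterations):
        proposal = propose_fix(task, trace)      # agent sees its prior attempts and their output
        tool_output = run_tests(proposal)        # what actually happened when it ran
        trace.append({
            "iteration": i,
            "proposal": proposal,
            "tool_output": tool_output,
            "timestamp": time.time(),
        })
        if "FAILED" not in tool_output:          # crude convergence check -- swap in your own
            break
    os.makedirs("traces", exist_ok=True)
    with open(os.path.join("traces", f"{task['id']}.json"), "w") as f:
        json.dump(trace, f, indent=2)            # keep the trace so you can score it later
    return trace
```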
This requires scaffolding that most teams skip. It's easier to run a public benchmark and call it a day. But public benchmarks measure capability in isolation. They don't measure whether your agent workflow actually works.
"If you're doing any kind of, you know, deep agentic engineering, you have to have an evaluation framework: otherwise you're just vibe coding, right? So the absence of an eval is the absence of engineering."
Rob
The tradeoff
Building orchestration evals takes time you could spend shipping features. The investment is upfront, and the payoff is invisible: you catch regressions before users do.
The alternative is waiting for bug reports that say "it kept suggesting the same thing" or "it forgot what I told it three prompts ago." Those reports are your evals, just slower and more expensive.
Why this matters for your workflow
If you're switching between models or experimenting with different prompting strategies, you need a way to compare them that isn't vibes. Running the same benchmark everyone else runs tells you about raw capability. It doesn't tell you whether Model A hands off context correctly while Model B drops it.
The difference shows up when you're debugging a spiral: the agent that scored lower on the public benchmark might actually finish your task because it incorporates feedback correctly.
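A comparison harness can be small. This sketch assumes a hypothetical `run_task` that drives your agent with a given model and returns per-run metrics; the keys below are placeholders for whatever your own eval produces:

```python
# Sketch: run the same task suite against multiple providers and compare
# orchestration signals instead of a single pass rate. `run_task` is a
# placeholder for your harness; the metric keys are assumptions.
from statistics import mean

def compare_models(models: list[str], tasks: list[dict], run_task) -> dict:
    report = {}
    for model in models:
        runs = [run_task(model, task) for task in tasks]
        report[model] = {
            "finish_rate": mean(r["converged"] for r in runs),
            "avg_iterations": mean(r["iterations"] for r in runs),
            "spiral_rate": mean(r["repeated_a_fix"] for r in runs),
        }
    return report
```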
How Roo Code closes the loop on orchestration
Roo Code is built to close the loop: the agent proposes a diff, runs commands or tests, reads the output, and iterates based on what actually happened. This is the exact workflow that orchestration evals measure.
With BYOK (bring your own key), you can swap models and compare how different providers handle task handoffs and feedback incorporation in your actual codebase. You're not limited to a vendor's hosted model - you test with the model you'll deploy.
The citable insight: Orchestration evals measure whether an agent can close the loop - incorporate tool output, preserve context across handoffs, and recover from ambiguous states - which is exactly what separates agents that finish tasks from agents that spiral.
Code generation benchmarks vs. orchestration evals
| Dimension | Code generation benchmarks | Orchestration evals |
|---|---|---|
| What they measure | Single-pass code correctness | Workflow mechanics across multiple steps |
| Success criterion | Tests pass or fail | Context preserved, feedback incorporated, convergence achieved |
| Failure modes caught | Syntax errors, logic bugs | Spiraling, dropped context, ignored tool output |
| Scope | Isolated problems | End-to-end task completion |
| Build effort | Use public datasets | Custom instrumentation for your workflow |
The first step
Start by instrumenting one workflow. Pick a task that currently spirals sometimes. Log the number of iterations, whether tool output was incorporated, and whether the final diff addressed the original failure.
That's your first orchestration eval. It's specific to your codebase and your workflow. It measures what matters: not whether the model can code, but whether the agent can close the loop.
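Concretely, scoring one recorded run might look like the sketch below. It assumes a trace shaped like the instrumentation example earlier and uses crude heuristics for "incorporated" and "converged" - adjust both to your workflow:

```python
# Scoring one recorded run: the three signals from the paragraph above.
# `trace` is the list of iteration records written by the instrumentation
# sketch earlier; `original_error` is the failure that kicked off the task.
def eval_one_run(trace: list[dict], original_error: str, max_iterations: int = 10) -> dict:
    proposals = [" ".join(step["proposal"].split()) for step in trace]
    # Crude proxy for "used the tool output": the attempt changed after a failure.
    responded_to_feedback = [
        " ".join(curr["proposal"].split()) != " ".join(prev["proposal"].split())
        for prev, curr in zip(trace, trace[1:])
        if prev["tool_output"].strip()
    ]
    return {
        "iterations": len(trace),
        "within_budget": len(trace) <= max_iterations,
        "repeated_a_fix": len(set(proposals)) < len(proposals),
        "incorporated_tool_output": all(responded_to_feedback) if responded_to_feedback else None,
        # Crude convergence check: the original failure no longer appears in the final output.
        "converged": original_error not in trace[-1]["tool_output"],
    }
```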
Stop being the human glue between PRs
Cloud Agents review code, catch issues, and suggest fixes before you open the diff. You review the results, not the process.