Score Agents Like Employees, Not Like Models
You're grading your AI agent on the wrong rubric.
Code correctness tells you if the output compiles. It tells you nothing about whether the agent will drift, ignore context, or go silent when it hits a wall.
The benchmark trap
Your agent passes the coding benchmark. It writes syntactically correct code. It handles the toy problem in the eval suite.
Then you put it on a real task: refactor this authentication module, follow our patterns, don't break the existing tests.
It writes code that compiles. It also ignores half the context you gave it, doesn't tell you when it's stuck, and makes changes you didn't ask for while missing the ones you did.
The benchmark said it was capable. Production said otherwise.
The rubric shift
OpenAI's applied team grades their coding agents differently. They treat the agent like an employee, not like a model.
"If you design your coding evals like you would a software engineer performance review, then you can measure their ability in the same ways as you can measure somebody who's coding."
Brian Fioca, OpenAI
The rubric they use for GPT-5 development:
- Proactivity: Does it carry the whole task through on its own, or does it stop and wait when it could keep moving?
- Context management: Can it keep all of the context it needs in memory without getting lost?
- Communication: Does it tell you its plan before executing? Does it surface when it's stuck?
- Testing: Does it validate its own work, or does it hand you untested code?
These aren't code quality metrics. They're work style metrics. The difference matters.
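One way to make these dimensions concrete is to record them as a per-task grade. Here's a minimal sketch, assuming a 1-5 scale and field names of my own choosing (the rubric names the dimensions, not a storage format):

```python
from dataclasses import dataclass, asdict

@dataclass
class WorkStyleGrade:
    """One grader's scores for a single agent run, on a 1-5 scale."""
    task_id: str
    proactivity: int         # kept moving through sub-tasks vs. stalled
    context_management: int  # held requirements across the whole change vs. drifted
    communication: int       # stated its plan and flagged when stuck
    testing: int             # validated its own work before handing it back
    notes: str = ""          # free-form observations, like a review comment

grade = WorkStyleGrade(
    task_id="auth-refactor-01",
    proactivity=4,
    context_management=2,   # lost track of "don't break the existing tests"
    communication=3,
    testing=2,
    notes="Compiled, but silently dropped two of the stated constraints.",
)
print(asdict(grade))
```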
Why correctness evals miss the failure modes
A code correctness eval asks: "Did the output match the expected output?"
A work style eval asks: "How did it get there, and what would happen if the task were harder?"
An agent that scores high on correctness but low on communication will confidently produce wrong code without flagging uncertainty. An agent that scores low on context management will lose track of requirements halfway through a multi-file change. An agent that scores low on proactivity will stop and wait for you to hold its hand on every sub-task.
These failure modes don't show up in benchmarks. They show up at 2am when you realize the agent silently ignored half your instructions.
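To see why the two questions diverge, compare what each check actually inspects. A hedged sketch with hypothetical transcript fields: the same run passes a correctness assert and fails a work-style review.

```python
# A hypothetical agent run: the output is right, the behavior is not.
run = {
    "output_tests_pass": True,           # what a correctness eval checks
    "stated_plan_before_editing": False, # what a work-style eval checks
    "flagged_uncertainty": False,
    "requirements_addressed": 3,
    "requirements_total": 6,
}

# Correctness eval: pass/fail on the final artifact.
correctness_pass = run["output_tests_pass"]

# Work-style eval: how it got there, and what that predicts for harder tasks.
work_style_pass = (
    run["stated_plan_before_editing"]
    and run["flagged_uncertainty"]
    and run["requirements_addressed"] == run["requirements_total"]
)

print(correctness_pass)  # True  -- the benchmark says "capable"
print(work_style_pass)   # False -- half the instructions were silently ignored
```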
Benchmark approach vs. work style approach
| Dimension | Benchmark Approach | Work Style Approach |
|---|---|---|
| What it measures | Code correctness on isolated tasks | Behavior patterns across complex workflows |
| Failure modes caught | Syntax errors, wrong outputs | Drift, context loss, silent failures |
| Task realism | Toy problems, synthetic evals | Multi-file changes, production patterns |
| Feedback loop | Pass/fail on expected output | Grades on proactivity, communication, testing |
| Production readiness signal | "It can write code" | "It can work reliably on your team" |
The prompt is the job description
The framing shift is simple: your prompt is a job description.
"You're giving it a job description. That's your prompt."
Brian Fioca, OpenAI
If you hired an engineer and gave them vague instructions, you'd expect vague output. The same applies here. But beyond prompt quality, you need to know how the agent performs when the instructions are clear and the task is complex.
That's what work style evals measure.
How to build the rubric
The approach: human-grade first, then tune an LLM-as-a-judge until it matches your scoring.
- Run a set of realistic tasks (not toy problems)
- Have humans grade the agent's work on proactivity, context management, communication, and testing
- Build an LLM-as-a-judge that attempts to replicate the human grades
- Iterate until the automated judge correlates with human judgment
- Use the automated judge for scale; spot-check with humans
The tradeoff: this takes more upfront work than a correctness benchmark. The payoff is catching failure modes before they hit production.
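To make the calibration step concrete, here's a minimal sketch assuming you already have human grades on file and some `judge_with_llm` function of your own that prompts a model with the rubric and the agent's transcript. The function name, score format, and example numbers are placeholders, not a prescribed setup.

```python
import statistics  # statistics.correlation requires Python 3.10+

DIMENSIONS = ["proactivity", "context_management", "communication", "testing"]

def judge_with_llm(transcript: str) -> dict:
    """Placeholder: prompt your model of choice with the rubric and the agent's
    full transcript, then parse 1-5 scores for each dimension. The provider,
    prompt wording, and output parsing are up to you."""
    raise NotImplementedError

def calibration_report(human_grades: list[dict], judge_grades: list[dict]) -> dict:
    """Pearson correlation between human and automated scores, per dimension.
    Iterate on the judge prompt until every dimension clears your bar."""
    report = {}
    for dim in DIMENSIONS:
        human = [g[dim] for g in human_grades]
        judge = [g[dim] for g in judge_grades]
        report[dim] = statistics.correlation(human, judge)  # Pearson's r
    return report

# Made-up scores for three tasks, just to show the shape of the check:
humans = [
    {"proactivity": 4, "context_management": 2, "communication": 3, "testing": 2},
    {"proactivity": 5, "context_management": 4, "communication": 4, "testing": 5},
    {"proactivity": 2, "context_management": 3, "communication": 1, "testing": 2},
]
judged = [
    {"proactivity": 4, "context_management": 3, "communication": 3, "testing": 2},
    {"proactivity": 5, "context_management": 4, "communication": 5, "testing": 4},
    {"proactivity": 3, "context_management": 2, "communication": 1, "testing": 2},
]
print(calibration_report(humans, judged))
```

Once the correlation holds across all four dimensions, the automated judge carries the volume and humans drop back to spot-checks.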
How Roo Code closes the loop on agent reliability
Roo Code is an AI coding agent that closes the loop: it proposes diffs, runs commands and tests, and iterates based on results. This maps directly to the rubric:
- Proactivity: Roo Code continues working through sub-tasks without waiting for hand-holding on each step
- Context management: The agent maintains context across multi-file changes within your VS Code workspace
- Communication: You see the plan before execution through the diff-and-approve workflow
- Testing: Roo Code can run your test suite and iterate on failures automatically
With BYOK (bring your own key), you control which model powers the agent while Roo Code handles the work style layer that makes agents production-ready.
Why this matters for your team
For a Series A-C team with five engineers, agent reliability is a force multiplier. If your agent drifts or goes silent on complex tasks, someone has to babysit it. That someone is an engineer who could be shipping.
Work style evals surface these problems before you've built workflows around an agent that can't handle the job. You find out in the eval, not in the incident postmortem.
The rubric: proactivity, context management, communication, testing. Grade your agent like you'd grade a junior engineer on a trial period.
If it can't tell you its plan, it's not ready for production.
Stop being the human glue between PRs
Cloud Agents review code, catch issues, and suggest fixes before you open the diff. You review the results, not the process.