Model Recovery Beats Raw Success Rate for Production Workflows
The model fails a tool call. Then fails the same way again. And again.
Three attempts. Same error. Zero learning.
The loop problem
You're watching a model debug a test failure. It tries to read a file that doesn't exist. The call fails. Instead of trying a different path or asking for clarification, it tries the exact same call again.
And again.
And again.
The context is poisoned. The failed tool call became the new pattern, and now the model is committed to beating its head against the wall until you kill the task or run out of tokens.
This is the difference between a model with a 95% success rate that loops forever on failures, and a model with a 90% success rate that recovers and tries a different approach. The second model is more useful for production workflows.
What makes a model recoverable
The problem isn't occasional failures. Every model fails sometimes. The problem is what happens next.
A recoverable model treats failed tool calls as information, not as a template. If a file read fails, it looks for alternative paths. If a command errors out, it parses the error message and adjusts. The failed attempt gets purged from the pattern-matching context instead of becoming the new baseline.
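Here's a minimal sketch of that behavior as an agent loop. The `call_model` and `run_tool` callables and the `ToolCall`/`ToolResult` shapes are hypothetical stand-ins, not Roo Code internals or any real model API; the point is what happens after a call fails.

```python
# Sketch: treat a failed tool call as feedback, not a template.
# `call_model` and `run_tool` are hypothetical callables supplied by the
# caller; ToolCall/ToolResult are illustrative shapes, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    tool: str
    args: dict

@dataclass
class ToolResult:
    ok: bool
    output: str = ""
    error: str = ""

def run_step(
    task: str,
    context: list[str],
    call_model: Callable[[str, list[str]], ToolCall],
    run_tool: Callable[[ToolCall], ToolResult],
    max_attempts: int = 3,
) -> ToolResult:
    failed = set()
    for _ in range(max_attempts):
        call = call_model(task, context)
        signature = (call.tool, tuple(sorted(call.args.items())))

        if signature in failed:
            # Refuse to repeat a call that already failed verbatim.
            context.append("That exact call already failed. Try a different approach.")
            continue

        result = run_tool(call)
        if result.ok:
            context.append(result.output)  # successes feed the next step
            return result

        # The failure becomes information: keep a short error summary,
        # not the raw failed exchange, so it can't become the new pattern.
        failed.add(signature)
        context.append(f"{call.tool} failed: {result.error}. Adjust before retrying.")

    raise RuntimeError("Could not recover within budget; escalate to a human.")
```

The branch that matters is the last one: the error goes back into context as a short summary, and the verbatim failed call is tracked so it can't be replayed as the new baseline.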
"It's not about, you know, failing the occasional right to file. It's about not getting stuck on a loop after it fails a single call, which we've seen with some models after they fail once, they just continue failing forever."
Dan
A non-recoverable model does the opposite. It sees the failed call, treats it as a valid approach, and doubles down. Each retry reinforces the pattern. The context window fills with repeated failures, which makes recovery even less likely.
The evaluation shift
When teams evaluate models for agentic work, the standard benchmarks focus on raw success rates. Can the model solve the task? What percentage of the time?
Those benchmarks miss the loop problem entirely.
A model that passes 95% of tasks but loops indefinitely on the other 5% might be worse than a model that passes 90% and recovers gracefully on the remaining 10%. The first model wastes hours of wall-clock time and burns through your token budget on failures that will never resolve. The second model either finds an alternative approach or surfaces the failure clearly so you can intervene.
"So if the model, you know, fails occasionally, probably more often than Sonic 4, but it's able to keep going even if it fails, I think that makes it a useful model for sure."
Dan
The tradeoff is explicit: you might accept a lower raw success rate in exchange for graceful degradation. The question isn't "how often does this model succeed?" It's "what happens when it fails?"
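To make that tradeoff concrete, here's a back-of-the-envelope comparison. Every number below is an illustrative assumption, not a benchmark result; plug in your own.

```python
# Illustrative wall-clock comparison; all numbers are assumptions, not measurements.
tasks = 100
normal_minutes = 10     # assumed time for a task that completes
loop_cap_minutes = 120  # assumed time before you notice a loop and kill it
recover_minutes = 15    # assumed time when a model retries differently or escalates

success_a, success_b = 0.95, 0.90

# Model A: 95% success, the other 5% loop until killed.
model_a = tasks * (success_a * normal_minutes + (1 - success_a) * loop_cap_minutes)

# Model B: 90% success, the other 10% recover or surface the failure quickly.
model_b = tasks * (success_b * normal_minutes + (1 - success_b) * recover_minutes)

print(model_a, model_b)  # 1550.0 vs 1050.0: the "worse" success rate costs less time
```

Under these assumed numbers, the model with the lower headline success rate finishes the same batch of work in roughly two-thirds of the wall-clock time.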
Testing for recovery
If you're evaluating models for production agentic workflows, add failure recovery to your test suite.
Run tasks where the expected file doesn't exist. Run tasks where the first command returns an unexpected error format. Run tasks where the context starts with a misleading assumption.
Then watch what happens on the second attempt.
Does the model parse the error and adjust? Does it try a different approach? Or does it loop?
"Basically if it can sort of purge that poisoning and move on instead of taking that failed tool call as the new the new... this is the new pattern to follow and I'm just going to keep beating my head against the ball."
Dan
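Here's one way to wire those cases into a test, as a sketch. It assumes your harness records each tool call as a (name, args, ok) tuple and exposes a `run_agent_task` hook; both of those are assumptions about your setup, not a real API.

```python
# Sketch of a recovery-focused eval check. The record format and the
# `run_agent_task` hook in the test below are hypothetical, not a real API.
from collections import namedtuple

Call = namedtuple("Call", "name args ok")

def detect_loop(calls, threshold=2):
    """True if the same failing call was repeated back to back `threshold` times."""
    repeats = 0
    for prev, curr in zip(calls, calls[1:]):
        if not prev.ok and (prev.name, prev.args) == (curr.name, curr.args):
            repeats += 1
            if repeats >= threshold:
                return True
        else:
            repeats = 0
    return False

# Synthetic example: three identical failed reads are a loop;
# a failed read followed by a different approach is recovery.
looping = [Call("read_file", ("missing.yaml",), False)] * 3
recovering = [
    Call("read_file", ("missing.yaml",), False),
    Call("list_dir", (".",), True),
    Call("read_file", ("config/missing.yaml",), True),
]
assert detect_loop(looping) and not detect_loop(recovering)

def test_recovers_from_missing_file(run_agent_task):
    # The task deliberately references a file that does not exist.
    run = run_agent_task("summarize ./does_not_exist/config.yaml")
    assert not detect_loop(run.tool_calls), "model repeated the same failed call"
    # Passing means it tried another path, asked for clarification, or
    # surfaced the failure clearly; not that it necessarily completed the task.
```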
Why this matters for your workflow
If you're running agentic tasks on real codebases, you will hit edge cases. Files get renamed. APIs return unexpected formats. Test environments drift from production. The model will fail.
The question is whether that failure costs you five seconds of recovery time or five hours of loop-watching before you realize nothing is going to change.
When you evaluate models, test for recovery. Count the loops, not just the wins.
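Reusing the `detect_loop` sketch from the testing section, counting loops alongside wins is only a few lines; `runs` here is an assumed list of eval records with `.succeeded` and `.tool_calls` fields.

```python
# Count the loops, not just the wins. Reuses detect_loop from the sketch above;
# the shape of `runs` is an assumption about your eval records.
def summarize(runs):
    wins = sum(1 for r in runs if r.succeeded)
    loops = sum(1 for r in runs if detect_loop(r.tool_calls))
    return {
        "success_rate": wins / len(runs),
        "loop_rate": loops / len(runs),  # the number a raw benchmark hides
    }
```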
How Roo Code closes the loop on model failures
Roo Code is designed to close the loop - running commands, observing results, and iterating based on what actually happened. When a tool call fails, Roo Code can parse the error output and adjust its approach rather than repeating the same failed pattern.
With BYOK (bring your own key), you can test different models against your actual codebase to evaluate their recovery behavior. Some models handle failures gracefully; others loop. Roo Code lets you swap models without changing your workflow, so you can find the one that recovers best for your specific use cases.
The key insight for production workflows: an agent that closes the loop treats failures as feedback, not as templates to repeat.
Comparing model evaluation approaches
| Dimension | Raw success rate focus | Recovery-aware evaluation |
|---|---|---|
| Primary metric | Task completion percentage | Task completion plus failure behavior |
| Failure handling | Not measured | Explicit test cases for recovery |
| Token efficiency | Ignored | Tracks cost of failure loops |
| Wall-clock time | Ignored | Measures time to resolution or escalation |
| Production readiness | Misleading | Accurate predictor of real-world performance |