Why Million-Token Context Windows Fail in Long Conversations
A million tokens sounds like infinite memory.
It isn't.
The difference between dump and dialogue
You paste a 200-page specification into the context window. You ask one question. The model answers correctly, citing page 47. The million-token context worked exactly as advertised.
Now try something different. Start a task. Go back and forth for an hour. Add new files. Run commands. Review diffs. Ask follow-up questions. Watch the conversation grow to 300K tokens.
The model starts missing things. It references files you mentioned an hour ago but ignores the correction you made ten minutes ago. It suggests a fix you already tried. It loses focus.
Same model. Same context window. Different behavior.
One-shot vs. extended tasks
The million-token context window works differently depending on how you use it.
One-shot queries: you dump a large document, ask a question, get an answer. The model processes the full context once, retrieves what it needs, and responds. This works.
Extended conversations: you go back and forth, the context grows incrementally, and the model has to track state across many turns. The attention mechanism that works for retrieval starts losing track of what matters.
"The million context works differently when you're dumping it, you're one-shotting, you're saying, here's a giant trove of stuff and you ask it a question... But when you have an ongoing conversation back and forth and it grows to hundreds of thousands, it doesn't maintain focus. So, for extended tasks, it's more like 200K in my experience."
Shank
The advertised number tells you the theoretical maximum. The effective number for agentic workflows is closer to 200K. Sometimes less.
Where the focus breaks
In one set of tests with subtask delegation, the highest any single subtask reached was around 80K tokens. That's not a limitation of the tool. That's where the workflow naturally settled to keep outputs coherent.
When context grows beyond what the model can effectively track, you see specific symptoms:
- The model repeats suggestions it already made
- It ignores constraints you stated earlier in the conversation
- It references outdated assumptions instead of recent corrections
- It becomes confidently wrong about the current state of the codebase
These aren't random failures. They're symptoms of attention degradation in long contexts.
One-shot retrieval vs. extended conversation
| Dimension | One-shot retrieval | Extended conversation |
|---|---|---|
| Context usage | Load once, query once | Incremental growth over many turns |
| Effective limit | Full advertised window (1M tokens) | 200K or less before degradation |
| Attention pattern | Focused retrieval from static dump | Distributed across evolving state |
| Failure mode | Missing information | Losing track of corrections and constraints |
| Recovery strategy | Re-query with different prompt | Start fresh task with summary |
The tradeoff
You have two options for extended tasks.
Option 1: Accept the effective limit. Keep conversations under 200K. Start fresh tasks when context gets stale. Split large workflows into focused subtasks.
Option 2: Invest in context management. Use summarization, hierarchical memory, or explicit state tracking to keep the relevant information in the attention window.
Neither option is free. Fresh tasks lose history. Context management adds complexity and tokens. The question is which cost you prefer.
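As a rough sketch of what Option 2 can look like in practice, here is a minimal rolling-summary context manager. The names and helpers here (`Message`, `estimateTokens`, `summarize`, the 4-characters-per-token heuristic, the budget constants) are assumptions for illustration, not any particular framework's API.

```typescript
// Sketch of Option 2: keep an extended conversation inside an effective
// token budget by folding older turns into a running summary.
// All names here (Message, estimateTokens, summarize) are hypothetical.

interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Rough heuristic: ~4 characters per token. Good enough for budgeting.
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Placeholder for an LLM call that compresses old turns into a short
// state summary (decisions made, constraints stated, files touched).
async function summarize(messages: Message[]): Promise<string> {
  /* call your model of choice here */
  return "Summary of earlier turns: ...";
}

const EFFECTIVE_BUDGET = 200_000; // effective limit, not the advertised 1M
const KEEP_RECENT = 20;           // always keep the most recent turns verbatim

async function maintainContext(history: Message[]): Promise<Message[]> {
  if (estimateTokens(history) <= EFFECTIVE_BUDGET) return history;

  // Fold everything except the most recent turns into one summary message,
  // so earlier corrections and constraints survive in compressed form.
  const older = history.slice(0, -KEEP_RECENT);
  const recent = history.slice(-KEEP_RECENT);
  const summary: Message = {
    role: "system",
    content: await summarize(older),
  };
  return [summary, ...recent];
}
```

The design choice that matters is keeping recent turns verbatim while compressing the older ones, so the latest corrections stay at full fidelity in the attention window.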
How Roo Code closes the loop on context management
Roo Code addresses context degradation through its subtask orchestration architecture. Instead of running a single conversation until attention drifts, Roo Code's Orchestrator mode breaks complex workflows into focused subtasks, each staying within the effective context range.
When you use BYOK (bring your own key) with Roo Code, the agent can delegate work to specialized modes - each subtask starts with a clean context and a focused objective. The parent orchestrator maintains the high-level state while child tasks execute without carrying accumulated noise.
The key difference: Roo Code closes the loop by running commands, tests, and iterations within bounded subtasks, then consolidating results back to the orchestrator. This design keeps each unit of work within the 80K range where coherence stays high, while preserving the overall workflow state across the full task.
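To make the shape of this pattern concrete, here is a hedged sketch of parent-child delegation with bounded subtask budgets. It illustrates the orchestration idea described above, not Roo Code's actual implementation; `runSubtask`, `SUBTASK_BUDGET`, and the briefing format are hypothetical names.

```typescript
// Sketch of the orchestration pattern: a parent keeps high-level state,
// children run with clean, bounded contexts. Not Roo Code's internal API.

interface SubtaskResult {
  objective: string;
  summary: string;    // what the child reports back to the parent
  tokensUsed: number;
}

const SUBTASK_BUDGET = 80_000; // where coherence stayed high in practice

// Each subtask gets a clean context: only its objective plus a compact
// briefing from the parent, never the parent's full transcript.
async function runSubtask(objective: string, briefing: string): Promise<SubtaskResult> {
  /* run an agent loop here, splitting or stopping if the budget is exceeded */
  return { objective, summary: "result summary", tokensUsed: 0 };
}

async function orchestrate(objectives: string[]): Promise<string[]> {
  const state: string[] = []; // the parent's consolidated high-level state
  for (const objective of objectives) {
    // The briefing carries forward only consolidated results, keeping each
    // child well under SUBTASK_BUDGET instead of inheriting accumulated noise.
    const briefing = state.join("\n");
    const result = await runSubtask(objective, briefing);
    state.push(`${objective}: ${result.summary}`);
  }
  return state;
}
```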
Why this matters for your team
For a team running multi-hour agent tasks, the million-token marketing number creates a false expectation. You plan for infinite memory. You get degraded focus after 200K.
The impact shows up in subtle ways: tasks that should converge keep looping. Bugs that were "fixed" reappear. The agent suggests approaches you already rejected. You start wondering if the model is broken when the real issue is context length.
If your team is shipping daily through agentic workflows, calibrating for effective context (not advertised context) saves hours of debugging attention drift.
The practical calibration
For extended agentic tasks, plan for 200K effective context. When you hit that threshold, consider starting a fresh task with a summary of the current state.
If you're delegating to subtasks, keeping each one under 80K tokens is where coherence seems to stay high. That's not a hard rule; it's an observation from practice.
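If it helps to make the calibration explicit, a couple of threshold checks capture the numbers above; the constants and function names are illustrative, and the right values depend on your model and workflow.

```typescript
// Calibration thresholds from this article, expressed as explicit checks.
// Token counts come from whatever estimator your stack provides.

const EFFECTIVE_LIMIT = 200_000; // plan for this in extended conversations
const SUBTASK_TARGET = 80_000;   // observed sweet spot per subtask

function shouldStartFreshTask(conversationTokens: number): boolean {
  // Past the effective limit, hand off a written summary to a fresh task
  // rather than pushing toward the advertised window.
  return conversationTokens >= EFFECTIVE_LIMIT;
}

function shouldSplitSubtask(subtaskTokens: number): boolean {
  return subtaskTokens >= SUBTASK_TARGET;
}
```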
The million-token window is real. But for ongoing back-and-forth work, the effective limit is lower. Plan accordingly.