The Reasoning Models That Think in Drafts: How Iterative Self-Refinement Is Rewriting What AI Can Solve

Give a frontier reasoning model a hard mathematical olympiad problem and something unusual happens: it argues with itself. It proposes an approach, notices a flaw three steps in, backtracks, tries a different decomposition, and eventually converges on a proof that checks out. The final answer looks clean. The path to it looked almost human.

This is the capability at the center of the most interesting shift in frontier models right now: not scale, not multimodality, but the ability to treat generation as a drafting process rather than a single-shot answer. The idea goes by several names: extended thinking, chain-of-thought with verification, iterative self-refinement. The underlying intuition is that reasoning quality isn’t just a function of model size; it’s a function of how much compute you’re willing to spend at inference time, and how well the model can use that compute to critique and revise its own outputs.

OpenAI’s o-series models offered a compelling proof of concept. The o3 generation, evaluated on ARC-AGI and competition mathematics benchmarks, showed that a model trained to spend more tokens on hard problems, essentially to “think longer,” could reach scores that no model of any size had reported the year before. DeepSeek’s R1 series independently arrived at similar conclusions through large-scale reinforcement learning that rewarded correct answers on reasoning tasks, and reportedly did so with a fraction of the training budget of its Western counterparts. The message from both directions: the architecture of thought matters as much as the scale of the model.

What makes the current generation genuinely exciting is how the self-refinement loop is becoming more structured. Early chain-of-thought was essentially free-form inner monologue — useful, but noisy. The newer approach trains models to produce reasoning that has identifiable phases: problem decomposition, candidate generation, verification, and revision. Some systems are beginning to learn when to call external tools mid-reasoning, using a Python interpreter or symbolic solver to check a step before continuing. The reasoning trace becomes less like a stream of consciousness and more like a working draft with margin notes.
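The phases described above (candidate generation, verification, revision) can be illustrated with a toy loop. This is only a minimal sketch, not how any lab's model works internally: the names refine, make_cube_root_task, and Trace are hypothetical, the "model" is a stand-in that proposes integers, and the "tool call" is an exact arithmetic check of the kind a Python interpreter could provide mid-reasoning.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    """Drafts plus verifier notes: the 'working draft with margin notes'."""
    steps: List[Tuple[int, str]] = field(default_factory=list)


def refine(
    propose: Callable[[], int],
    verify: Callable[[int], Tuple[bool, str]],
    revise: Callable[[int, str], int],
    max_rounds: int = 20,
) -> Tuple[Optional[int], Trace]:
    """Generate a draft, check it, and revise until it verifies or rounds run out."""
    trace = Trace()
    draft = propose()
    for _ in range(max_rounds):
        ok, note = verify(draft)
        trace.steps.append((draft, note))
        if ok:
            return draft, trace
        draft = revise(draft, note)
    return None, trace  # budget exhausted without a verified answer


def make_cube_root_task(target: int):
    """Toy problem (an assumption for illustration): find x with x**3 == target."""
    bounds = {"lo": 0, "hi": target}

    def propose() -> int:
        return (bounds["lo"] + bounds["hi"]) // 2

    def verify(x: int) -> Tuple[bool, str]:
        cube = x ** 3  # exact check, standing in for an external tool call
        if cube == target:
            return True, "ok"
        return False, ("too high" if cube > target else "too low")

    def revise(x: int, note: str) -> int:
        # Use the verifier's feedback to narrow the search before redrafting.
        if note == "too high":
            bounds["hi"] = x - 1
        else:
            bounds["lo"] = x + 1
        return propose()

    return propose, verify, revise


answer, trace = refine(*make_cube_root_task(12167))
print(answer, len(trace.steps))
```

The point of the sketch is the separation of roles: the verifier never proposes, and the reviser only sees the draft and the verifier's note. In a real system the proposer and reviser would be the model itself, and the verifier could be the model, a tool, or both.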

There’s a deep connection here to “System 2” thinking in dual-process accounts of cognition: slow, deliberate, compositional reasoning as opposed to fast pattern-matching retrieval. For years, critics argued that transformers were fundamentally incapable of this kind of processing; they were autocomplete engines, not reasoners. The empirical results of the last 18 months have complicated that story significantly. Whether what these models do constitutes “real” reasoning is a philosophical question that may never fully resolve, but what they can accomplish (multi-step mathematical derivations, competitive programming solutions, complex scientific problem-solving) is real and measurable.

The frontier is now pushing toward something even more interesting: reasoning models that can operate over much longer horizons while staying coherent. The challenge isn’t just thinking longer; it’s thinking consistently. That means not contradicting an assumption made 8,000 tokens ago, maintaining a structured goal hierarchy over an extended derivation, and knowing when a promising-looking branch is actually a dead end. Context management during extended reasoning is an open research problem that several labs are attacking from different angles, including learned attention mechanisms that prioritize earlier critical steps and explicit working-memory modules that track intermediate conclusions.
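To make the working-memory idea concrete, here is a deliberately simple sketch of a store that records intermediate conclusions and flags a later step that contradicts an earlier one. It is an illustration under stated assumptions, not a description of any lab's module: WorkingMemory, Entry, and record are hypothetical names, and conclusions are reduced to key/value pairs, which real reasoning traces are not.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    value: object
    step: int  # position in the reasoning trace where this was concluded


class WorkingMemory:
    """Tracks intermediate conclusions and notes contradictions with earlier ones."""

    def __init__(self) -> None:
        self._facts: Dict[str, Entry] = {}
        self.conflicts: List[str] = []

    def record(self, key: str, value: object, step: int) -> bool:
        """Store a conclusion; return False if it contradicts an earlier one."""
        prior = self._facts.get(key)
        if prior is not None and prior.value != value:
            self.conflicts.append(
                f"step {step}: {key}={value!r} contradicts "
                f"step {prior.step}: {key}={prior.value!r}"
            )
            return False
        if prior is None:
            self._facts[key] = Entry(value, step)
        return True


mem = WorkingMemory()
first = mem.record("n is even", True, step=1)
repeat = mem.record("n is even", True, step=12)      # consistent restatement
clash = mem.record("n is even", False, step=40)      # contradicts step 1
print(mem.conflicts)
```

The earliest conclusion is kept as the reference point, so a late contradiction is reported against the step where the assumption was first made rather than silently overwriting it.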

The downstream implications extend well beyond benchmark performance. A model that genuinely improves with more inference compute becomes a different kind of tool: you can budget reasoning effort to problem difficulty, pour resources into hard scientific questions, and expect better answers as the investment grows, even if not in strict proportion. That’s a fundamentally new relationship between compute and capability. As inference infrastructure gets cheaper and faster, the ceiling on what these systems can work through in a single session keeps rising. The hard problems, the ones we’ve always said would need something more, are starting to look negotiable.
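Budgeting reasoning effort to problem difficulty can be sketched as a simple allocation rule. The example below is an assumption-laden illustration: allocate_budget is a hypothetical helper, the difficulty scores are made up, and splitting tokens in proportion to difficulty above a fixed floor is just one plausible policy, not a method the sources above are known to use.

```python
from typing import Dict


def allocate_budget(
    difficulties: Dict[str, float],
    total_tokens: int,
    floor: int = 200,
) -> Dict[str, int]:
    """Give every problem `floor` tokens, then split the rest by difficulty score."""
    n = len(difficulties)
    if n == 0:
        return {}
    if floor * n > total_tokens:
        raise ValueError("total_tokens is too small to give every problem the floor")
    spare = total_tokens - floor * n
    weight_sum = sum(difficulties.values())
    if weight_sum <= 0:
        # No usable difficulty signal: fall back to an even split.
        return {name: floor + spare // n for name in difficulties}
    return {
        name: floor + int(spare * score / weight_sum)
        for name, score in difficulties.items()
    }


# Hypothetical difficulty scores, e.g. from a cheap first-pass estimate.
budget = allocate_budget({"easy": 1, "medium": 3, "hard": 6}, total_tokens=10_000)
print(budget)
```

With these inputs the hardest problem receives more than half of the total budget while the easy one still gets enough to produce an answer. A production policy would likely also cap per-problem spend and reallocate unused tokens.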