The Thinking That Happens Between Tokens: How Latent Reasoning Is Rewriting What Models Can Do

There’s a moment in human problem-solving that’s easy to overlook: the pause before the answer. Not silence, exactly — something is happening in there, some rearrangement of partial ideas that never surfaces as words. For decades, language models had no equivalent. They generated token after token in a single forward pass, their “thinking” invisible even to themselves. That constraint is now breaking open, and what’s emerging on the other side is genuinely strange and exciting.

The broad category is called latent reasoning — the idea that a model can perform meaningful cognitive work in a continuous, high-dimensional space before committing anything to language. This goes well beyond chain-of-thought prompting, where you ask a model to reason step by step in plain text. That approach is powerful, but it binds the reasoning to the vocabulary. Every intermediate step has to be expressible as a grammatical fragment of English (or whatever language you’re working in), which is a wild constraint to impose on computation. Concepts that don’t tokenize cleanly — spatial relationships, abstract mathematical structure, subtle analogical mappings — get forced through a lossy bottleneck.

What researchers have been developing instead are architectures that let models iterate in embedding space before decoding. One prominent line of work involves “pause tokens” or scratchpad vectors: learned representations that the model can read and write across multiple passes, treating them as working memory rather than output. The model doesn’t have to say what it’s thinking. It just thinks, in whatever geometric structure the training process found useful, and then speaks.

The results, even in early instantiations, have been striking. Tasks that require multi-step planning — complex math, combinatorial puzzles, certain classes of code generation — show measurably better performance when models are given this kind of internal workspace. And the gains aren’t just quantitative. The error patterns change. Models trained with latent reasoning stages fail differently than standard autoregressive models: less susceptible to surface-level pattern matching, more likely to catch their own contradictions before they propagate into the final output.

There’s a deeper architectural story here too. The transformer, for all its power, was never designed with deliberation in mind. Its attention mechanism is brilliant at integrating information across a sequence, but every layer operates on the same sequence in the same direction. Newer hybrid designs are exploring what happens when you let certain layers loop — processing the same representation repeatedly, with each pass conditioned on the previous one. This is reminiscent of recurrent networks in structure but implemented in ways that preserve the parallelism and stability that made transformers trainable at scale. Getting that balance right is a serious engineering challenge, and several research groups are still working out the details.

What makes this moment feel genuinely pivotal is that latent reasoning doesn’t just improve existing capabilities incrementally. It potentially changes the category of problem that’s addressable. Tasks requiring deep search — not just retrieval, but genuine exploration of a solution space — have historically needed external scaffolding: explicit tree-search algorithms, verifiers, tool calls. The promise of latent reasoning is that some of that search could be internalized, baked into a model’s forward pass in a way that’s fluid and trainable end-to-end rather than bolted on as infrastructure.

We’re still in the early stages of understanding what these architectures can do at full scale. The theoretical picture is incomplete, the training dynamics are finicky, and the evaluation benchmarks haven’t caught up. But the trajectory is clear enough to be exciting: models that think before they speak, in representations richer than language, solving problems that currently require elaborate external systems to even approach. That’s a significant frontier, and the best work on it is still ahead.