The Night a Backgammon Program Shook the Foundations of Expert Systems

In 1979, a program called BKG 9.8 defeated the reigning world backgammon champion, Luigi Villa, 7-1 in a match played in Monte Carlo. It was the first time a computer had beaten a human world champion at any board game. Most people have never heard of it. That near-total obscurity is itself instructive — because what happened next, and why the win was almost immediately dismissed, tells you something profound about the long, stubborn road to where AI stands today.

BKG 9.8 was built by Hans Berliner at Carnegie Mellon University. It ran on a PDP-10, used heuristic evaluation functions hand-crafted by Berliner himself, and won largely because backgammon involves dice — and on that particular night, the program got exceptionally lucky rolls. Berliner said so himself, in print, immediately after the match. The expert systems community largely agreed: this was a statistical fluke, not a demonstration of genuine strategic mastery. Move along, nothing to see here.

But here is the thing. Thirteen years later, a program called TD-Gammon arrived and made the dismissal look foolish in retrospect. Gerald Tesauro at IBM built it using temporal-difference reinforcement learning, the same family of ideas that would eventually power AlphaGo and modern RL pipelines. TD-Gammon trained almost entirely through self-play, generating its own experience, and reached a level of play that genuinely surprised the backgammon world. It didn’t just play well; it discovered novel opening strategies that top human players subsequently adopted. A machine had found things in the game that centuries of human play had missed.
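For readers who want to see the temporal-difference idea in miniature, here is a rough sketch in Python. It is not Tesauro's method: TD-Gammon paired temporal-difference learning with a neural network and self-play on real backgammon positions, while this toy uses a lookup table on a five-square random walk. The function name td0_random_walk, the step size, and the episode count are all assumptions made for illustration. What carries over is the core move: each state's estimate is nudged toward the estimate of the state that follows it, so the learner manufactures its own training signal.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.01, gamma=1.0, seed=0):
    """Tabular TD(0) value estimation on a 5-state random walk (toy example).

    Hypothetical illustration only: the name and default parameters are
    assumptions, not taken from TD-Gammon or any library.
    """
    rng = random.Random(seed)
    n_states = 5
    # Indices 0 and n_states + 1 are terminal; their value stays at 0.
    values = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        state = (n_states + 1) // 2  # start on the middle square
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            # Reward 1 only for stepping off the right-hand end.
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD error: gap between the bootstrapped target and the current guess.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]


if __name__ == "__main__":
    estimates = td0_random_walk()
    # For this walk, the true value of square i is i/6.
    true_values = [i / 6 for i in range(1, 6)]
    for est, true in zip(estimates, true_values):
        print(f"estimate={est:.3f}  true={true:.3f}")
```

Nobody tells the learner the right answer for any square. The only outside signal is the reward at the right-hand edge, and everything else is the table correcting itself against its own later guesses.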

The arc between BKG 9.8 and TD-Gammon is a microcosm of AI’s entire trajectory. The first win was brittle, luck-dependent, and required enormous human expertise baked in by hand. The second result was robust, self-generated, and produced knowledge that transferred back to humans. That shift, from hand-engineered heuristics to learned representations, is arguably the most important conceptual move in the history of the field. It just took more than a decade of quiet, unglamorous research for it to fully crystallize.

What makes this history worth sitting with is the compounding nature of the ideas involved. Tesauro’s temporal-difference approach drew directly on Richard Sutton and Andy Barto’s theoretical work on reinforcement learning, which itself built on earlier control theory. TD-Gammon then influenced how researchers thought about self-play more broadly. When DeepMind built AlphaZero in 2017, training from scratch on Go, chess, and shogi with no human game data, they were standing on a conceptual lineage that ran straight back through TD-Gammon. And the broader reinforcement learning toolkit that self-play belongs to has since migrated far beyond board games, into chip design and large language model fine-tuning via RLHF.

There is something almost vertiginous about tracing that line. A backgammon program on a 1970s mainframe, half-dismissed as a lucky fluke, sits at the root of a research tradition that now shapes how frontier models are trained at trillion-parameter scale. The core insight — that an agent can bootstrap its own training signal through interaction and self-evaluation — keeps arriving in new forms, in new domains, apparently without exhausting its generative power.

We are still early in understanding how far that insight reaches. Current work on process reward models, on AI systems critiquing and revising their own outputs, on agents running long-horizon experiments and evaluating the results: all of it carries the same genetic material that Tesauro was working with in a lab at IBM in 1992. The dice rolls in Monte Carlo feel very far away. The ideas they eventually catalyzed feel closer than ever.