The Benchmark That Ate Itself: Why AI Progress Metrics Keep Collapsing

When GPT-4 was released, one of the first things researchers did was run it on MMLU — the Massive Multitask Language Understanding benchmark, a sprawling set of multiple-choice questions covering medicine, law, history, and dozens of other domains. The model scored impressively. Within months, that score had become almost meaningless. Not because the model got…
