Data Exhaust
Tuesday, January 28
From Gwern’s (extraordinarily prescient) 2020 essay, The Scaling Hypothesis:
Humans, one might say, are the cyanobacteria of AI: we constantly emit large amounts of structured data, which implicitly rely on logic, causality, object permanence, history—all of that good stuff. All of that is implicit and encoded into our writings and videos and ‘data exhaust’. A model learning to predict must learn to understand all of that to get the best performance; as it predicts the easy things which are mere statistical pattern-matching, what’s left are the hard things. AI critics often say that the long tail of scenarios for tasks like self-driving cars or natural language can only be solved by true generalization & reasoning; it follows then that if models solve the long tail, they must learn to generalize & reason.
Nearly five years later, it’s becoming increasingly clear that AI models are solving the long tail — the hard stuff. Multiple benchmarks are now being built expressly to test the very limits of expert human knowledge, because the older ones are running out of headroom.
It’s astonishing that all of this could be reconstructed merely from human “data exhaust.” Cognition leaves useful, material changes in the world; and an unthinking algorithm, by looking closely enough at those changes, can essentially reverse-engineer the cognition that produced them.
The relationship between intelligence and the world is a two-way street.
Maybe we should have always known this was possible — after all, isn’t it what human babies do? But to see a computer doing it is all the confirmation we could ask for. There are traces of thought in everything around us, and from them, new thinkers can emerge.