the (forgotten) ancestry of reinforcement learning, part 1
skinner, hebb, and what lies within
(another one that’s been in my drafts for a while, and I’ve finally gotten round to finishing it off - it’s a bit technical, but still I hope readable; part two next time)
I suspect B.F. Skinner’s name is rarely mentioned in the halls, cafes, corridors and vedic retreats of modern Silicon Valley (if his name is known at all). Maybe his name tickles some vague memory to do with pigeons, or the baby crib, or maybe for a curious word, viz., ‘operant’ conditioning.
I first encountered Skinner as a psych undergrad, and still have my copy of ‘Beyond Freedom and Dignity’ - his then controversial (and, tbh, bonkers and now probably little-read) book on how we should re-organise our societies along operant-conditioning and behaviour-analytic lines. As Wiki summarises it: ‘Skinner argues that entrenched belief in free will and the moral autonomy of the individual (which Skinner referred to as "dignity") hinders the prospect of using scientific methods to modify behavior for the purpose of building a happier and better-organized society.’ Kind of a vision of a technocratic hellscape, to be honest (I think there is a better way forward than this behavioural technocracy - click here!). He also devised a philosophy of behavioural analysis called ‘radical behaviourism’, of which there are deservedly few adherents today (dive in here, if you want).
I won’t relitigate the Skinner controversies here (of which there were so many) - I find him a curious scientific figure, odd in his refusal to properly consider ‘internal events’, and his derisive references to the ‘conceptual nervous system’ seem motivated more by love of a pun than by serious argument (and a confusing position - for he often argued the skin is not that important a barrier - so why not peer inside and see what the mechanisms giving rise to behaviour actually are?!).
When addressing ‘internal events’, Skinner sometimes paid lip service to them by redescribing them as ‘private behaviour’, and thinking as ‘subvocal responding’; but claiming that everything can be ‘translated’ into the language of operant conditioning and behavioural analysis ends in a translation without scientific meaning. A theory that can redescribe almost any result in its own vocabulary is insulated against refutation, ‘explaining’ everything while actually explaining nothing.
My book: Talking Heads: The New Science of How Conversation Shapes Our Worlds is available for purchase!
However, if you read accounts of reinforcement learning, or papers or press releases announcing some computational breakthrough today, you’ll find names like Bellman, or phrases like dynamic programming, Markov decision processes, temporal-difference learning, and, of course, deep reinforcement learning, but no mention of Skinner. He even seems forgotten sometimes in psychology as well: in the latest issue of Current Directions in Psychological Science, there’s a paper entitled Signatures of Reinforcement Learning in Natural Behavior which doesn’t mention Skinner at all¹.
Reinforcement learning is usually presented as one of the great achievements of modern machine learning: an agent, an environment, a reward signal, a policy, and a sequence of actions whose value becomes clear over time.
But this standard history is far too compressed: it leaves Skinner as mostly a scientific ghost, half-seen from the corner of an eye, a semi-forgotten will-o’-the-wisp, with the man who turned reinforcement into a central scientific idea fading into the background.
Skinner is rarely if ever credited - yet he gave us the phrase ‘reinforcement learning’, and its central quantity: the probability of a response occurring over time, and how that probability is shaped by its consequences. He also devised and quantified schedules of reinforcement, and created the so-called ‘operant box’ to control the environment and impose those schedules: the agent acts; the environment answers back; rewards appear or fail to appear; and future behaviour is altered accordingly. He also wrote, with Charles Ferster, the famous monograph ‘Schedules of Reinforcement’ - a doorstop of a book laying out exhaustively the differing kinds of reinforcement schedule that can be devised and tested to control behaviour.
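For the computationally minded, a schedule of reinforcement is easy to caricature: a rule mapping responses (and, for interval schedules, elapsed time) onto reward delivery. Here’s a toy sketch of fixed-ratio and variable-ratio schedules - the function names and framing are mine, not Skinner’s:

```python
import random

def fixed_ratio(n):
    """FR-n: deliver a reinforcer on every n-th response."""
    count = 0
    def schedule(responded):
        nonlocal count
        if responded:
            count += 1
            if count == n:
                count = 0
                return 1  # reinforcer delivered
        return 0
    return schedule

def variable_ratio(n):
    """VR-n: each response pays off with probability 1/n (n on average)."""
    def schedule(responded):
        return 1 if responded and random.random() < 1.0 / n else 0
    return schedule

fr5 = fixed_ratio(5)
rewards = [fr5(True) for _ in range(10)]
# -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]: every fifth press pays off
```

Ferster and Skinner’s point, recast in these terms, was that the *shape* of this rule - ratio vs interval, fixed vs variable - produces dramatically different, and remarkably reproducible, patterns of responding.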
Reinforcement learning, it should be clear, did not spring fully formed from mathematics and engineering: it grew from an older psychology of trial and error, action and consequence. Recovering that history gives Skinner his place, and it allows us to understand more clearly where modern reinforcement learning originated, where it went thereafter, and what its deep limitations are for understanding brain and behaviour.
thorndike first; skinner after
Skinner was not the first to make the basic claim that behaviour is shaped by its consequences: E.L. Thorndike got there earlier with his law of effect - responses followed by ‘satisfying’ outcomes become more likely to recur, whereas those followed by ‘discomfort’ or some other noxious outcome become less likely over time.
Skinner’s achievement was different, and in some respects more radical, because he took that general insight and turned it into a disciplined experimental science of behaviour. Conceptually, he insisted on a distinction often blurred by people who speak too casually about behaviourism: the distinction between respondent and operant behaviour, where respondent behaviour belongs largely to the experimental tradition of Pavlov, reflexes, and classical conditioning, whereas operant behaviour belongs to action, exploration, and selection by consequence (I’m ignoring pavlovian-instrumental transfer, a hot topic in brain imaging a few years back). The organism does something; the environment answers; the probability of that something happening again changes as a result of the feedback from the environment. Skinner was not interested in stimulus and response as though behaviour arises as a chain of triggered reflexes; rather, he was interested in behaviours emitted and differentially stabilised by their consequences.
why skinner faded from the computational story
And yet Skinner is rarely treated as a central forebear of reinforcement learning in AI.
There are at least two reasons that I can see for this:
The first is straightforward: formalisation wins prestige. Bellman and those who followed him gave us the mathematics of sequential decision-making, providing the field with a computable machinery: value functions, recursive decomposition, optimal policies under uncertainty, allowing reinforcement learning to become an engineering and software discipline.
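That ‘computable machinery’ deserves a concrete glimpse. The Bellman equation defines a state’s value recursively: the best action’s immediate reward plus the discounted value of wherever that action leads. A minimal value-iteration sketch, on a two-state world I’ve made up purely for illustration:

```python
# A deterministic toy MDP: two states, two actions.
reward = [[0.0, 1.0],   # state 0: action 0 pays 0, action 1 pays 1
          [2.0, 0.0]]   # state 1: action 0 pays 2, action 1 pays 0
trans  = [[0, 1],       # trans[s][a] -> next state
          [0, 1]]
gamma = 0.9             # discount factor

V = [0.0, 0.0]
for _ in range(200):    # iterate the Bellman optimality update to convergence
    V = [max(reward[s][a] + gamma * V[trans[s][a]] for a in (0, 1))
         for s in (0, 1)]
# V now holds the optimal value of each state under the best policy
```

Because the update is a contraction, the values converge no matter where you start - and that guarantee, rather than anything about organisms, is what made the formulation so attractive to engineers.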
The second reason is that while Skinner gave us the psychology and behavioural engineering of reinforcement learning, he resisted, to the point of scientific perversity, the kinds of internal mechanistic explanations that later cognitive science, neuroscience, and AI found utterly indispensable for a full and complete understanding of behaviour.
Skinner wanted a science of behaviour grounded in lawful relations between organism and environment, deriding attempts to focus on the ‘conceptual nervous system’, but never caring that such a science would only ever be one of dynamic relations between immediate observables, incapable of ever saying how or why these dynamic relations are as they are.
Imagine a science of automobiles where you can press the accelerator, use the brakes, change gears, steer - but you are never allowed to consider the engine under the bonnet. Skinner was deeply suspicious of hidden mediators (crank shafts, power transmission, internal combustion), internal symbols (what, onboard computers?!), and what he took to be speculative talk about mental machinery (the car measuring its own fuel levels, or reporting on its traction controls).
If something could not be ‘directly’ observed, measured, and controlled through environmental contingencies, he regarded it as a distraction from the real ‘science of behaviour’. It needs to be said aloud that this view is just nuts, a self-limiting view for reasons that make no sense: a comprehensive science of behaviour needs to understand what is going on inside the skull and in the body. There’s another point as well: schedules of reinforcement themselves are abstractions! They can’t be directly observed - you have to devise them, figure out how to measure them, represent them in a way that compresses everything an organism can do into a single response class, and describe the resulting behaviour as an abstraction and summation of what the organism might be doing (and remember - we’re talking about pigeons tapping buttons or rats pressing levers!)
the black box tautology
Radical behaviourism, especially in its more expansive forms, becomes tautological and untestable: if an organism behaves in a particular way, then that behaviour is attributable to its history of reinforcement - and that’s it (ok, there were occasional evolutionary reasons invoked, but that’s not especially helpful).
Computer science, by contrast, wants explicit variables, operational structure, and computable mechanisms; it wants a definitive operating architecture for software, firmware, and hardware alike (incidentally, so do car mechanics!). Skinner offered laws of behavioural selection, but he did not care much about the internal schematics of the agent at all, so any properly mechanistic account was just not possible for him.
Furthermore, you cannot understand sophisticated artificial systems solely by describing the rewards used to shape them: you must also understand the internal architecture through which those signals are propagated, transformed, retained, or ignored.
This is one reason he sits so awkwardly in the modern AI story. He helped define the problem of consequence-shaped learning, but he resisted the explanatory move that later made the field technically fruitful: the move inside to describe the functioning of the underlying mechanism.
hebb and the missing mechanism
While Skinner was refining the science of operant conditioning, D.O. Hebb was proposing a plausible account of internal neural organisation; his work on cell assemblies and reverberatory activity offered something Skinner’s framework struggled to accommodate: internally-sustained, intrinsic neurocognitive processing.
Hebb showed how neural activity could persist after the eliciting stimulus had vanished; how assemblies of neurons could maintain and circulate information in predictable and understandable ways; and how the organism was not merely being reinforced by the world but was carrying forward patterns of activity across time.
This matters enormously if you care about memory, delay, anticipation, or complex cognition. If a reward arrives now, but depends on an action taken much earlier, then the organism, or the artificial agent, must somehow bridge that temporal gap. A purely momentary account will not do.
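This gap is precisely what temporal-difference learning was later built to bridge: each state’s value estimate is nudged toward the immediate reward plus the estimated value of the next state, so credit for a delayed reward seeps backwards along the chain of states that preceded it. A toy TD(0) sketch on a five-step corridor with reward only at the end - the setup is my own illustration, not from any particular paper:

```python
alpha, gamma = 0.1, 1.0
V = [0.0] * 6          # value estimates for states 0..4, plus terminal state 5
for _ in range(500):   # repeatedly walk the corridor 0 -> 1 -> ... -> 5
    for s in range(5):
        r = 1.0 if s == 4 else 0.0        # reward only on the final step
        target = r + gamma * V[s + 1]     # bootstrapped one-step target
        V[s] += alpha * (target - V[s])   # TD(0) update
# V[0] approaches 1.0: early states come to 'know about' the delayed reward
```

Notice what the update quietly presupposes: a persistent internal quantity, carried across time, that stands in for events not currently happening. That is exactly the kind of entity Skinner refused to countenance.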
Hebb supplied part of the missing mechanism: he made it easier to think about the nervous system as active, persistent, and semi-autonomous, offering a picture of how the organism might represent events over time, rather than merely register the latest contingency. His book, The Organization of Behavior, remains a classic, much-cited today.
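And unlike radical behaviourism, Hebb’s idea is directly instantiable as a learning rule: a connection strengthens in proportion to the correlation of pre- and post-synaptic activity - ‘cells that fire together wire together’. A minimal sketch (the decay term is my addition to keep weights bounded; Hebb’s original statement had no such term):

```python
eta, decay = 0.1, 0.01
w_corr, w_half = 0.0, 0.0
for t in range(1000):
    pre = 1.0                            # presynaptic cell fires every step
    post_corr = 1.0                      # partner that always fires with it
    post_half = 1.0 if t % 2 else 0.0    # partner that fires only half the time
    w_corr += eta * pre * post_corr - decay * w_corr  # Hebbian growth + decay
    w_half += eta * pre * post_half - decay * w_half
# w_corr settles near eta/decay = 10; w_half near half that:
# the weight comes to encode the correlation between the two cells
```

The point is not the arithmetic but the ontology: the weight is an internal state that outlives any single stimulus, which is precisely what Skinner’s framework had no room for.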
intrinsic activity and the restless brain
Modern neuroscience has strengthened the case against any picture of the brain as a device passively awaiting external instruction. Intrinsic neural activity is not just noise around the margins of cognition, for it is part of cognition’s basic fabric. The brain is active even in the absence of a task. Default-mode processing, spontaneous memory sampling, internal simulation, predictive drift: the organism brings itself to every encounter.
This is deeply anti-Skinnerian in spirit: the organism is not merely a locus at which contingencies take effect; it is an endogenously active system, where its internal organisation matters before reward arrives, during learning, and after the immediate stimulus has gone.
why the cognitive revolution left him behind
Once cognitive science emerged, Skinner’s place in the story became even more awkward. The new sciences of mind wanted representation, mechanism, computation, memory buffers, internal models, and latent structure. Skinner built a powerful account of environmental selection, but he was relentlessly suspicious of precisely those explanatory entities. Is it surprising later researchers borrowed the language of reinforcement learning while leaving most of Skinner’s thinking behind?
Bellman is easier to celebrate than Skinner because Bellman can be slotted neatly into the mathematics of optimisation. Hebb is easier to celebrate because Hebb provides a principled precursor of neural computation, including instantiable learning rules that can be empirically investigated. Skinner, by contrast, carries the baggage of a quarrel over whether behaviour can be understood without taking seriously what goes on underneath the bonnet. The field quietly decided it had to be taken seriously.
Next time (paywalled): ‘the strange irony of contemporary ai’, and more…
¹ Maybe this is unfair of me - we don’t expect every paper in evo biol to mention Darwin, do we?