Shane O'Mara

antecedents of reinforcement learning, part 2

and the strange ironies of contemporary ai

Shane O'Mara
Apr 28, 2026

We have now constructed the most elaborate conceptual nervous systems in human history: large language models and related deep architectures of astonishing internal complexity, full of hidden states, layered representations, recurrent dependencies, and richly structured latent spaces. Their internal organisation is far closer in spirit to Hebb than Skinner.


But how do we train them? Paradoxically, by behaviourist means. I’ll discuss this more below.

For prior context, see the last post: the (forgotten) ancestry of reinforcement learning, part 1 (Apr 4).

Reinforcement learning from human feedback is one obvious example: desired outputs are differentially stabilised by evaluative signals. We reward some behaviours, discourage others, and shape performance through consequence. In practice, we treat these immense internal systems as black boxes whose behaviour must be adjusted from the outside. We do not yet understand them well enough to engineer their inner workings directly, so we fall back on an updated form of behavioural shaping.
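
To make that shaping loop concrete, here is a minimal sketch - my own toy, not how any production RLHF system is implemented: a softmax "policy" over three canned responses is nudged, REINFORCE-style, by scalar human feedback standing in for the evaluative signal. Every name and number in it is an invented placeholder.

```python
import math
import random

# Toy sketch only: a softmax "policy" over three canned responses, shaped
# REINFORCE-style by scalar human feedback. The responses, feedback values,
# and learning rate are all invented for illustration.
responses = ["helpful answer", "evasive answer", "rude answer"]
logits = [0.0, 0.0, 0.0]            # one policy parameter per response
human_feedback = [1.0, 0.0, -1.0]   # stand-in evaluative signals
lr = 0.5

def policy_probs(logits):
    """Softmax over the logits."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(500):
    probs = policy_probs(logits)
    i = random.choices(range(len(responses)), weights=probs)[0]
    reward = human_feedback[i]      # the consequence of emitting response i
    # REINFORCE update: d log pi(i) / d logit_j = 1{j == i} - probs[j],
    # so rewarded outputs become more probable, punished ones less so
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * reward * grad

best = max(range(len(responses)), key=lambda j: logits[j])
print("shaped towards:", responses[best])   # the rewarded behaviour wins out
```

Real pipelines learn a reward model from human preference comparisons and optimise the language model against it, but the underlying logic is the same: behaviour followed by good consequences becomes more probable.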

Skinner has returned through the back door: when faced with systems too complex to understand completely, consequence-based control remains a workable way of changing their behaviour. We don't need a theory of mind to comprehend an LLM; we just need to shape its behaviour by manipulating the consequences of the text sequences it generates.

The irony: critics of radical behaviourism thought it had been superseded by richer accounts of internal mechanism - yet, in the age of deep learning, we often find ourselves managing immensely intricate internal architectures in thoroughly behaviourist ways.

tacit knowledge

What do we do about tacit, implicit, non-declarative knowledge?

Scaling word models, which is to say feeding machines vast quantities of declarative content, produces extraordinary fluency of text - reams and reams of it. But it has not, by itself, delivered general intelligence in any full, implementable sense, because a great deal of human intelligence is non-declarative. It is procedural, embodied, subcortical, and difficult or impossible to articulate explicitly. It is the peculiar weighted feel of an instrument in a surgeon's hand while probing deep tissues, the subtle timing of a comedian, the balance of a cyclist negotiating oncoming traffic at variable speeds on a bumpy road, the practical judgement of a seasoned craftworker, the composer's knowledge that judicious silence and dropped beats are essential to music too.

We know more - much more - than we can tell (and sometimes we hallucinate, telling more than we can know: we don't have access to our own cognitive processes; we can only report on their outputs).

There's loads more BTL (and, after the list, a rough sketch of these schedules as reward functions):

  • breaking the black box open;

  • how would Skinner’s schedules of reinforcement be integrated into modern RL?;

  • fixed ratio schedules and deterministic rewards;

  • variable ratio schedules and stochastic reward;

  • fixed interval schedules and temporal credit assignment;

  • variable interval schedules and partially observable environments;

  • reward schedules as environment design;

  • why this matters for modern AI;

  • the deeper point;

  • the history we ought to tell
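
Since the list maps Skinner's schedules onto modern RL ideas, here is one way the four classic schedules might be written as reward functions an environment exposes. This is my own gloss on the headings above, not the post's BTL material, and all parameters (n=5, mean intervals of 10 steps) are arbitrary choices for illustration.

```python
import random

# Rough sketch: Skinner's four classic schedules of reinforcement expressed
# as reward functions over discrete response steps.

def fixed_ratio(n=5):
    """Reward every nth response: a deterministic reward schedule."""
    count = 0
    def reward(step):
        nonlocal count
        count += 1
        return 1.0 if count % n == 0 else 0.0
    return reward

def variable_ratio(mean_n=5):
    """Reward any response with probability 1/mean_n: stochastic reward."""
    return lambda step: 1.0 if random.random() < 1.0 / mean_n else 0.0

def fixed_interval(t=10):
    """Reward the first response once t steps have elapsed: exploiting this
    schedule is a temporal credit-assignment problem."""
    last = 0
    def reward(step):
        nonlocal last
        if step - last >= t:
            last = step
            return 1.0
        return 0.0
    return reward

def variable_interval(mean_t=10):
    """Reward the first response after a randomly drawn interval: from the
    agent's point of view the payoff process is partially observable."""
    last, wait = 0, random.expovariate(1.0 / mean_t)
    def reward(step):
        nonlocal last, wait
        if step - last >= wait:
            last, wait = step, random.expovariate(1.0 / mean_t)
            return 1.0
        return 0.0
    return reward

# An agent that responds on every step earns reward under all four schedules,
# but the moment-to-moment contingencies it must learn differ sharply.
for name, r in [("FR", fixed_ratio()), ("VR", variable_ratio()),
                ("FI", fixed_interval()), ("VI", variable_interval())]:
    print(name, sum(r(step) for step in range(100)))
```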

