Get the Behavior You Want: A Gentle Introduction to Reinforcement Learning

by Veronica Scerra

Learning by trial, error, and triumph

When I was in the experimental psychology master’s program at Old Dominion University, the most impactful book I read was called “Don’t Shoot the Dog: The New Art of Teaching and Training” by Karen Pryor. It explores the basis of reinforcement and punishment, and how they facilitate learning. Essentially, if you want to increase the likelihood of a behavior, you reinforce it through either positive or negative means.

For instance, I'm addicted to caffeine. I need my cup of coffee in the morning (and mid-morning, and afternoon). Caffeine is a positive reinforcer to our brains; in fact, caffeine is so rewarding that anything you do while drinking coffee will also be reinforced. Now let's couple that with some societal reinforcers: Seattle is a coffee town, which means I have several options besides my own kitchen for grabbing a little coffee treat. If I go somewhere that feels charming, with a lively atmosphere and friendly people, then my addiction can be satisfied with an extra helping of societal approval - more positive rewards. Pretty soon, I just don't feel like I'm enjoying a day as much if I don't go get my daily dose of legal addictive stimulant in a bright and friendly place. The positive rewards have stacked up, and my coffee drinking, as well as my cafe patronage, is well-reinforced.

On the other hand, if I decide to skip my coffee, maybe in some attempt to ease off the caffeine, I will suffer a terrible headache (the other half of the Faustian bargain between reward and addiction). When that happens, the best and fastest way to clear the headache is consuming caffeine. In this case, the behavior of drinking coffee is still being reinforced, but this time negatively.

A positive reinforcer makes a behavior more likely by giving you something you want in return for doing the behavior - a dose of caffeine, a friendly social interaction, a feeling of community. A negative reinforcer makes a behavior more likely by removing something unpleasant when you do the behavior - I drink the coffee whether I want it or not, because not doing so will lead to unpleasant side effects. In both cases, the behavior is being encouraged. Reinforced. Animals, including humans, are so deeply attuned to reinforcement that while we shape our behaviors and lives around it, we might not even consciously recognize the rewards.

Okay, but aren't we talking about machines? Machines do not have our intrinsic animal need for reward, but if we tell them they do, they can behave just as well. The way humans learn - taking action, seeing what works, adjusting strategy, and trying again - translates to machines quite naturally. Action, feedback, adaptation: this behavioral loop is the heart of Reinforcement Learning (RL). As one of the most intuitive and powerful branches of machine learning, RL mimics the way organisms learn: by interacting with the environment and adjusting behavior based on rewards and penalties.

At its core, RL isn't about labels or predictions - it's about decision making. An agent learns a policy, aka a strategy for choosing actions that will maximize some notion of cumulative reward over time. The context can be flying a glider, playing chess, controlling a robot, or balancing a portfolio - the structure remains remarkably consistent.
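Formally, "some notion of cumulative reward" is usually made precise as a discounted return. A standard formulation (this follows the common Sutton-and-Barto convention; the discount factor γ is an illustrative addition, not something defined in this post) is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1,$$

where rewards arriving sooner count for more. A γ near 0 makes the agent short-sighted; a γ near 1 makes it patient, valuing long-term payoff almost as much as immediate reward.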

The Basic Ingredients

Inspired by behavioral psychology and grounded in formal mathematics, RL formalizes the learning process through a few key components:

- Agent: the learner and decision maker
- Environment: the world the agent interacts with
- State: what the agent observes about its current situation
- Action: a choice available to the agent
- Reward: the feedback signal the environment returns after an action
- Policy: the agent's strategy for choosing actions given states

Put all of these together and you get a feedback loop:

1. Agent takes an action
2. Environment responds
3. Agent gets reward
4. Agent updates its strategy

In my coffee example above, here is how one run of that loop might go:

1. I go to a new cafe for my drink
2. The shop is nice, but more expensive than I'm used to
3. My coffee is good, but some of the value is reduced because I feel I paid too much
4. I will come back to this cafe if I don't have any other options

In a different coffee run:

1. I go to a cafe for my drink
2. The staff is friendly, the prices are good
3. I am pleased with my experience and my purchase
4. I will come again whenever I am able

The above examples emphasize that this loop, although seemingly simple, can be very sensitive to the environment and rewards. Policy doesn't have to be as simple as "Yes, do this" or "No, don't do this"; you can make your RL strategy as nuanced and responsive as you like, and many environmental factors can influence the rewards.
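To make the loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical - the cafes, their average "enjoyment" values, and the simple estimate-updating agent are illustrations of the loop, not a real library or a canonical algorithm:

```python
import random

class CafeEnvironment:
    """Toy environment: each action is a cafe, each reward a noisy experience."""
    def __init__(self):
        # Hypothetical average enjoyment of each option (unknown to the agent).
        self.mean_reward = {"new_cafe": 0.4, "friendly_cafe": 0.9, "kitchen": 0.6}

    def step(self, action):
        # Environment responds: the reward is the true mean plus some noise.
        return self.mean_reward[action] + random.gauss(0.0, 0.1)

class CoffeeAgent:
    def __init__(self, actions, step_size=0.1, explore_prob=0.1):
        self.estimates = {a: 0.0 for a in actions}  # learned value of each action
        self.step_size = step_size
        self.explore_prob = explore_prob

    def act(self):
        # Mostly pick the best-looking cafe, occasionally try something else.
        if random.random() < self.explore_prob:
            return random.choice(list(self.estimates))
        return max(self.estimates, key=self.estimates.get)

    def update(self, action, reward):
        # Nudge the estimate for this action toward the reward just received.
        self.estimates[action] += self.step_size * (reward - self.estimates[action])

env = CafeEnvironment()
agent = CoffeeAgent(actions=["new_cafe", "friendly_cafe", "kitchen"])
for _ in range(1000):             # agent takes an action...
    action = agent.act()
    reward = env.step(action)     # ...environment responds with a reward...
    agent.update(action, reward)  # ...and the agent updates its strategy
print(agent.estimates)            # the friendly cafe should end up rated highest
```

Run it a few times and the estimates settle near the true means; the occasional random choice is what keeps the agent from committing to the first decent cafe it finds - a preview of the exploration/exploitation tradeoff discussed below.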

Origins in Psychology, Evolution with Computing

As I mentioned above, my first appreciation for RL came from my time in psychology. RL has its origins in behavioral psychology and control theory. Techniques like temporal difference learning and dynamic programming set the stage for more computational approaches. In the 1990s and 2000s, RL really began to crystallize into a distinct subfield, thanks in part to the foundational work by Richard Sutton and Andrew Barto.

One of the earliest and clearest illustrations of RL’s power is the multi-armed bandit problem - something I explore in detail in another post. It’s a deceptively simple setup that models how agents deal with uncertainty and the exploration/exploitation tradeoff.

From Simple Bandits to Deep Reinforcement Learning

As computing power and data availability grew, RL evolved:

Tabular Methods

Early RL algorithms, such as Q-learning and SARSA, kept an explicit table with one value estimate for every state-action pair. Tabular methods are simple and well-understood, but they only scale to problems small enough to enumerate.
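The signature tabular algorithm is the Q-learning update. Below is a generic sketch with made-up states, actions, and hyperparameters - an illustration of the textbook method, not code from this post:

```python
from collections import defaultdict
import random

# Tabular Q-learning sketch: one learned value per (state, action) pair.
Q = defaultdict(float)
actions = ["left", "right"]
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate

def choose_action(state):
    # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the reward plus the discounted best next-state value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One hypothetical transition: in state 0, going "right" earned a reward of 1.
q_update(state=0, action="right", reward=1.0, next_state=1)
print(Q[(0, "right")])  # 0.1 after a single update, since Q starts at zero
```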

Policy Gradient Methods

Rather than learning values and deriving a policy from them, policy gradient methods (such as REINFORCE) parameterize the policy directly and nudge its parameters in the direction that makes rewarded actions more likely.
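Here is a minimal REINFORCE-style sketch, assuming a stateless softmax policy over three actions with made-up reward means (action 2 is best on average) - a toy instance of the policy gradient idea, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)   # one learnable preference per action (stateless, for brevity)
alpha = 0.05          # step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    # Hypothetical reward signal: noisy, with action 2 best on average.
    reward = rng.normal(loc=[0.1, 0.5, 1.0][action])
    # Gradient of log pi(action) for a softmax policy: one-hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += alpha * reward * grad_log_pi  # make rewarded actions more probable

print(softmax(theta))  # most probability mass should land on action 2
```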

Function Approximation and Deep RL

When state spaces grow too large for tables, values and policies can instead be represented by parameterized functions - first linear models, and eventually deep neural networks. Deep RL, popularized by successes like DQN and AlphaGo, is what lets agents learn directly from raw, high-dimensional inputs like pixels.
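Between tables and deep networks sits linear function approximation, which captures the core idea in a few lines: represent Q(s, a) as a weighted sum of features instead of a table entry. This is a generic semi-gradient sketch over hypothetical random features, for illustration only:

```python
import numpy as np

n_features = 8
w = np.zeros(n_features)   # weights replace the Q-table
alpha, gamma = 0.1, 0.99
actions = (0, 1)

def phi(state, action):
    # Hypothetical fixed features for a (state, action) pair; in deep RL,
    # a neural network would learn these features instead.
    rng = np.random.default_rng(abs(hash((state, action))) % (2**32))
    return rng.normal(size=n_features)

def q(state, action):
    return w @ phi(state, action)  # Q(s, a) = w . phi(s, a)

# Semi-gradient Q-learning update for one observed transition (s, a, r, s').
s, a, r, s_next = 0, 1, 1.0, 2
best_next = max(q(s_next, a2) for a2 in actions)
td_error = r + gamma * best_next - q(s, a)
w += alpha * td_error * phi(s, a)  # for linear Q, the gradient is just phi(s, a)
print(round(td_error, 3))  # 1.0 on the first update, since w starts at zero
```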

Reinforcement Learning in the Wild

RL now powers some of the most impressive feats in modern AI, for example:

- AlphaGo and AlphaZero, which defeated world champions at Go and chess
- Agents that learned to play Atari video games directly from raw pixels
- Robots that learn locomotion and manipulation through trial and error in simulation
- Fine-tuning large language models from human feedback (RLHF)

It’s not just about flash and competition. RL is increasingly important in real-world decision systems where outcomes unfold over time and uncertainty is high. RL could be the key to automating and optimizing processes that can help everyone live better and more sustainable lives.

Challenges and Caveats

RL is very powerful, but not without obstacles.

Sample Inefficiency. In research settings, agents often require millions of trials or interactions to learn. In complex environments with many moving parts, a single trial tells us very little, and much current work in deep RL is aimed at easing the burden of those millions of trials.

Exploration versus Exploitation. Without enough incentive to explore, agents can get stuck exploiting a suboptimal local strategy. A good RL system balances collecting known rewards (exploitation) with attempts to find new and better ones (exploration).

Reward Design. Bad incentives lead to bad behavior: an agent optimizes exactly the reward you specify, not the outcome you intended. Ethical and principled reward design matters.

Safety and Alignment. In high-stakes settings, unsafe exploration is unacceptable.

Interpretability. Understanding why an agent does what it does can be a challenge in the more advanced black-box systems of Deep RL.

Where We're Headed

The future of RL is very exciting, and there is plenty I'm looking forward to discussing further.

Perhaps most importantly: human-in-the-loop RL, where systems learn not just from raw data, but from preferences, corrections, and collaboration with human users!

Final Thoughts

Reinforcement learning is not just about solving puzzles or controlling environments - it's a window into how adaptive systems learn and thrive through trial and error. Like many fascinating leaps from animal behavior and decision-making to formalized computational processes and back again, teaching behavior to machines can teach us more about ourselves and about how that behavior interacts with our world. From bandits to AlphaGo, RL has evolved from a curiosity to a cornerstone of modern AI.