
The k-Armed Bandit: Reinforcement Learning's Toy Soldier

by Veronica Scerra

A ground-up exploration of Reinforcement Learning methods and models.

I am not someone who likes to jump into the middle of anything. If I find something interesting, I have to go back to its origins and build from the bottom. Any addition to my mind palace has to be built from the ground up. I know - beginnings are messy, unpolished, and often too simplistic to have much cachet, especially in areas where the newest advancements are generating so much excitement, but I feel I can't fully understand where we are unless I know where we've been. To that end, my interest in reinforcement learning (RL) algorithms has to start with a thorough appreciation for k-armed bandits and how they can be used to illustrate the principles of RL.

The beauty of RL is that it offers a framework for decision making under uncertainty - where an agent learns to take actions that maximize cumulative reward through trial and error. A natural starting point for exploring RL is the k-armed bandit problem, a simplified scenario that captures the fundamental exploration-exploitation tradeoff at the heart of RL.

What is a k-Armed Bandit?

Imagine a row of k slot machines (also called "bandits" because they will take your money), each with an unknown and potentially different probability of paying out a reward (called a reward distribution). Your task is to play the machines one at a time to maximize your total reward over time. In this scenario, each choice you make involves a decision: Do you exploit the arm that has given the highest reward so far? Or do you explore less-tested arms to see if they might be better? This tension between exploration and exploitation is what makes the problem - and reinforcement learning - both intellectually rich and practically valuable.
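
To make the setup concrete, here is a minimal sketch of such a testbed in Python. The Gaussian reward distributions, the class name, and the use of NumPy are my own illustrative choices, not details from the notebook:

```python
import numpy as np

class KArmedBandit:
    """A simple k-armed bandit testbed: each arm pays out from its own
    fixed Gaussian reward distribution with an unknown mean."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # True expected reward of each arm, hidden from the agent.
        self.true_means = self.rng.normal(loc=0.0, scale=1.0, size=k)
        self.k = k

    def pull(self, arm):
        """Pull one arm and receive a noisy reward around its true mean."""
        return self.rng.normal(loc=self.true_means[arm], scale=1.0)
```

An experiment then just loops: pick an arm with one of the strategies below, call pull(arm), and feed the observed reward back into the agent's estimates.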

Sure, you like your dentist, but maybe that new dentist closer to your home might be better. Do you explore a new option for your next cleaning, possibly at a loss (but also possibly for the better), or do you stay where you are, even if it's not ideal, because you know what you're getting there? These are decisions we make all the time, and we can teach machines to make them with high efficiency to maximize reward.

Algorithms Explored

To start my series, I've implemented and compared several foundational algorithms for solving the k-armed bandit problem. Each one balances exploration and exploitation differently; minimal code sketches of the core selection and update rules follow the list:

  1. Greedy Algorithm
    • Always selects the action with the highest estimated value. Always.
    • Simple, but prone to getting stuck in local optima due to lack of exploration.
  2. Epsilon-Greedy
    • With probability ε, it chooses a random arm (exploration); otherwise it chooses the greedy option (exploitation).
    • Introduces a tunable exploration parameter to improve learning.
  3. Upper Confidence Bound (UCB)
    • Selects the arm with the highest upper confidence bound based on reward uncertainty.
    • More principled exploration that balances high mean rewards and low sampling frequency.
  4. Gradient Bandits
    • Uses a softmax policy to update preferences over actions using gradient ascent on expected reward.
    • Particularly useful in non-stationary environments (aka: environments where the rewards are not fixed) and for learning stochastic policies.
  5. Thompson Sampling
    • A Bayesian approach that maintains a probability distribution over the expected reward of each arm.
    • At each time step, it samples from these distributions and selects the arm with the highest sampled value.
    • Naturally balances exploration and exploitation: arms with more uncertainty are more likely to be chosen early on, while arms with higher observed rewards dominate over time.
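
To ground the first two strategies, here is a minimal sketch of an ε-greedy agent with incremental sample-average value estimates; setting epsilon to 0 recovers the purely greedy algorithm. The class and variable names are illustrative, not the notebook's:

```python
import numpy as np

class EpsilonGreedyAgent:
    """Epsilon-greedy action selection with incremental sample-average
    value estimates. epsilon = 0.0 gives the purely greedy algorithm."""

    def __init__(self, k, epsilon=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        self.q = np.zeros(k)       # estimated value of each arm
        self.counts = np.zeros(k)  # number of times each arm was pulled

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q)))  # explore
        return int(np.argmax(self.q))                   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: Q_new = Q_old + (R - Q_old) / n
        self.q[arm] += (reward - self.q[arm]) / self.counts[arm]
```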
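
A similarly hedged sketch of UCB action selection, using the common c * sqrt(ln t / N(a)) confidence bonus; the constant c and the rule of trying untried arms first are standard choices assumed here, not taken from the notebook:

```python
import numpy as np

def ucb_select(q, counts, t, c=2.0):
    """Pick the arm with the highest upper confidence bound:
    Q(a) + c * sqrt(ln(t) / N(a)), where q and counts are NumPy arrays
    and t is the 1-based time step. Untried arms are selected first."""
    untried = np.where(counts == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))
```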
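
And a sketch of the gradient-bandit preference update with a softmax policy; the step size α and the running-average reward baseline are assumptions on my part:

```python
import numpy as np

class GradientBanditAgent:
    """Gradient bandit: keeps preferences H(a), turns them into a softmax
    policy, and nudges them toward actions that beat a reward baseline."""

    def __init__(self, k, alpha=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.alpha = alpha
        self.h = np.zeros(k)     # action preferences
        self.avg_reward = 0.0    # running baseline
        self.t = 0

    def policy(self):
        exp_h = np.exp(self.h - self.h.max())  # numerically stable softmax
        return exp_h / exp_h.sum()

    def select_action(self):
        return int(self.rng.choice(len(self.h), p=self.policy()))

    def update(self, arm, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.policy()
        advantage = reward - self.avg_reward
        # Gradient ascent on expected reward:
        # chosen arm:  H(a) += alpha * advantage * (1 - pi(a))
        # other arms:  H(b) -= alpha * advantage * pi(b)
        self.h -= self.alpha * advantage * pi
        self.h[arm] += self.alpha * advantage
```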

Thompson Sampling is particularly effective in environments with sparse data or non-stationary rewards, and it is computationally efficient while performing near-optimally in many theoretical and empirical settings.
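
As one concrete (and simplified) way to implement this: when rewards are 0/1 successes, each arm can keep a Beta posterior over its payout rate. The Beta-Bernoulli model below is an illustrative assumption, not the only option:

```python
import numpy as np

class BernoulliThompsonAgent:
    """Thompson Sampling for Bernoulli-reward bandits: each arm keeps a
    Beta(successes + 1, failures + 1) posterior over its payout rate."""

    def __init__(self, k, seed=None):
        self.rng = np.random.default_rng(seed)
        self.successes = np.zeros(k)
        self.failures = np.zeros(k)

    def select_action(self):
        # Sample one plausible payout rate per arm from its posterior,
        # then act greedily with respect to those samples.
        samples = self.rng.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1.
        if reward > 0:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```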

Visual Insights and Experimental Notebook

The notebook that corresponds with this post (found here) illustrates these algorithms and how they compare in practice.

This work lays the foundation for more complex reinforcement learning problems like multi-step decision making, function approximation, and dynamic environments.

General Model Comparison

Algorithm          | Strengths                                         | Weaknesses
Greedy             | Simple, fast                                      | Gets stuck in suboptimal actions
Epsilon-Greedy     | Simple balance of exploration                     | Doesn't adapt exploration dynamically
UCB                | Uses confidence intervals for exploration         | Assumes rewards are stationary
Gradient Bandit    | Works well for continuous rewards                 | Sensitive to learning rate α
Thompson Sampling  | Near-optimally balances exploration/exploitation  | Requires Bayesian modeling

What's Next?

Having built a strong foundation with bandits, I plan to expand this series into the more complex problems mentioned above: multi-step decision making, function approximation, and dynamic environments.

This ongoing series aims to demystify reinforcement learning concepts through hands-on code, clear explanations, and visualizations. Learn with me!

View Repository