
The k-Armed Bandit: Reinforcement Learning's Toy Soldier

by Veronica Scerra

A ground-up exploration of Reinforcement Learning methods and models.

I am not someone who likes to jump into the middle of anything. If I find something interesting, I have to go back to its origins and build from the bottom. Any addition to my mind palace has to be built from the ground up. I know - beginnings are messy, unpolished, and often too simplistic to have much cachet, especially in areas where the newest advancements are generating so much excitement, but I feel I can't fully understand where we are unless I know where we've been. To that end, my interest in reinforcement learning (RL) algorithms has to start with a thorough appreciation for k-armed bandits and how they can be used to illustrate the principles of RL.

The beauty of RL is that it offers a framework for decision making under uncertainty - where an agent learns to take actions that maximize cumulative reward through trial and error. A natural starting point for exploring RL is the k-armed bandit problem, a simplified scenario that captures the fundamental exploration-exploitation tradeoff at the heart of RL.

What is a k-Armed Bandit?

Imagine a row of k slot machines (also called "bandits" because they will take your money), each with an unknown and potentially different probability of paying out a reward (called a reward distribution). Your task is to play the machines one at a time to maximize your total reward over time. In this scenario, each choice you make involves a decision: Do you exploit the arm that has given the highest reward so far? Or do you explore less-tested arms to see if they might be better? This tension between exploration and exploitation is what makes the problem - and reinforcement learning - both intellectually rich and practically valuable.
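
To make the setup concrete, here is a minimal sketch of such a testbed in Python. The Gaussian reward distributions, the class name, and the use of NumPy are my own illustrative choices, not details from the notebook:

```python
import numpy as np

class KArmedBandit:
    """A simple k-armed bandit testbed: each arm pays out from its own
    fixed Gaussian reward distribution with an unknown mean."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        # True expected reward of each arm, hidden from the agent.
        self.true_means = self.rng.normal(loc=0.0, scale=1.0, size=k)
        self.k = k

    def pull(self, arm):
        """Pull one arm and receive a noisy reward around its true mean."""
        return self.rng.normal(loc=self.true_means[arm], scale=1.0)
```

An experiment then just loops: pick an arm with one of the strategies below, call pull(arm), and feed the observed reward back into the agent's estimates.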

Sure, you like your dentist, but maybe that new dentist closer to your home might be better. Do you explore a new option for your next cleaning, possibly at a loss (but also possibly for the better), or do you stay where you are, even if it's not ideal, because you know what you're getting there? These are decisions we make all the time, and we can teach machines to make them with high efficiency to maximize reward.

Algorithms Explored

To start my series, I've implemented and compared several foundational algorithms for solving the k-armed bandit problem. Each one balances exploration and exploitation differently; minimal code sketches of the core selection and update rules follow the list:

  1. Greedy Algorithm
    • Always selects the action with the highest estimated value. Always.
    • Simple, but prone to getting stuck in local optima due to lack of exploration.
  2. Epsilon-Greedy
    • With probability ε, it chooses a random arm (exploration); otherwise it chooses the greedy option (exploitation).
    • Introduces a tunable exploration parameter to improve learning.
  3. Upper Confidence Bound (UCB)
    • Selects the arm with the highest upper confidence bound based on reward uncertainty.
    • More principled exploration that balances high mean rewards and low sampling frequency.
  4. Gradient Bandits
    • Uses a softmax policy to update preferences over actions using gradient ascent on expected reward.
    • Particularly useful in non-stationary environments (aka: environments where the rewards are not fixed) and for learning stochastic policies.
  5. Thompson Sampling
    • A Bayesian approach that maintains a probability distribution over the expected reward of each arm.
    • At each time step, it samples from these distributions and selects the arm with the highest sampled value.
    • Naturally balances exploration and exploitation: arms with more uncertainty are more likely to be chosen early on, while arms with higher observed rewards dominate over time.
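
To ground the first two strategies, here is a minimal sketch of an ε-greedy agent with incremental sample-average value estimates; setting epsilon to 0 recovers the purely greedy algorithm. The class and variable names are illustrative, not the notebook's:

```python
import numpy as np

class EpsilonGreedyAgent:
    """Epsilon-greedy action selection with incremental sample-average
    value estimates. epsilon = 0.0 gives the purely greedy algorithm."""

    def __init__(self, k, epsilon=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        self.q = np.zeros(k)       # estimated value of each arm
        self.counts = np.zeros(k)  # number of times each arm was pulled

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q)))  # explore
        return int(np.argmax(self.q))                   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: Q_new = Q_old + (R - Q_old) / n
        self.q[arm] += (reward - self.q[arm]) / self.counts[arm]
```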
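
A similarly hedged sketch of UCB action selection, using the common c * sqrt(ln t / N(a)) confidence bonus; the constant c and the rule of trying untried arms first are standard choices assumed here, not taken from the notebook:

```python
import numpy as np

def ucb_select(q, counts, t, c=2.0):
    """Pick the arm with the highest upper confidence bound:
    Q(a) + c * sqrt(ln(t) / N(a)), where q and counts are NumPy arrays
    and t is the 1-based time step. Untried arms are selected first."""
    untried = np.where(counts == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))
```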
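
And a sketch of the gradient-bandit preference update with a softmax policy; the step size α and the running-average reward baseline are assumptions on my part:

```python
import numpy as np

class GradientBanditAgent:
    """Gradient bandit: keeps preferences H(a), turns them into a softmax
    policy, and nudges them toward actions that beat a reward baseline."""

    def __init__(self, k, alpha=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.alpha = alpha
        self.h = np.zeros(k)     # action preferences
        self.avg_reward = 0.0    # running baseline
        self.t = 0

    def policy(self):
        exp_h = np.exp(self.h - self.h.max())  # numerically stable softmax
        return exp_h / exp_h.sum()

    def select_action(self):
        return int(self.rng.choice(len(self.h), p=self.policy()))

    def update(self, arm, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.policy()
        advantage = reward - self.avg_reward
        # Gradient ascent on expected reward:
        # chosen arm:  H(a) += alpha * advantage * (1 - pi(a))
        # other arms:  H(b) -= alpha * advantage * pi(b)
        self.h -= self.alpha * advantage * pi
        self.h[arm] += self.alpha * advantage
```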

Thompson Sampling is particularly effective in environments with sparse data or non-stationary rewards, and it is computationally efficient while performing near-optimally in many theoretical and empirical settings.
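
As one concrete (and simplified) way to implement this: when rewards are 0/1 successes, each arm can keep a Beta posterior over its payout rate. The Beta-Bernoulli model below is an illustrative assumption, not the only option:

```python
import numpy as np

class BernoulliThompsonAgent:
    """Thompson Sampling for Bernoulli-reward bandits: each arm keeps a
    Beta(successes + 1, failures + 1) posterior over its payout rate."""

    def __init__(self, k, seed=None):
        self.rng = np.random.default_rng(seed)
        self.successes = np.zeros(k)
        self.failures = np.zeros(k)

    def select_action(self):
        # Sample one plausible payout rate per arm from its posterior,
        # then act greedily with respect to those samples.
        samples = self.rng.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1.
        if reward > 0:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```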

Visual Insights and Experimental Notebook

The notebook that corresponds with this post (found here) illustrates these algorithms and how they compare in practice.

This work lays the foundation for more complex reinforcement learning problems like multi-step decision making, function approximation, and dynamic environments.

General Model Comparison

Algorithm          | Strengths                                         | Weaknesses
Greedy             | Simple, fast                                      | Gets stuck in suboptimal actions
Epsilon-Greedy     | Simple balance of exploration                     | Doesn't adapt exploration dynamically
UCB                | Uses confidence intervals for exploration         | Assumes rewards are stationary
Gradient Bandit    | Works well for continuous rewards                 | Sensitive to learning rate α
Thompson Sampling  | Near-optimally balances exploration/exploitation  | Requires Bayesian modeling

What's Next?

Having built a strong foundation with bandits, I plan to expand this series into the more complex problems mentioned above: multi-step decision making, function approximation, and dynamic environments.

This ongoing series aims to demystify reinforcement learning concepts through hands-on code, clear explanations, and visualizations. Learn with me!

View Repository