by Veronica Scerra
A ground-up exploration of Reinforcement Learning methods and models.
I am not someone who likes to jump into the middle of anything. If I find something interesting, I have to go back to its origins and build from the bottom. Any addition to my mind palace has to be built from the ground up. I know - beginnings are messy, unpolished, and often too simplistic to have much cachet, especially in areas where the newest advancements are generating so much excitement, but I feel like I can't fully understand where we are unless I know where we've been. To that end, my interest in reinforcement learning (RL) algorithms has to start with a thorough appreciation for k-armed bandits and how they can be used to illustrate the principles of RL.
The beauty of RL is that it offers a framework for decision making under uncertainty - where an agent learns to take actions that maximize cumulative reward through trial and error. A natural starting point for exploring RL is the k-armed bandit problem, a simplified scenario that captures the fundamental exploration-exploitation tradeoff at the heart of RL.
Imagine a row of k slot machines (also called "bandits" because they will take your money), each with its own unknown, and potentially different, probability of paying out a reward (its reward distribution). Your task is to play the machines one at a time to maximize your total reward over time. In this scenario, each choice you make involves a decision: Do you exploit the arm that has given the highest reward so far? Or do you explore less-tested arms to see if they might be better? This tension between exploration and exploitation is what makes the problem - and reinforcement learning - both intellectually rich and practically valuable.
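To make the setup concrete, here is a minimal sketch of a bandit testbed in Python, assuming the classic formulation where each arm's rewards are drawn from its own Gaussian distribution. The class and variable names (`KArmedBandit`, `true_means`) are my own illustrative choices, not code from the accompanying notebook.

```python
import numpy as np

class KArmedBandit:
    """A minimal k-armed bandit testbed (illustrative sketch).

    Each arm's true mean reward is drawn once from a standard normal
    distribution; pulling an arm returns that mean plus unit Gaussian noise.
    """

    def __init__(self, k=10, seed=None):
        self.k = k
        self.rng = np.random.default_rng(seed)
        self.true_means = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def pull(self, arm):
        """Return a noisy reward for the chosen arm (0-indexed)."""
        return self.rng.normal(loc=self.true_means[arm], scale=1.0)


# Example: pull each arm once and look at the (noisy) rewards.
bandit = KArmedBandit(k=10, seed=42)
rewards = [bandit.pull(a) for a in range(bandit.k)]
print(rewards)
```

The agent never sees `true_means`; it only observes the noisy rewards, which is exactly why it has to trade off exploring uncertain arms against exploiting the apparent best one.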
Sure, you like your dentist, but maybe this new dentist closer to your home might be better. Do you explore a new option for your next cleaning, possibly at a loss (but also possibly for the better), or do you stay where you are, even if it's not ideal, because you know what you're getting there? These are decisions we make all the time, and we can teach machines to make them with high efficiency to maximize reward.
To start my series, I've implemented and compared several foundational algorithms for solving the k-armed bandit problem. Each one balances exploration and exploitation differently:
- Greedy: always plays the arm with the highest estimated value so far; it never deliberately explores.
- Epsilon-Greedy: with a small probability ε, it chooses a random arm (exploration); otherwise it chooses the greedy option (exploitation).
- Upper Confidence Bound (UCB): adds a confidence-based bonus to each arm's estimated value, so arms that have been tried less often are explored more.
- Gradient Bandit: learns a numerical preference for each arm and selects arms via a softmax over those preferences, nudging preferences up or down depending on how each reward compares to a running baseline.
- Thompson Sampling: maintains a posterior distribution over each arm's reward and plays the arm with the best sampled value. It is particularly effective in environments with sparse data or non-stationary rewards, and is computationally efficient while performing near-optimally in many theoretical and empirical settings.
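Here is a rough sketch of how two of these selection rules can look in code. The epsilon-greedy rule follows the description above; the Thompson Sampling example assumes Bernoulli (win/lose) rewards with Beta posteriors, which is one common setup rather than the only one. The function names and the uniform Beta(1, 1) prior are illustrative assumptions, not code from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random arm (explore);
    otherwise pick the arm with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    return int(np.argmax(q_estimates))

def thompson_action(successes, failures):
    """Thompson Sampling for Bernoulli rewards: sample one value per arm
    from its Beta posterior and play the arm with the largest sample."""
    samples = rng.beta(successes + 1, failures + 1)  # Beta(1, 1) uniform prior
    return int(np.argmax(samples))

# Example usage with 4 arms:
q = np.zeros(4)                    # value estimates for epsilon-greedy
wins = np.array([3, 10, 1, 0])     # observed successes per arm
losses = np.array([7, 5, 2, 1])    # observed failures per arm
print(epsilon_greedy_action(q), thompson_action(wins, losses))
```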
The notebook that corresponds with this post (found here) illustrates each of these algorithms and compares their performance on the same bandit testbed.
This work lays the foundation for more complex reinforcement learning problems like multi-step decision making, function approximation, and dynamic environments.
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| Greedy | Simple, fast | Gets stuck in suboptimal actions |
| Epsilon-Greedy | Simple balance of exploration | Doesn't adapt exploration dynamically |
| UCB | Uses confidence intervals for exploration | Assumes rewards are stationary |
| Gradient Bandit | Works well for continuous rewards | Sensitive to learning rate α |
| Thompson Sampling | Optimally balances exploration/exploitation | Requires Bayesian modeling |
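For completeness, here is a hedged sketch of the two remaining strategies from the table: a UCB rule that adds a confidence bonus to each arm's estimate, and the gradient bandit preference update with its step size α. The exploration constant c, the running-average baseline, and all names are assumptions made for illustration, not the notebook's exact implementation.

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """UCB: add a confidence bonus that shrinks as an arm is sampled more.
    Untried arms are played first so every arm gets at least one pull."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmax(counts == 0))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_estimates) + bonus))

def gradient_bandit_step(preferences, action, reward, baseline, alpha=0.1):
    """Gradient bandit: shift preferences toward actions whose reward beats
    the running-average baseline, scaled by the step size alpha."""
    probs = np.exp(preferences - preferences.max())
    probs /= probs.sum()                      # softmax action probabilities
    one_hot = np.zeros_like(preferences)
    one_hot[action] = 1.0
    return preferences + alpha * (reward - baseline) * (one_hot - probs)

# Example: one UCB choice and one gradient update for a 4-armed bandit.
q, n = np.array([0.2, 0.5, 0.1, 0.0]), np.array([3, 5, 2, 1])
a = ucb_action(q, n, t=int(sum(n)) + 1)
prefs = gradient_bandit_step(np.zeros(4), action=a, reward=1.0, baseline=0.4)
print(a, prefs)
```

The confidence bonus is what makes UCB sensitive to the stationarity assumption noted in the table: the bonus shrinks permanently as counts grow, so a reward distribution that drifts later on is easy to miss.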
Having built a strong foundation with bandits, I plan to expand this series into more complex territory, including multi-step decision making, function approximation, and dynamic environments.
This ongoing series aims to demystify reinforcement learning concepts through hands-on code, clear explanations, and visualizations. Learn with me!