This repository contains the implementation of various multi-armed bandit algorithms and a dashboard for visualizing their performance. The goal is to compare the effectiveness of different algorithms in maximizing rewards and minimizing regret over time.
This project uses the branch v1.0-SKILL2025 for the official SKILL 2025 submission. This documentation and code refer to the corresponding branch of this repository.
The following algorithms are implemented, each with its own set of tuning parameters (where applicable):
- ETC (Explore-then-Commit): Explores all available arms for a fixed number of rounds before committing to the arm with the highest estimated reward.
  - Tuning Parameters: exploration_rounds
  - Scenarios: exploration_rounds: 10, 100, 1000, 10000, 100000
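A minimal sketch of Explore-then-Commit on Bernoulli arms; the function and parameter names here are illustrative, not this repository's API:

```python
import random

def explore_then_commit(probs, horizon, exploration_rounds, seed=0):
    """Pull every arm `exploration_rounds` times in round-robin order,
    then commit to the empirically best arm for the remaining rounds.
    `probs` holds the (unknown to the learner) Bernoulli success rates."""
    rng = random.Random(seed)
    k = len(probs)
    counts, sums, rewards = [0] * k, [0.0] * k, []
    for t in range(horizon):
        if t < exploration_rounds * k:
            arm = t % k                                # exploration phase
        else:                                          # commit phase
            arm = max(range(k), key=lambda a: sums[a] / counts[a])
        r = 1.0 if rng.random() < probs[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += r
        rewards.append(r)
    return rewards
```

A larger `exploration_rounds` lowers the chance of committing to a suboptimal arm at the cost of more exploration regret, which is exactly the trade-off the scenarios above sweep.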
- Epsilon-Greedy: Balances exploration and exploitation by choosing a random action with probability epsilon and the action with the highest estimated reward with probability (1 - epsilon).
  - Tuning Parameters: epsilon
  - Scenarios: epsilon: 0.5, 0.1, 0.01, 0.05, 0.005
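A hedged sketch of the epsilon-greedy loop on Bernoulli arms (illustrative names, not this repository's implementation):

```python
import random

def epsilon_greedy(probs, horizon, epsilon, seed=0):
    """With probability epsilon pick a uniformly random arm, otherwise
    pick the arm with the highest empirical mean reward so far."""
    rng = random.Random(seed)
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:      # explore (or untried arm)
            arm = rng.randrange(k)
        else:                                          # exploit
            arm = max(range(k), key=lambda a: sums[a] / counts[a])
        r = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total
```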
- UCB (Upper Confidence Bound): Selects the arm with the highest upper confidence bound to balance exploration and exploitation.
  - Tuning Parameters: None
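For reference, the classical UCB1 index (Auer et al., 2002) can be written as a one-liner; this is the textbook form, which the repository's implementation presumably follows:

```python
import math

def ucb1_index(mean, n_pulls, t):
    """UCB1 index: empirical mean plus exploration bonus sqrt(2 ln t / n).
    The bonus shrinks as an arm is pulled more often."""
    return mean + math.sqrt(2.0 * math.log(t) / n_pulls)
```

At each step the algorithm pulls the arm maximizing this index, so rarely-pulled arms get explored and well-performing arms get exploited.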
- UCB-Tuned: Adjusts the confidence bound by considering the variance of the rewards.
  - Tuning Parameters: None
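A sketch of the UCB-Tuned index in its standard form (Auer et al., 2002), which caps the per-arm variance term at 1/4, the maximum variance of a Bernoulli reward; details of the repository's implementation may differ:

```python
import math

def ucb_tuned_index(mean, var, n_pulls, t):
    """UCB-Tuned index: like UCB1, but the exploration bonus uses
    min(1/4, empirical variance + variance slack) instead of the constant 2."""
    v = var + math.sqrt(2.0 * math.log(t) / n_pulls)   # variance upper bound
    return mean + math.sqrt((math.log(t) / n_pulls) * min(0.25, v))
```

Because min(1/4, v) never exceeds 2, the tuned bonus is always at most the UCB1 bonus, so low-variance arms are explored less aggressively.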
- UCB-V: Incorporates variance estimates into the upper confidence bounds.
  - Tuning Parameters: theta, c, b
  - Scenarios: theta: 1, c: 1, b: 1
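The UCB-V index is commonly stated as in Audibert et al. (2009); the sketch below uses that form, with `theta` scaling the exploration function, `b` bounding the reward range, and `c` weighting the bias term. How exactly the repository wires in these parameters may differ:

```python
import math

def ucb_v_index(mean, var, n_pulls, t, theta=1.0, c=1.0, b=1.0):
    """UCB-V index: mean + sqrt(2 * var * zeta / n) + 3 * c * b * zeta / n,
    with exploration function zeta = theta * ln t (Bernstein-style bound)."""
    zeta = theta * math.log(t)
    return mean + math.sqrt(2.0 * var * zeta / n_pulls) + 3.0 * c * b * zeta / n_pulls
```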
- PAC-UCB: Guarantees, with high probability, that the regret stays close to that of the optimal policy.
  - Tuning Parameters: c, b, q, beta
  - Scenarios: c: 1, b: 1, q: 1.3, beta: 0.05
- UCB-Improved: Enhances UCB by successively eliminating apparently suboptimal arms in rounds.
  - Tuning Parameters: delta
  - Scenarios: delta: 1
- EUCBV (Efficient-UCB with Variance): Uses empirical variance estimates to adjust the upper confidence bounds.
  - Tuning Parameters: rho
  - Scenarios: rho: 0.5
The bandit model used in this repository is a multi-armed bandit problem with Bernoulli-distributed arms; each arm is assigned a fixed reward probability.
Each algorithm is run 100 times, and the results are stored in separate directories for different time steps. Additionally, there is a 'results_average' file for each algorithm, providing the average value at each time step across the 100 runs.
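The simulation setup described above can be sketched as follows; `simulate` and its policy signature are illustrative names, not this repository's API:

```python
import random

def simulate(policy, probs, horizon, seed=0):
    """Run one bandit trajectory on Bernoulli arms with success
    probabilities `probs`. `policy(counts, sums, t)` returns the arm to
    pull at step t. Returns the cumulative pseudo-regret after each step."""
    rng = random.Random(seed)
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k
    best, regret, trace = max(probs), 0.0, []
    for t in range(1, horizon + 1):
        arm = policy(counts, sums, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - probs[arm]        # pseudo-regret uses the true means
        trace.append(regret)
    return trace
```

Averaging such traces over many independent seeds yields the kind of per-timestep averages stored in the results_average files.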
Visualizations and dashboards were created using Plotly and Dash. The dashboard provides the following views:
- Average Total Reward Over Time: Displays how effectively each algorithm maximizes rewards over time.
- Average Regret Over Time: Shows how well each algorithm minimizes regret over time.
- Reward Distribution: A boxplot showing the distribution of zero and one rewards for each algorithm.
- Distribution of Total Regret at Timestep 100,000: A histogram of total regret values at timestep 100,000 across 100 iterations for a selected algorithm.
- Value-at-Risk (VaR) Function: Displays the VaR function for alpha values 0.01, 0.05, and 0.1, indicating the maximum potential loss at a given confidence level.
- Proportion of Suboptimal Arms Pulled: Shows the proportion of suboptimal arm selections compared to all selections up to each timestep.
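As an illustration of the VaR panel, the empirical VaR of a sample of total-regret values can be computed as below. This is one common convention (smallest loss exceeded with probability at most alpha); the dashboard's exact estimator may differ:

```python
import math

def value_at_risk(losses, alpha):
    """Empirical Value-at-Risk at level alpha: the (1 - alpha)-quantile of
    the loss sample, here the total regret across simulation runs."""
    s = sorted(losses)
    idx = max(0, math.ceil((1.0 - alpha) * len(s)) - 1)
    return s[min(idx, len(s) - 1)]
```

Smaller alpha means higher confidence, so the reported VaR is larger (or equal): the curve for alpha = 0.01 dominates the one for alpha = 0.1.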
First, you need to clone the repository from GitHub to your local machine. Open your terminal (or command prompt) and run the following command:
```bash
git clone https://github.com/eelisee/bandit_playground.git
cd bandit_playground
```

Once the repository is cloned, check out the Version 1.0 branch by running:

```bash
git checkout -b v1.0-SKILL2025 origin/v1.0-SKILL2025
```

It is strongly recommended to use a virtual environment to manage the project's dependencies. You can create and activate a virtual environment by running the following commands:
For macOS/Linux:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

For Windows:
```bash
python -m venv .venv
.venv\Scripts\activate
```

Once the virtual environment is activated, install the required Python packages by running:

```bash
pip install -r requirements.txt
```

This command installs all the dependencies listed in the requirements.txt file. Ensure that all packages install without errors.
Once the installation is complete, you can start the dashboard by running the following command:
```bash
python src/dashboard.py
```

After the command runs, your default web browser should automatically open the dashboard at this URL:

```
http://127.0.0.1:8050
```

If it doesn't open automatically, copy and paste this URL into your browser manually.
The code is designed for easy extensibility:
- Configuration Management: All simulation parameters and settings are centrally defined in the configuration file config.py, enabling quick adjustments without changing the core code.
- Adding New Algorithms: New multi-armed bandit algorithms can be integrated easily following a clear structure.
- Customizing Plots: Additional plots for analysis and visualization can be added with minimal changes.
- Flexible Scenario Definition: Different simulation scenarios can be defined via configurable arm distributions.
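As a purely hypothetical illustration of the kind of structure a new algorithm might follow (the actual base class, hooks, and registration mechanism in this repository may differ; see the documentation files below), an algorithm typically needs a way to select an arm and a way to record an observed reward:

```python
# Hypothetical sketch only -- not this repository's actual interface.
class GreedyBaseline:
    """Minimal algorithm shape: select_arm() picks an arm, update() records
    the observed reward. Plays each arm once, then always the best so far."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select_arm(self):
        if 0 in self.counts:                       # try every arm once first
            return self.counts.index(0)
        return max(range(len(self.counts)),
                   key=lambda a: self.sums[a] / self.counts[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```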
Detailed instructions for extending the simulation are available in the following documentation files:
- Instructions to change configuration
- Instructions to add new algorithms
- Instructions to add new plots
Elise Wolf – elise.marie.wolf@students.uni-mannheim.de – University of Mannheim
If you use this project, please cite:

```bibtex
@misc{wolf2025frameworkfairevaluationvarianceaware,
  title={A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms},
  author={Elise Wolf},
  year={2025},
  eprint={2510.27001},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.27001},
}
```
This project is licensed under the MIT License – see the LICENSE file for details.