In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no practical methods exist for determining high-confidence policy performance bounds in the inverse reinforcement learning setting---where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the α-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function. We evaluate our proposed bound on both a standard grid navigation task and a simulated driving task and achieve tighter and more accurate bounds than a feature count-based baseline. We also give examples of how our proposed bound can be utilized to perform risk-aware policy selection and risk-aware policy improvement. Because our proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
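In short, the bound is computed from posterior samples of the expert's reward obtained via Bayesian IRL: for each sampled reward we measure the expected value difference (EVD) between the optimal policy under that reward and the evaluation policy, then take an upper quantile (Value-at-Risk) of those differences. The snippet below is a minimal sketch of that final step, assuming you already have EVD samples from an MCMC run; the function name and toy numbers are illustrative and not part of this repository.

```python
import numpy as np

def var_evd_bound(evd_samples, alpha=0.95):
    """Empirical alpha-quantile (Value-at-Risk) of expected value difference
    (EVD) samples. A simplified stand-in for the bound reported by the
    experiment code in this repository."""
    return float(np.quantile(np.asarray(evd_samples), alpha))

# Hypothetical usage: evd_samples[i] = V*(R_i) - V^{pi_eval}(R_i), where R_i is
# the i-th reward sampled from the Bayesian IRL posterior.
evd_samples = [0.12, 0.05, 0.31, 0.08, 0.22]   # toy numbers, not real output
print(var_evd_bound(evd_samples, alpha=0.95))
```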
Follow the instructions below to reproduce the results from our AAAI 2018 paper and our AAAI 2017 Fall Symposium paper. If you use this code, please cite:
```bibtex
@inproceedings{brown2018probabilistic,
  author = {Brown, Daniel S. and Niekum, Scott},
  title = {Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year = {2018},
  url = {https://arxiv.org/abs/1707.00724}
}

@inproceedings{brown2017toward,
  author = {Brown, Daniel S. and Niekum, Scott},
  title = {Toward Probabilistic Safety Bounds for Robot Learning from Demonstration},
  booktitle = {AAAI Fall Symposium: Artificial Intelligence for Human-Robot Interaction},
  year = {2017},
  url = {https://aaai.org/ocs/index.php/FSS/FSS17/paper/view/16023/15282}
}
```
- Matplotlib (for generating figures): https://matplotlib.org/users/installing.html
- SDL (only needed for the car driving experiments): `sudo apt-get install libsdl-image1.2-dev libsdl1.2-dev libsdl-ttf2.0-dev`
- Make a build directory: `mkdir build`
- Make a data directory to hold results: `mkdir data`
Infinite Horizon GridWorld (Figure 2 in AAAI 2018 paper)
- Use `make gridworld_basic_exp` to build the experiment.
- Execute `./gridworld_basic_exp` to run. Data will be output to `./data/gridworld`.
- The experiment will take some time to run since it performs 200 replicates for each number of demonstrations. Experiment parameters can be set in `src/gridWorldBasicExperiment.cpp`.
- Once the experiment has finished, run `python scripts/generateGridWorldBasicPlots.py` to generate the figures used in the paper. You should get something similar to the following two plots.
Sensitivity to Confidence Parameter (Figure 3 in AAAI 2018 paper)
- Use `make gridworld_noisydemo_exp` to build the experiment.
- Execute `./gridworld_noisydemo_exp` to run. Data will be output to `./data/gridworld_noisydemo_exp/`.
- The experiment will take some time to run since it performs 200 replicates for each number of demonstrations. Experiment parameters can be set in `src/gridWorldNoisyDemoExperiment.cpp`.
- Once the experiment has finished, run `python scripts/generateNoisyDemoPlots.py` to generate the figures used in the paper. You should get something similar to the following two plots.
- Note that the bounds when c = 0 are different from those shown in the paper. We are working on determining the reason for this discrepancy.
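For context on the c = 0 case: in Bayesian IRL the confidence parameter c acts as an inverse temperature in the softmax demonstration likelihood, so c = 0 makes every action equally likely under any reward and the posterior collapses to the prior. The snippet below is a minimal sketch of that likelihood for a single state-action pair, assuming you already have optimal Q-values for a sampled reward; it is not the repository's implementation.

```python
import numpy as np

def demo_log_likelihood(q_values, action, c=1.0):
    """Log P(action | state, R) under the Boltzmann model
    P(a | s, R) proportional to exp(c * Q*(s, a; R)) used in Bayesian IRL.
    q_values: 1-D array of optimal Q-values for every action in this state."""
    logits = c * np.asarray(q_values, dtype=float)
    # numerically stable log-softmax evaluated at the demonstrated action
    return logits[action] - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))

print(demo_log_likelihood([1.0, 0.5, -0.2], action=0, c=2.0))
print(demo_log_likelihood([1.0, 0.5, -0.2], action=0, c=0.0))  # = log(1/3): uniform when c = 0
```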
Comparison with theoretical bounds (Table 1 in AAAI 2018 paper)
- Use `make gridworld_projection_exp` to build the experiment.
- Execute `./gridworld_projection_exp` to run. Data will be output to `./data/abbeel_projection/`.
- The experiment will take some time to run since it performs 200 replicates for each number of demonstrations. Experiment parameters can be set in `src/gridWorldProjectionEvalExperiment.cpp`.
- Once the experiment has finished, run `python scripts/generateProjectionEvalData.py` to generate the data used in the paper.
- We reran the experiment from our paper and got the following results (slightly different from the paper due to random seeding):
| Bound | 1 demo | 5 demos | 9 demos | 23052 demos | Ave Accuracy |
|---|---|---|---|---|---|
| 0.95-VaR EVD Bound | 0.9392 | 0.2570 | 0.1370 | - | 0.98 |
| 0.99-VaR EVD Bound | 1.1448 | 0.2972 | 0.1575 | - | 1.0 |
| Syed and Schapire 2008 | 142.59 | 63.77 | 47.53 | 0.9392 | 1.0 |
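For reference, the "Ave Accuracy" column reports how often a bound actually held across the replicates, i.e., the fraction of runs in which the reported bound was at least the true EVD of the evaluation policy under the ground-truth reward. A minimal sketch of that bookkeeping is shown below; the arrays contain hypothetical toy values, and the real experiment logic lives in `src/gridWorldProjectionEvalExperiment.cpp`.

```python
import numpy as np

# Hypothetical per-replicate results: the bound each method reported and the
# true EVD of the evaluation policy under the ground-truth reward.
bounds   = np.array([0.93, 0.41, 0.27, 0.88])   # toy values
true_evd = np.array([0.50, 0.35, 0.30, 0.60])   # toy values

accuracy  = np.mean(bounds >= true_evd)  # fraction of replicates where the bound held
avg_bound = bounds.mean()                # average bound value (lower is tighter)
print(f"accuracy = {accuracy:.2f}, average bound = {avg_bound:.3f}")
```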
Car Driving Experiment

- Use `make driving_experiment` to build the experiment.
- Execute `./driving_experiment right_safe`, `./driving_experiment on_road`, and `./driving_experiment nasty` to run the experiments and output results to `./data/driving/`.
- Once all experiments have finished, run `python scripts/calculateDrivingRankings.py` to output the results.
- To calculate the average number of collisions, use `make driving_ccounts`, then run `./driving_ccounts [POLICY]` where `[POLICY]` can be `right_safe`, `on_road`, or `nasty`. You should get the collision counts shown in the Collisions column of the table below.
- You should get results similar to the following:
| Eval Policy | Collisions | WFCB bound | VaR 95 bound |
|---|---|---|---|
| right-safe | 0 | 5.52 | 0.85 |
| on-road | 13.65 | 1.93 | 1.09 |
| nasty | 42.75 | 4.11 | 2.44 |
- GIFs of the different policies are shown below.
From left to right and top to bottom: demonstration, right-safe, on-road, and nasty policies
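The table above is an example of risk-aware policy selection: ranking candidate policies by their 0.95-VaR bound matches the ordering by actual collisions, whereas the worst-case feature count bound (WFCB) ranks the on-road policy ahead of the safer right-safe policy. The toy snippet below just applies that selection rule to the numbers in the table; the dictionaries are illustrative, not part of the repository.

```python
# Rank candidate evaluation policies by their bound (an upper bound on how much
# worse than the expert's optimal policy they could be, so lower is safer).
var95_bounds = {"right_safe": 0.85, "on_road": 1.09, "nasty": 2.44}
wfcb_bounds  = {"right_safe": 5.52, "on_road": 1.93, "nasty": 4.11}

best_by_var95 = min(var95_bounds, key=var95_bounds.get)  # -> "right_safe" (0 collisions)
best_by_wfcb  = min(wfcb_bounds,  key=wfcb_bounds.get)   # -> "on_road" (13.65 collisions)
print(best_by_var95, best_by_wfcb)
```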
Policy Improvement (Figure 4 in AAAI 2018 paper)
- Use `make improvement_exp` to build the experiment.
- Execute `./improvement_exp` to run.
- The minimum VaR policy will be printed to the terminal.
Demonstration Sufficiency (Figure 4 in AAAI 2017 Fall Symposium Paper)
- Use `make demo_sufficiency_exp` to build the experiment.
- Execute `./demo_sufficiency_exp` to run. Data will be output to `./data/demo_sufficiency/`.
- Once the experiment has finished, run `python scripts/generateDemoSufficiencyPlot.py` to generate plot 4(b).
- You should get the following figure.
- Note that given a non-zero safety threshold on Value-at-Risk, say ε = 0.01, the agent would be able to report that it had learned the given task after only two demonstrations, whereas using feature counts alone suggests that three demonstrations are needed.
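The note above can be read as a simple stopping rule for demonstration sufficiency: after each new demonstration, recompute the VaR bound and stop requesting demonstrations once it falls below the safety threshold ε. A minimal sketch of that loop is shown below; `compute_var_bound` is a hypothetical stand-in for running the repository's Bayesian IRL pipeline on the demonstrations seen so far.

```python
def demonstrations_needed(demos, compute_var_bound, epsilon=0.01):
    """Return the number of demonstrations after which the VaR bound on the
    learned policy's EVD first drops below the safety threshold epsilon.
    compute_var_bound(demos_so_far) is a hypothetical callable that runs
    Bayesian IRL on the demonstrations seen so far and returns the bound."""
    seen = []
    for i, demo in enumerate(demos, start=1):
        seen.append(demo)
        if compute_var_bound(seen) < epsilon:
            return i          # e.g., 2 in the experiment above
    return None               # the bound never fell below epsilon
```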
Car Simulator

- To experiment with the car simulator, you can run Q-learning on a desired reward function using `src/drivingTester.cpp`.
- Compile using `make driving_test` and run `./driving_test`; you will see a simulation of the car driving with the learned policy.
- To try other rewards, simply edit `double featureWeights[]` in `src/drivingTester.cpp`.