This is the last of the three repositories accompanying the paper Induced Model Matching: Restricted Models Help Train Full-Featured Models (NeurIPS 2024):
@inproceedings{muneeb2024induced,
title = {Induced Model Matching: Restricted Models Help Train Full-Featured Models},
author = {Usama Muneeb and Mesrob I Ohannessian},
booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year = {2024},
url = {https://openreview.net/forum?id=iW0wXE0VyR}
}

This repository serves as a simple example of how POMDP policies can be used to improve the training of MDP policies via Induced Model Matching (IMM).
Other repositories: IMM in Logistic Regression | IMM in Language Modeling
Note
The MDP
We first define the MDP 5-tuple (S, A, T, R, γ): the state space, the action space, the transition function, the reward function, and the discount factor.
The POMDP
The POMDP additionally requires definition of the observation space and the observation function, which maps (hidden) states to observations.
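As an illustration of the two tuples (names and dynamics here are hypothetical, not the repository's actual environment), a tiny grid world might look like:

```python
# Illustrative sketch of an MDP (S, A, T, R, gamma) plus the POMDP
# additions (observation space and observation function O).
# This is NOT the repository's simplegrid environment, just an example.

S = [(r, c) for r in range(3) for c in range(3)]   # states: 3x3 grid cells
A = ["up", "down", "left", "right"]                # actions
gamma = 0.95                                       # discount factor

def T(s, a):
    """Deterministic transition: move one cell, clipped to the grid."""
    dr, dc = {"up": (-1, 0), "down": (1, 0),
              "left": (0, -1), "right": (0, 1)}[a]
    r, c = s
    return (min(max(r + dr, 0), 2), min(max(c + dc, 0), 2))

def R(s, a):
    """Reward: +1 for stepping onto the corner goal cell, else 0."""
    return 1.0 if T(s, a) == (2, 2) else 0.0

# POMDP additions: the observation space and observation function.
Omega = ["wall", "open"]

def O(s):
    """Aliased observation: only whether the agent touches a border."""
    r, c = s
    return "wall" if r in (0, 2) or c in (0, 2) else "open"
```

The observation function deliberately aliases many states to the same observation; that information loss is what makes the restricted (POMDP) model "restricted".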
main.py (if run with all the default options) will train a policy without any restricted model information for 200 epochs. It will then evaluate the policy using 10 rollouts. To replicate each of the reported curves in the paper, this file provides options that can be set accordingly. We derive main.py from the reward-to-go formulation (the 2_rtg_pg.py file) in the OpenAI Spinning Up documentation and add the IMM component. The file is MIT licensed and a copy has been included in the openai folder for reference.
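The reward-to-go formulation that main.py builds on weights the log-probability of each action by the sum of rewards from that step onward (this follows the logic of Spinning Up's 2_rtg_pg.py; the helper below is a plain-Python restatement, not the file itself):

```python
# Reward-to-go: the policy-gradient weight for the action at step t is
# the sum of rewards collected from t to the end of the episode.
def reward_to_go(rews):
    n = len(rews)
    rtgs = [0.0] * n
    for i in reversed(range(n)):
        rtgs[i] = rews[i] + (rtgs[i + 1] if i + 1 < n else 0.0)
    return rtgs

# e.g. rewards [1, 0, 2] -> weights [3, 2, 2]
```

IMM then adds a secondary term to this loss; the key point is that reward-to-go only changes the weights, not the overall policy-gradient structure.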
Since the plots require 30 Monte Carlo training runs for each configuration, and running them sequentially is time consuming, we have provided a run_all.sh BASH script that parallelizes these 30 runs using multiprocessing. See the next section for details on running run_all.sh.
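Conceptually, run_all.sh fans the 30 runs out over a worker pool; a hypothetical Python analogue (not the script itself, and one_run is a stand-in for invoking main.py) would be:

```python
# Hypothetical analogue of run_all.sh: farm 30 Monte Carlo training
# runs out over PARALLEL_JOBS workers instead of running sequentially.
from multiprocessing import Pool

PARALLEL_JOBS = 8  # tune to your core count, as with run_all.sh

def one_run(seed):
    # Stand-in for one training run; the real script launches main.py.
    return f"run with seed {seed} done"

if __name__ == "__main__":
    with Pool(PARALLEL_JOBS) as pool:
        results = pool.map(one_run, range(30))  # the 30 MC runs
    print(len(results))
```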
Alternatively, if you simply want to generate plots from cached files, you can skip directly to the plotting section below (cached CSV files for 300 runs have also been provided in this repository).
Important
Secondary Objective Coefficient
One of the --lambda_ratio or --lambda_param parameters can be used to set the coefficient for the secondary objective (i.e. IMM or noising). --lambda_param sets the coefficient directly, while --lambda_ratio sets it relative to the primary objective, in which case train_and_test.py will determine the actual coefficient.
Overall Objective
The overall objective function minimized is the policy gradient objective plus the secondary objective weighted by this coefficient.
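A heavily hedged sketch of this combination (the ratio-to-coefficient conversion below is an assumption for illustration, not train_and_test.py's actual logic):

```python
# Sketch only: combine the policy-gradient loss with the secondary
# (IMM or noising) loss. The lambda_ratio interpretation here is an
# ASSUMPTION, not the repository's exact behavior.
def resolve_lambda(lambda_param=None, lambda_ratio=None):
    if lambda_param is not None:
        return lambda_param                      # coefficient set directly
    return lambda_ratio / (1.0 - lambda_ratio)   # assumed ratio form

def overall_objective(pg_loss, secondary_loss, lam):
    # total loss = policy gradient objective + lambda * secondary objective
    return pg_loss + lam * secondary_loss

lam = resolve_lambda(lambda_param=0.5)
loss = overall_objective(pg_loss=1.2, secondary_loss=0.4, lam=lam)
```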
We have also provided a cached version of the alpha vectors, which you can use directly:

```
mv alphas_cached.h5 alphas.h5
```

If you want to generate the alpha vectors yourself, first install the required Julia packages:

```
julia install_packages.jl
```

To obtain the accurate restricted model, we model the POMDP and solve it via FIB to obtain the alpha vectors:

```
julia simplegrid_pomdp.jl
```

We recommend running the experiments in a CPU-only PyTorch environment (if the GPU version is used, CUDA will need to be initialized repeatedly, eventually slowing things down). To this end, the official PyTorch Docker container (pytorch/pytorch) is recommended. The following additional packages will also be required:
```
pip install gymnasium h5py
```

To maximally utilize multiprocessing, you are encouraged to edit PARALLEL_JOBS inside run_all.sh, as per the provided instructions, before calling it as:
```
./run_all.sh
```

The results can be visualized by running python plot_main.py.
By default, the cached CSVs in csv_cached directory will be used. This can be overridden by adding an extra --csv_dir csv flag.
For completeness, we will also demonstrate how we obtain the plots in the paper that we use to create the hardcoded rules for automatically choosing the coefficient:

```
./schedulers/tune_lambda.sh
```

To visualize the generated CSVs:
```
python plot_lambda_search.py
```

Again, by default, the cached CSVs in the csv_cached/lambda_search directory will be used. This can be overridden by adding an extra --csv_dir csv/lambda_search flag.
We experimentally determined the same rule to also apply in the case of softmaxed POMDP policies (i.e. with an additional --pomdp_temp 0.5 or --pomdp_temp 1.0 argument). The rule has been integrated into main.py and will be applied if --lambda_ratio -1 is specified, which automatically sets the coefficient.
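For reference, softmaxing a POMDP policy with a temperature can be sketched as follows (illustrative only; the repository applies the temperature internally via --pomdp_temp):

```python
# Temperature softmax over per-action values: instead of acting
# greedily, sample from softmax(values / temp). Lower temperatures
# give sharper (more greedy) distributions; higher ones are softer.
import math

def softmax_policy(action_values, temp):
    exps = [math.exp(v / temp) for v in action_values]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_policy([1.0, 2.0, 0.5], temp=0.5)
```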