Add some test cases to ensure that all of the evaluation methods we support are actually giving us the outputs we’re expecting.
This could be done very simply, with a few short episodes of interaction simulated on a very simple MDP, both with and without skills. See the existing run_agent test cases for inspiration.