fixed bug with policy_gradient and skeleton #6

Open · wants to merge 9 commits into master
153 changes: 153 additions & 0 deletions pyrl/agents/README.md
@@ -2,3 +2,156 @@ pyrl.agents
=========

Reinforcement Learning agents implemented in Python using the RLGlue framework.

The following sections describe the algorithms implemented in the library and provide some useful references. Different bases can be used with the linear function approximators (check each agent's specific options).
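
For a rough illustration of what a basis does (a hypothetical sketch, not the library's own API), the snippet below maps a bounded continuous state to Fourier features that a linear approximator can then weight:

```python
import itertools
import numpy as np

def fourier_features(state, order=2):
    """Map a state in [0, 1]^d to Fourier basis features cos(pi * c . s)."""
    state = np.asarray(state, dtype=float)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs.dot(state))

# A linear approximator then scores actions as weights[a].dot(features).
features = fourier_features([0.2, 0.7])   # (order + 1)^d = 9 features here
```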

---
### skeleton\_agent.py
Base class for the agents. Not meant to be used directly (it always picks actions at random).

---
### sarsa\_lambda\_ann.py
Implementation of SARSA with eligibility traces, using a neural network to approximate Q. The main reference for this algorithm is:

Rummery, G. A. and Niranjan, M. (1994). _On-line Q-learning using connectionist systems_. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This method uses a single network to estimate Q. The number of inputs in the network is the size of the feature space; the number of outputs is the number of possible discrete actions. In contrast, the original paper proposed using a *separate* network for each action.

The original paper also suggests decreasing the exploration rate as the agent learns more about the environment; this implementation instead appears to use a constant exploration rate (epsilon).
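
As a minimal sketch of the layout described above (illustrative only; the sizes and names are made up and this is not the file's actual network code):

```python
import numpy as np

# Hypothetical sizes, for illustration only.
num_features, num_hidden, num_actions = 8, 16, 4

rng = np.random.RandomState(0)
W1 = 0.1 * rng.randn(num_hidden, num_features)
W2 = 0.1 * rng.randn(num_actions, num_hidden)

def q_values(state_features):
    """One forward pass: a single network returns a Q-value for every action."""
    hidden = np.tanh(W1.dot(state_features))
    return W2.dot(hidden)                      # shape (num_actions,)

greedy_action = int(q_values(rng.rand(num_features)).argmax())
```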

---
### sarsa\_lambda.py
Implementation of SARSA with eligibility traces, using a linear function approximator for Q.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent is similar to sarsa\_lambda\_ann.py but uses a simpler (linear) approximator for Q.
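
A minimal sketch of the core SARSA(lambda) update with a linear approximator, assuming one weight and trace row per action and accumulating traces (illustrative, not the file's exact code):

```python
import numpy as np

def sarsa_lambda_step(weights, traces, phi, a, reward, phi_next, a_next,
                      alpha=0.01, gamma=1.0, lam=0.7):
    """One SARSA(lambda) update; weights and traces have shape (num_actions, num_features)."""
    delta = reward + gamma * weights[a_next].dot(phi_next) - weights[a].dot(phi)
    traces *= gamma * lam              # decay every trace
    traces[a] += phi                   # accumulate the trace of the taken action
    weights += alpha * delta * traces  # update all weights along the traces
    return weights, traces
```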

---
### qlearning.py
Implementation of Q-Learning with a linear function approximator. Unlike SARSA, Q-Learning updates its estimate of Q using the maximizing (greedy) action at the next state rather than the action actually taken, and acts greedily with respect to its estimates.

The main reference for Q-Learning is:

C. J. Watkins. [_Learning from Delayed Rewards_](https://www.cs.rhul.ac.uk/home/chrisw/new_thesis.pdf). PhD thesis, Cambridge University, 1989.

A description of Q-Learning with linear function approximation can be found in:

Francisco S. Melo and M. Isabel Ribeiro, [_Q-learning with linear function approximation_](http://gaips.inesc-id.pt/~fmelo/pub/melo07tr-b.pdf). Technical Report RT-602-07, Instituto de Sistemas e Robótica, Pólo de Lisboa.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent inherits sarsa\_lambda.py and re-implements the agent_step and update functions.
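
A minimal sketch of the corresponding update (eligibility traces omitted for brevity); the only substantive difference from the SARSA sketch above is that the target uses the maximizing action at the next state:

```python
import numpy as np

def q_learning_step(weights, phi, a, reward, phi_next, alpha=0.01, gamma=0.95):
    """One Q-Learning update with a linear approximator."""
    td_target = reward + gamma * weights.dot(phi_next).max()   # max over next actions
    delta = td_target - weights[a].dot(phi)
    weights[a] += alpha * delta * phi
    return weights
```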

---
### delayed_qlearning.py
Implementation of [_PAC Model-Free Reinforcement Learning_](http://www.autonlab.org/icml_documents/camera-ready/111_PAC_Model_free_Reinf.pdf) by Alexander Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael Littman (2006).

The standard Q-Learning agent changes its Q-value estimates on every time step. Delayed Q-Learning instead waits for _m_ sample updates before making any change (_m_ is a parameter of the algorithm). According to the above paper, _"this variation has an averaging effect that mitigates some of the effects of randomness"_ and keeps the estimates optimistic: _"Since the action-selection strategy is greedy, the Delayed Q-Learning agent will tend to choose overly optimistic actions, therefore achieving direct exploration when necessary"_.
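
A stripped-down, tabular sketch of that batching idea (it omits the paper's learn-flag and timestamp bookkeeping, and every name here is invented for illustration):

```python
from collections import defaultdict

class DelayedQSketch:
    """Tabular sketch of the m-sample batching in Delayed Q-Learning."""

    def __init__(self, num_actions, m=5, gamma=0.95, eps1=0.1, q0=10.0):
        self.m, self.gamma, self.eps1 = m, gamma, eps1
        self.num_actions = num_actions
        self.q = defaultdict(lambda: q0)      # optimistic initial values
        self.u = defaultdict(float)           # accumulated targets per (s, a)
        self.count = defaultdict(int)

    def greedy(self, s):
        return max(range(self.num_actions), key=lambda a: self.q[(s, a)])

    def observe(self, s, a, reward, s_next):
        target = reward + self.gamma * self.q[(s_next, self.greedy(s_next))]
        self.u[(s, a)] += target
        self.count[(s, a)] += 1
        if self.count[(s, a)] == self.m:      # only change Q after m samples
            avg = self.u[(s, a)] / self.m
            if self.q[(s, a)] - avg >= 2 * self.eps1:
                self.q[(s, a)] = avg + self.eps1   # keep the estimate optimistic
            self.u[(s, a)] = 0.0
            self.count[(s, a)] = 0
```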


##### REQUIREMENTS
This agent is meant to work with discrete-state/discrete-action domains.

##### NOTES
There might be a bug in this implementation. The code indicates that: _"Unfortunately, I have no yet been able to get this to work consistently on the marble maze domain. It seems likely that it would work on something simpler like chain domain. Maybe there's a bug?"_.

---
### lstd.py
Implements Least Squares Temporal Difference Learning (LSTD). The main reference for this agent is:

Michail Lagoudakis and Ronald Parr, _Least-Squares Policy Iteration_. Journal of Machine Learning Research, v. 4, 2003.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
The code says: _"This is actually very nearly an implementation of LSTD-Q. The only difference with the paper, is that the code does not store the samples themselves, and instead stores A and b. This means that it can't reuse samples as effectively when the policy changes"_.
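
A sketch of what storing only A and b means in the usual LSTD formulation (illustrative; the small ridge term is an assumption added for numerical safety):

```python
import numpy as np

def lstd_weights(samples, num_features, gamma=0.95):
    """samples: iterable of (phi, reward, phi_next) feature transitions."""
    A = np.zeros((num_features, num_features))
    b = np.zeros(num_features)
    for phi, reward, phi_next in samples:
        A += np.outer(phi, phi - gamma * phi_next)   # only A and b are kept
        b += reward * phi
    return np.linalg.solve(A + 1e-6 * np.eye(num_features), b)
```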

The implementation inherits sarsa\_lambda.py.

---
### modelbased.py
Implements an agent that learns a model of the environment (e.g., using linear regression, a support vector machine, or a random forest) and plans using Fitted Q Iteration. The main reference for the planner is:

Damien Ernst, Pierre Geurts and Louis Wehenkel, [_Tree-Based Batch Mode Reinforcement Learning_](http://www.jmlr.org/papers/volume6/ernst05a/ernst05a.pdf). Journal of Machine Learning Research, v.6, 2005.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This implementation supports using a variety of basis functions to represent the agent's observations in a different space before passing them to the model learners.

The planner takes care of passing data to the model learner.
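
A rough sketch of Fitted Q Iteration with a generic fit/predict regressor (scikit-learn's ExtraTreesRegressor is assumed here purely for illustration; it is not necessarily what the planner uses):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor   # any fit/predict regressor works

def fitted_q_iteration(transitions, num_actions, gamma=0.95, iterations=20):
    """transitions: list of (state, action, reward, next_state) with array-like states."""
    states = np.array([s for s, a, r, ns in transitions])
    actions = np.array([a for s, a, r, ns in transitions])
    rewards = np.array([r for s, a, r, ns in transitions])
    next_states = np.array([ns for s, a, r, ns in transitions])
    X = np.column_stack([states, actions])           # regress Q on (state, action)

    model = None
    for _ in range(iterations):
        if model is None:
            targets = rewards                        # Q_1 is the one-step reward
        else:
            next_q = np.column_stack([
                model.predict(np.column_stack([next_states,
                                               np.full(len(next_states), a)]))
                for a in range(num_actions)])
            targets = rewards + gamma * next_q.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model
```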

---
### mirror\_descent.py
Implements [_Sparse Q-Learning with Mirror Descent_](http://www.auai.org/uai2012/papers/261.pdf) by Sridhar Mahadevan and Bo Liu, 2012. This is a _proximal-gradient_ based temporal-difference (TD) algorithm that uses a p-norm distance generating function.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent inherits qlearning.py.

---
### policy\_gradient.py (REINFORCE)
Implements the [REINFORCE](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) algorithm by Ronald J. Williams (1992).

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent inherits the policy_gradient class in policy\_gradient.py, which in turn inherits sarsa\_lambda.py.

This agent currently breaks when run on the Tetris environment.
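
For reference, a generic episodic REINFORCE sketch for a softmax policy over linear features (illustrative only; the file's actual class structure and baseline handling differ):

```python
import numpy as np

def softmax_policy(theta, phi):
    """theta has shape (num_actions, num_features); returns action probabilities."""
    prefs = theta.dot(phi)
    prefs -= prefs.max()                       # for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def reinforce_gradient(theta, episode, gamma=1.0, baseline=0.0):
    """episode is a list of (phi, action, reward) tuples from one rollout."""
    G = sum(gamma ** t * r for t, (_, _, r) in enumerate(episode))
    grad = np.zeros_like(theta)
    for phi, action, _ in episode:
        probs = softmax_policy(theta, phi)
        grad_log = -np.outer(probs, phi)       # d log pi / d theta, all rows
        grad_log[action] += phi                # ... plus phi on the taken action's row
        grad += grad_log * (G - baseline)
    return grad                                # apply as theta += alpha * grad
```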

---
### policy\_gradient.py (twotime\_ac)
Implements Regular-Gradient Actor-Critic. This is Algorithm 1 from [Natural Actor-Critic Algorithms](https://webdocs.cs.ualberta.ca/~sutton/papers/BSGL-TR.pdf) by Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee (2009).

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent inherits the policy_gradient class in policy\_gradient.py, which in turn inherits sarsa\_lambda.py.
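
A simplified sketch of one step of this style of two-timescale, average-reward actor-critic update (illustrative only; it assumes a linear critic and a precomputed compatible-feature vector, and is not the file's exact code):

```python
import numpy as np

def ac_step(v, theta, rho, phi, phi_next, grad_log_pi, reward,
            alpha=0.01, beta=0.001, xi=0.05):
    """One regular-gradient actor-critic step (average-reward formulation).

    v: critic weights, theta: actor weights (same shape as grad_log_pi),
    rho: running average-reward estimate. beta << alpha gives the two timescales.
    """
    rho = (1.0 - xi) * rho + xi * reward
    delta = reward - rho + v.dot(phi_next) - v.dot(phi)   # TD error
    v = v + alpha * delta * phi                           # critic (faster timescale)
    theta = theta + beta * delta * grad_log_pi            # actor (slower timescale)
    return v, theta, rho
```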

---
### policy\_gradient.py (twotime\_nac)
Implements Natural-Gradient Actor-Critic with Advantage Parameters. This is Algorithm 3 from [Natural Actor-Critic Algorithms](https://webdocs.cs.ualberta.ca/~sutton/papers/BSGL-TR.pdf) by Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee (2009).

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
This agent inherits the policy_gradient class in policy\_gradient.py, which in turn inherits sarsa\_lambda.py.

---
### policy\_gradient.py (nac_lstd)
Implements the [Natural Actor-Critic](https://homes.cs.washington.edu/~todorov/courses/amath579/reading/NaturalActorCritic.pdf) agent by Jan Peters and Stefan Schaal (2007). The actor updates are based on stochastic policy gradients (using Amari's natural gradient), while the critic obtains the natural gradient and additional parameters of the value function by linear regression.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
As the code indicates, this implementation _"deviates from the pseudo-code given in the paper because it uses the Sherman-Morrison formula to do incremental updates to the matrix inverse"_.
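
For reference, the Sherman-Morrison formula gives the inverse of a matrix after a rank-one update; a minimal sketch:

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return the inverse of (A + u v^T), given A_inv = A^{-1}."""
    Au = A_inv.dot(u)
    vA = v.dot(A_inv)
    return A_inv - np.outer(Au, vA) / (1.0 + v.dot(Au))

# Instead of re-solving the least-squares system from scratch each step,
# the critic can keep A^{-1} and apply this rank-one correction per sample.
```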

This agent inherits the policy_gradient class in policy\_gradient.py, which in turn inherits sarsa\_lambda.py.

---
### policy\_gradient.py (nac_sarsa)
Implements the Natural Actor-Critic with SARSA(lambda) by Philip S. Thomas. This is algorithm 2 in his 2012 [Bias in Natural Actor-Critic Algorithms](http://psthomas.com/papers/Thomas2012b.pdf) paper.

##### REQUIREMENTS
This agent is meant to work with continuous-state/discrete-action domains.

##### NOTES
The code says: _"While fundamentally the same as twotime\_nac (Algorithm 3 of BSGL's paper), this implements NACS which uses a different form of the same update equations. The main difference is in this algorithm's avoidance of the average reward accumulator"_.
6 changes: 3 additions & 3 deletions pyrl/agents/policy_gradient.py
@@ -113,14 +113,14 @@ class REINFORCE(policy_gradient):
name = "REINFORCE"

def agent_init(self,taskSpec):
super(REINFORCE, self).agent_init(self,taskSpec)
super(REINFORCE, self).agent_init(taskSpec)
self.baseline_numerator = numpy.zeros(self.weights.shape)
self.baseline_denom = numpy.zeros(self.weights.shape)
self.gradient_estimate = numpy.zeros(self.weights.shape)
self.ep_count = 0

def init_parameters(self):
super(REINFORCE, self).init_parameters(self)
super(REINFORCE, self).init_parameters()
self.num_rollouts = self.params.setdefault('num_rollouts', 5)

@classmethod
@@ -145,7 +145,7 @@ def agent_start(self,observation):

self.ep_count += 1
self.Return = 0.0
return super(REINFORCE, self).agent_start(self, observation)
return super(REINFORCE, self).agent_start(observation)

def update(self, phi_t, phi_tp, reward, compatFeatures):
self.traces += compatFeatures
17 changes: 13 additions & 4 deletions pyrl/agents/sarsa_lambda.py
@@ -28,7 +28,7 @@ def init_parameters(self):
self.epsilon = self.params.setdefault('epsilon', 0.1)
self.alpha = self.params.setdefault('alpha', 0.01)
self.lmbda = self.params.setdefault('lmbda', 0.7)
self.gamma = self.params.setdefault('gamma', 1.0)
#self.gamma = self.params.setdefault('gamma', 1.0) use env discount factor
self.fa_name = self.params.setdefault('basis', 'trivial')
self.softmax = self.params.setdefault('softmax', False)
self.basis = None
@@ -38,7 +38,7 @@ def agent_parameters(cls):
param_set = super(sarsa_lambda, cls).agent_parameters()
add_parameter(param_set, "alpha", default=0.01, help="Step-size parameter")
add_parameter(param_set, "epsilon", default=0.1, help="Exploration rate for epsilon-greedy, or rescaling factor for soft-max.")
add_parameter(param_set, "gamma", default=1.0, help="Discount factor")
# add_parameter(param_set, "gamma", default=1.0, help="Discount factor")
add_parameter(param_set, "lmbda", default=0.7, help="Eligibility decay rate")

# Parameters *NOT* used in parameter optimization
@@ -81,6 +81,8 @@ def agent_init(self,taskSpec):
print "Task Spec could not be parsed: "+taskSpecString;
sys.exit(1)

self.gamma = TaskSpec.getDiscountFactor()

self.numStates=len(TaskSpec.getDoubleObservations())
self.discStates = numpy.array(TaskSpec.getIntObservations())
self.numDiscStates = int(reduce(lambda a, b: a * (b[1] - b[0] + 1), self.discStates, 1.0))
@@ -143,9 +145,16 @@ def sample_softmax(self, state, discState):
return numpy.where(Q >= numpy.random.random())[0][0]

def egreedy(self, state, discState):

if self.randGenerator.random() < self.epsilon:
return self.randGenerator.randint(0,self.numActions-1)
return numpy.dot(self.weights[discState,:,:].T, self.basis.computeFeatures(state)).argmax()
selected_action = self.randGenerator.randint(0,self.numActions-1)
else:
Qapprox = numpy.dot(self.weights[discState,:,:].T, self.basis.computeFeatures(state))
selected_action = Qapprox.argmax()
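            # break ties uniformly at random among all maximizing actions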
max_options = numpy.where(Qapprox == Qapprox[selected_action])[0].tolist()
if len(max_options) > 1:
selected_action = max_options[self.randGenerator.randint(0,len(max_options)-1)]
return selected_action

def getDiscState(self, state):
"""Return the integer value representing the current discrete state.
47 changes: 46 additions & 1 deletion pyrl/agents/skeleton_agent.py
@@ -19,6 +19,8 @@
from pyrl.rlglue.registry import register_agent
from pyrl.misc.parameter import *

import cPickle

@register_agent
class skeleton_agent(Agent, object):
name = "Skeleton agent"
@@ -129,8 +131,20 @@ def agent_message(self,inMessage):
"""
if inMessage.lower() == "agent_diverged?":
return str(self.has_diverged())
elif len(inMessage) > 10 and inMessage.lower()[0:10] == "save_agent":
filename = inMessage.split()[1]
if self.saveAgent(filename) is True:
return "%s saved the agent state to '%s'" % (self.name,filename)
else:
return "ERROR: Could not save the agent to %s" % filename
elif len(inMessage) > 10 and inMessage.lower()[0:10] == "load_agent":
filename = inMessage.split()[1]
            if self.loadAgent(filename) is True:
return "%s loaded the agent state from '%s'" % (self.name,filename)
else:
return "ERROR: Could not load the agent state from %s" % filename
else:
return name + " does not understand your message."
return self.name + " does not understand your message."

def has_diverged(self):
"""Overwrite the function with one that checks the key values for your
@@ -140,6 +154,37 @@ def has_diverged(self):

return False

def loadAgent(self, filename):
"""Unpickle the agent
Args:
filename - file with pickled agent
"""
try:
f = open(filename,'rb')
tmp_dict = cPickle.load(f)
f.close()
print "Updating agent dictionary with the pickled data (%s)" % filename
self.__dict__.update(tmp_dict)
except IOError:
print "Failed to load agent from %s" % filename
return False
return True

def saveAgent(self, filename):
"""Pickle the agent
Args:
filename - filename of pickled agent
"""
try:
f = open(filename,'wb')
cPickle.dump(self.__dict__,f,2)
f.close()
except IOError:
print "Failed to save agent to %s" % filename
return False
return True


def runAgent(agent_class):
"""Use the agent_parameters function to parse command line arguments
and run the RL agent in network mode.
6 changes: 5 additions & 1 deletion pyrl/agents/stepsizes.py
@@ -220,7 +220,11 @@ def init_stepsize(self, weights_shape, params):
def rescale_update(self, phi_t, phi_tp, delta, reward, descent_direction):
deltaPhi = (self.gamma * phi_tp - phi_t).flatten()
denomTerm = numpy.dot(self.traces.flatten(), deltaPhi.flatten())
self.alpha = numpy.min([self.alpha, 1.0/numpy.abs(denomTerm)])
absDenomTerm = numpy.abs(denomTerm)
if absDenomTerm > 1e-6:
            self.alpha = numpy.min([self.alpha, 1.0/absDenomTerm])
        else:
            pass  # denomTerm is effectively zero; leave alpha unchanged
self.step_sizes.fill(self.alpha)
return self.step_sizes * descent_direction

5 changes: 5 additions & 0 deletions pyrl/basis/.gitignore
@@ -27,3 +27,8 @@ pip-log.txt

#Mr Developer
.mr.developer.cfg

CMakeCache.txt
CMakeFiles
Makefile
cmake_install.cmake
6 changes: 6 additions & 0 deletions pyrl/basis/CTiles/.gitignore
@@ -27,3 +27,9 @@ pip-log.txt

#Mr Developer
.mr.developer.cfg

# Compiled CTiles
CMakeCache.txt
CMakeFiles
Makefile
cmake_install.cmake
11 changes: 7 additions & 4 deletions pyrl/misc/parameter.py
@@ -65,7 +65,8 @@ def sample_exprand(self, size=None):
def parameter_set(alg_name, **kwargs):
kwargs['prog'] = alg_name
kwargs['conflict_handler'] = 'resolve'
kwargs['add_help'] = False
if not kwargs.has_key('add_help'):
kwargs['add_help'] = False
parser = argparse.ArgumentParser(**kwargs)
parser.add_argument_group(title="optimizable",
description="Algorithm parameters that should/can be optimized. " + \
@@ -81,16 +82,18 @@ def add_parameter(parser, name, min=0., max=1.0, optimize=True, **kwargs):

if kwargs.has_key('choices'):
kwargs.setdefault('type', kwargs['choices'][0].__class__)
elif kwargs.has_key('action'):
pass
else:
# Otherwise, default to float
kwargs.setdefault('type', float)
# No choices specified, so generate them based on type
if kwargs['type'] in [int, float]:
value_range = ValueRange(min, max, dtype=kwargs['type'])
kwargs['choices'] = value_range
kwargs['metavar'] = str(min) + ".." + str(max)
elif kwargs['type'] is not bool:
raise TypeError("String typed parameter requires 'choices' argument")
kwargs['metavar'] = str(min) + ".." + str(max) + " (default: " + str(kwargs['default']) + ")"
# elif kwargs['type'] is not bool:
# raise TypeError("String typed parameter requires 'choices' argument")

if optimize:
i = map(lambda k: k.title, parser._action_groups).index("optimizable")