Currently, the critic-based method updates using multiple full rollouts. We could make the buffer size smaller than the horizon length to enable online updates, but that would come at the cost of training speed.
PS: I think this is already supported, just not tested yet.
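To make the trade-off concrete, here is a minimal, purely illustrative sketch (the class and function names are hypothetical, not the actual implementation): a rollout buffer that triggers an update whenever it fills. With `buffer_size == horizon` you get one big update per rollout; with `buffer_size < horizon` you get several smaller, more "online" updates per rollout, at the cost of more frequent and less parallel update steps.

```python
# Hypothetical sketch of the buffer-size vs. horizon trade-off.
# Not the real trainer; names are illustrative assumptions.

class RolloutBuffer:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.transitions = []
        self.updates = 0  # count of update calls triggered

    def add(self, transition):
        self.transitions.append(transition)
        # Trigger an update as soon as the buffer is full.
        if len(self.transitions) >= self.buffer_size:
            self.update()

    def update(self):
        # A real trainer would run gradient steps on the batch here.
        self.updates += 1
        self.transitions.clear()

def run_rollout(horizon, buffer_size):
    buf = RolloutBuffer(buffer_size)
    for t in range(horizon):
        buf.add(t)  # placeholder transition
    return buf.updates

# buffer_size == horizon -> 1 update per rollout (current behaviour)
# buffer_size <  horizon -> multiple smaller updates per rollout
```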