# Training with Proximal Policy Optimization (PPO)

PPO is a popular reinforcement learning algorithm. See [this paper](https://arxiv.org/abs/1707.06347) for details.

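At its core, PPO optimizes the clipped surrogate objective from the paper, shown here for reference (the epsilon below corresponds to the `clipEpsilon` parameter described later on this page):

```math
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\,\hat{A}_t\big)\Big]
```

where $\hat{A}_t$ is the advantage estimate (see the GAE formula in the parameter section below).
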
Here, we only cover how to use our existing code to train your ML-Agents environment in the editor.

The example [Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md) briefly shows how to use PPO to train an existing ML-Agents environment in the editor. Here we cover the process in a little more detail.

## Overall Steps
1. Create an environment using the ML-Agents API. See the [instructions from Unity](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Create-New.md).
2. Change the BrainType of your Brain to `InternalTrainable` in the Inspector.
3. Create a Trainer:
    1. Attach a `TrainerPPO.cs` to any GameObject.
    2. Create a `TrainerParamsPPO` scriptable object with proper parameters in your project and assign it to the `Params` field in `TrainerPPO.cs`.
    3. Assign the Trainer to the `Trainer` field of your Brain.
4. Create a Model:
    1. Attach a `RLModelPPO.cs` to any GameObject.
    2. Create a `RLNetworkSimpleAC` scriptable object with proper parameters in your project and assign it to the `Network` field in `RLModelPPO.cs`.
    3. Assign the created Model to the `modelRef` field of `TrainerPPO.cs`.
5. Play and see how it works.

## Explanation of fields in the inspector
We use parameters similar to those in Unity ML-Agents. If something is confusing, see their [documentation](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-PPO.md) for more details.

### TrainerPPO.cs
* `isTraining`: Toggle this to switch between training and inference mode. Note that if `isTraining` is false when the game starts, the training part of the PPO model will not be initialized and you won't be able to train it in this run.
* `parameters`: You need to assign this field with a `TrainerParamsPPO` scriptable object.
* `continueFromCheckpoint`: If true, when the game starts, the trainer will try to load the saved checkpoint file to resume the previous training.
* `checkpointPath`: The path of the checkpoint, including the file name.
* `steps`: Shows the current step of the training.

### TrainerParamsPPO
* `learningRate`: Learning rate used to train the neural network.
* `maxTotalSteps`: The maximum number of steps the trainer will train for.
* `saveModelInterval`: The trained model will be saved every this many steps.
* `rewardDiscountFactor`: Gamma. See the PPO algorithm for details, and the GAE formula after this list.
* `rewardGAEFactor`: Lambda, used in Generalized Advantage Estimation. See the PPO algorithm for details, and the formula after this list.
* `valueLossWeight`: Weight of the value loss relative to the policy loss in PPO.
* `timeHorizon`: The maximum number of steps collected before the PPO trainer calculates the advantages from the collected data.
* `entropyLossWeight`: Weight of the entropy loss.
* `clipEpsilon`: See the PPO algorithm for details (this is the epsilon in the clipped objective near the top of this page). The default value is usually fine.
* `batchSize`: Mini batch size used during training.
* `bufferSizeForTrain`: PPO will train the model once the buffer reaches this size.
* `numEpochPerTrain`: For each training update, the data in the buffer will be reused this many times; see the worked example after this list.
* `useHeuristicChance`: See [Training with Heuristics](#training-with-heuristics).
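
For reference, here is how some of these parameters fit together. This follows the standard PPO formulation from the paper; the exact details in this code base may differ slightly. `rewardDiscountFactor` (gamma) and `rewardGAEFactor` (lambda) enter through Generalized Advantage Estimation, with `timeHorizon` playing the role of the truncation length T:

```math
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}
```

The loss being minimized combines the clipped policy objective, the value loss, and an entropy bonus, weighted by `valueLossWeight` ($c_v$) and `entropyLossWeight` ($c_e$):

```math
L(\theta) = -L^{CLIP}(\theta) + c_v\, L^{VF}(\theta) - c_e\, S\big[\pi_\theta\big]
```

As a worked example with made-up numbers: if `bufferSizeForTrain` is 2048, `batchSize` is 64, and `numEpochPerTrain` is 3, then each time the buffer fills up the trainer performs 3 passes over the 2048 collected samples in mini-batches of 64, i.e. 3 × 2048 / 64 = 96 gradient updates, before collecting new data.
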

### RLModelPPO.cs
* `checkpointToLoad`: If you assign a saved checkpoint file here, it will be loaded when the model is initialized, regardless of the trainer's loading. This can be used when you are not using a trainer.
* `Network`: You need to assign this field with a scriptable object that implements RLNetworkPPO.cs.
* `optimizer`: The type of optimizer to use for this model when training. You can also set its parameters here.

### RLNetworkSimpleAC
This is a simple implementation of RLNetworkAC that you can create and plug in as the neural network definition for any RLModelPPO. PPO uses an actor/critic structure (see the PPO algorithm); a rough sketch of this structure is given after the list below.
- `actorHiddenLayers`/`criticHiddenLayers`: Hidden layers of the network. The array size is the number of hidden layers. Each element has four parameters that define that layer. These have no default values, so you have to fill them in.
  - size: Size of this hidden layer.
  - initialScale: Initial scale of the weights. This might be important for training. Try something larger than 0 and smaller than 1.
  - useBias: Whether to use bias. Usually true.
  - activationFunction: Which activation function to use. Usually ReLU.
- `actorOutputLayerInitialScale`/`criticOutputLayerInitialScale`/`visualEncoderInitialScale`: Initial scale of the weights of the output layers.
- `actorOutputLayerBias`/`criticOutputLayerBias`/`visualEncoderBias`: Whether to use bias in the output layers.
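
As a rough sketch of the actor/critic structure these fields define (this assumes the common continuous-action setup with a Gaussian policy; check the code for the exact output parameterization used here):

```math
h = f_\text{actor}(s), \quad \mu(s) = W_a h + b_a, \quad \pi(a \mid s) = \mathcal{N}\big(\mu(s), \sigma^2\big)
```

```math
g = f_\text{critic}(s), \quad V(s) = w_c^{\top} g + b_c
```

Here $f_\text{actor}$ and $f_\text{critic}$ are the stacks defined by `actorHiddenLayers` and `criticHiddenLayers`, and the output layers are the ones controlled by the output layer initial scale and bias fields above.
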

## Training with Heuristics
If you already know a policy that is better than a random policy, you can give it to PPO as a hint to speed up the training a bit.

1. Implement [AgentDependentDeicision](AgentDependentDeicision.md) for your policy and attach it to the agents that you want to occasionally use this policy.
2. In your trainer parameters, set `useHeuristicChance` to a value larger than 0.
3. Use [TrainerParamOverride](TrainerParamOverride.md) to decrease `useHeuristicChance` over time during the training.

Note that your AgentDependentDeicision is only used in training mode. The chance of using it in each step, for an agent with the script attached, depends on `useHeuristicChance`, as sketched below.
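
Conceptually, this works like an epsilon-greedy switch between your scripted decision and the learned policy. Below is a minimal, hypothetical C# sketch of that idea; the class and method names are illustrative only and do not match this repository's actual API:

```csharp
using UnityEngine;

// Illustrative sketch only: shows how a per-step chance of using the
// heuristic decision could be applied during training. All names here
// (HeuristicMixingSketch, ChooseAction, DecideWithHeuristic,
// SampleFromPolicy) are hypothetical, not the repository's real classes.
public class HeuristicMixingSketch : MonoBehaviour
{
    // Same meaning as the trainer's useHeuristicChance parameter.
    [Range(0f, 1f)]
    public float useHeuristicChance = 0.2f;

    public float[] ChooseAction(float[] vectorObservation, bool hasDecisionScript)
    {
        // In training mode, an agent with a decision script attached uses the
        // scripted policy with probability useHeuristicChance, and the current
        // PPO policy otherwise.
        if (hasDecisionScript && Random.value < useHeuristicChance)
        {
            return DecideWithHeuristic(vectorObservation);
        }
        return SampleFromPolicy(vectorObservation);
    }

    private float[] DecideWithHeuristic(float[] obs)
    {
        // Placeholder for your scripted (heuristic) policy.
        return new float[0];
    }

    private float[] SampleFromPolicy(float[] obs)
    {
        // Placeholder for sampling an action from the PPO policy being trained.
        return new float[0];
    }
}
```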