
Commit 884bcf1

Merge remote-tracking branch 'origin/tcmxx/docs'
2 parents 78dd409 + a8ab515

File tree

7 files changed: 145 additions & 20 deletions
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# AgentDependentDecision

If you want a specific agent to use your own policy instead of the Brain's when using `TrainerPPO` or `TrainerMimic`, here is how.

Subclass the abstract class `AgentDependentDecision`. You only need to implement one abstract method:
```csharp
/// <summary>
/// Implement this method for your own AI decision.
/// </summary>
/// <param name="vectorObs">Vector observations.</param>
/// <param name="visualObs">Visual observations.</param>
/// <param name="heuristicAction">The default action from the Brain if you are not using the decision.</param>
/// <param name="heuristicVariance">The default action variance from the Brain if you are not using the decision.
/// It might be null if a discrete action space is used or the Model does not support variance.</param>
/// <returns>The actions.</returns>
public abstract float[] Decide(List<float> vectorObs, List<Texture2D> visualObs, List<float> heuristicAction, List<float> heuristicVariance = null);
```
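
As an illustration, here is a minimal sketch of such a subclass. The class name `MyScriptedDecision` and the decision rule are hypothetical and only show the override pattern; it assumes the subclass can be attached to the agent like a regular script.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical example: start from the Brain's suggested action and override
// the first action dimension with a simple hand-written rule.
public class MyScriptedDecision : AgentDependentDecision
{
    public override float[] Decide(List<float> vectorObs, List<Texture2D> visualObs,
        List<float> heuristicAction, List<float> heuristicVariance = null)
    {
        // heuristicVariance may be null (discrete action space, or no variance support).
        var actions = heuristicAction.ToArray();

        // Made-up rule purely for illustration: act against the first observation.
        if (vectorObs.Count > 0 && actions.Length > 0)
        {
            actions[0] = -Mathf.Clamp(vectorObs[0], -1f, 1f);
        }
        return actions;
    }
}
```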

Then attach your new script to the agent that should use your policy, and check `useDecision` in the Inspector.

Note that your policy is only used under certain training settings when using `TrainerPPO` or `TrainerMimic`. See [Training with Proximal Policy Optimization(PPO)](Training-PPO.md) and [Training with Imitation(Supervised Learning)](Training-SupervisedLearning.md) for more details.

Documents/BasicConcepts.md

Lines changed: 16 additions & 6 deletions
@@ -18,7 +18,6 @@ This repo provides or plans to provide following tools for game AI and machine l

4. Neural Evolution (work in progress)
    * Evolve the neural network's weights using MAES instead of gradient descent.

### Other tools
1. GAN (Generative adversarial network)
    * Including [Training with Prediction to Stabilize](https://www.semanticscholar.org/paper/Stabilizing-Adversarial-Nets-With-Prediction-Yadav-Shah/ec25504486d8751e00e613ca6fa64b256e3581c8).

@@ -32,11 +31,22 @@ You can fisrt go through the [vverview of Unity ML-Agents](https://github.com/Un

Assuming you are somewhat familiar with Unity ML-Agents, the following is a brief explanation of the concepts/key components used in this repo.
(To be added)
-* Trainer
-* Model
-* MEAS Optimizer
-* UnityNetwork
-* Agent Dependent Decision
### Trainer
The Brain in ML-Agents communicates with a Trainer to train the Model. We added a `CoreBrainInternalTrainable` on top of the existing core brains in ML-Agents, which can communicate with our Trainers. The `CoreBrainInternalTrainable` works with any MonoBehaviour that implements the `ITrainer` interface.

We have already made Trainers for specific algorithms, including PPO, Supervised Learning and Evolution Strategy.

### Model
Models are the core of our AI. You can query a Model for information, including the actions, given the observations. It also provides an interface to train the neural network.

Trainers ask Models for actions and other training-related data during training, and ask them to train the neural network once enough data has been collected.

### UnityNetwork
We defined some UnityNetwork scriptable objects, where you can easily define a neural network architecture for different Models and use them as plug-in modules (thanks to Unity's Scriptable Objects).

The Models implemented by us usually need a network scriptable object that implements a certain interface. We have already made simple versions of those networks for you, but you can also easily make your own customized network.
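
As a rough illustration of the Scriptable Object plug-in idea (generic Unity code; the class and field names are hypothetical, not this repo's actual network classes):

```csharp
using UnityEngine;

// Hypothetical sketch: a ScriptableObject asset that stores a network architecture
// definition, so it can be created from the Assets/Create menu and assigned to a Model.
[CreateAssetMenu(fileName = "MyCustomNetwork", menuName = "GameAI/MyCustomNetwork")]
public class MyCustomNetwork : ScriptableObject
{
    public int[] hiddenLayerSizes = { 128, 128 }; // sizes of the hidden layers
    public float initialWeightScale = 0.1f;       // initial scale of the weights
    public bool useBias = true;                   // whether layers use a bias term
}
```

A real network asset for this repo would additionally implement the interface the Model expects (for example, `RLNetworkSimpleAC` is described in the PPO document as an implementation of RLNetworkAC).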

## Features Not Gonna Have
1. Curriculum Training

Documents/ExamplesList.md

Lines changed: 20 additions & 2 deletions
@@ -25,7 +25,11 @@ Please go [HERE](IntelligentPoolDetails.md) for the complete description and ana
    width="600" border="10" />
</p>

-This is just a copy of the Unity ML-Agents' [3DBall environment](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#3dball-3d-balance-ball), with modifications for in editor training, as a result of [Getting Started with Balance Ball](Getting-Started-with-Balance-Ball.md) tutorial. It uses PPO.
This is just a copy of the Unity ML-Agents' [3DBall environment](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#3dball-3d-balance-ball), with modifications for the in-editor training tutorial.

* Scenes:
    - 3DBall: Basic PPO example used by the [Getting Started with Balance Ball](Getting-Started-with-Balance-Ball.md) tutorial.
    - 3DBallNE: Basic Neural Evolution example.

## Pong
<p align="center">
@@ -95,11 +99,25 @@ Click StartTraining to generate training data and start training.

Click UseGAN to generate data from GAN(blue).

## Crawler
<p align="center">
    <img src="Images/ExampleList/Crawler.png"
        alt="Crawler"
        width="600" border="10" />
</p>

This is just a copy of the Unity ML-Agents' [Crawler environment](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#crawler), with modifications.

* Scenes:
    - CrawlerNE: Neural Evolution example.
    - Crawler??: Experimental scene for hybrid training (similar to Evolved Policy Gradient).

## Walker
<p align="center">
    <img src="Images/ExampleList/Walker.png"
        alt="Walker"
        width="600" border="10" />
</p>

-A copy of Unity MLAgent's Walker example. A test scene for hybird training. Not working at all. Don't use it.
A copy of Unity ML-Agents' Walker example. A test scene for hybrid training. Not working at all. Don't use it.

Documents/Readme.md

Lines changed: 4 additions & 11 deletions
@@ -13,21 +13,14 @@ For ML-Agents or related machine learning knowledge, see ML-Agents [documentatio
* [Features and Basic Concepts](BasicConcepts.md)
* [Example Environments](ExamplesList.md)

-## Reinforcement Learning
## Learning
* [Training with Proximal Policy Optimization(PPO)](Training-PPO.md)
-* [PPO with Heuristic]
-## Supervised Learning
-* [Training with Imitation]
-* [GAN]
* [Training with Imitation(Supervised Learning)](Training-SL.md)
* [Use Neural Evolution to optimize Neural Network](Neural-Evolution.md)

## MAES Optimization
* [Use MAES Optimization to Find the Best Solution]
-* [Use MAES for Reinforcement Learning or Supervised Learning]
-## Neural Evolution and Hybrid Learning
-* [Use Neural Evolution to optimizer Neural Network]
-* [Hybrid Learning with PPO and Neural Evolution]
* [Use MAES and Supervised Learning]

## Customization
* [Define Your Own Training Process for ML-Agent]

Documents/TrainerParamOverride.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# TrainerParamOverride

Documents/Training-PPO.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Training with Proximal Policy Optimization(PPO)

PPO is a popular reinforcement learning algorithm. See [this paper](https://arxiv.org/abs/1707.06347) for details.

Here, we only describe how to use our existing code to train your ML-Agents environment in the editor.

The example [Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md) briefly shows how to use PPO to train an existing ML-Agents environment in the editor. Here we cover a few more details.

## Overall Steps
1. Create an environment using the ML-Agents API. See the [instruction from Unity](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Create-New.md).
2. Change the BrainType of your Brain to `InternalTrainable` in the Inspector.
3. Create a Trainer:
    1. Attach a `TrainerPPO.cs` to any GameObject.
    2. Create a `TrainerParamsPPO` scriptable object with proper parameters in your project and assign it to the Params field in `TrainerPPO.cs`.
    3. Assign the Trainer to the `Trainer` field of your Brain.
4. Create a Model:
    1. Attach a `RLModelPPO.cs` to any GameObject.
    2. Create a `RLNetworkSimpleAC` scriptable object with proper parameters in your project and assign it to the Network field in `RLModelPPO.cs`.
    3. Assign the created Model to the `modelRef` field in `TrainerPPO.cs`.
5. Play and see how it works.

## Explanation of the fields in the Inspector
We use parameters similar to those in Unity ML-Agents. If something is confusing, see their [document](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-PPO.md) for more details.

### TrainerPPO.cs
* `isTraining`: Toggle this to switch between training and inference mode. Note that if `isTraining` is false when the game starts, the training part of the PPO model will not be initialized and you won't be able to train it in this run.
* `parameters`: You need to assign this field with a TrainerParamsPPO scriptable object.
* `continueFromCheckpoint`: If true, when the game starts, the trainer will try to load the saved checkpoint file to resume the previous training.
* `checkpointPath`: The path of the checkpoint, including the file name.
* `steps`: Shows the current step of the training.

### TrainerParamsPPO
* `learningRate`: Learning rate used to train the neural network.
* `maxTotalSteps`: Max number of steps the trainer will train for.
* `saveModelInterval`: The trained model will be saved every this number of steps.
* `rewardDiscountFactor`: Gamma. See the PPO algorithm for details.
* `rewardGAEFactor`: Lambda. See the PPO algorithm for details.
* `valueLossWeight`: Weight of the value loss relative to the policy loss in PPO.
* `timeHorizon`: Max number of steps collected before the PPO trainer calculates the advantages from the gathered data.
* `entropyLossWeight`: Weight of the entropy loss.
* `clipEpsilon`: See the PPO algorithm for details. The default value is usually fine.
* `batchSize`: Mini-batch size used in training.
* `bufferSizeForTrain`: PPO will train the model once the buffer size reaches this value.
* `numEpochPerTrain`: For each training update, the data in the buffer is reused this many times.
* `useHeuristicChance`: See [Training with Heuristics](#training-with-heuristics).
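
For reference, these parameters map onto the standard PPO/GAE quantities from the paper linked above (general definitions, not repo-specific): `rewardDiscountFactor` is the gamma, `rewardGAEFactor` is the lambda, and `clipEpsilon` is the epsilon in

```math
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

```math
L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

As a rough worked example of the buffer-related parameters (illustrative numbers, not defaults from this repo): with `bufferSizeForTrain = 2048`, `batchSize = 64` and `numEpochPerTrain = 3`, the trainer updates once every 2048 collected steps and performs 3 × 2048 / 64 = 96 mini-batch gradient steps per update.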

### RLModelPPO.cs
* `checkpointToLoad`: If you assign a model's saved checkpoint file here, it will be loaded when the model is initialized, regardless of the trainer's loading. Might be used when you are not using a trainer.
* `Network`: You need to assign this field with a scriptable object that implements RLNetworkPPO.cs.
* `optimizer`: The type of optimizer to use for this model when training. You can also set its parameters here.

### RLNetworkSimpleAC
This is a simple implementation of RLNetworkAC that you can create and plug in as a neural network definition for any RLModelPPO. PPO uses an actor/critic structure (see the PPO algorithm).
- `actorHiddenLayers`/`criticHiddenLayers`: Hidden layers of the network. The array size is the number of hidden layers. Each element has four parameters that define that layer. They have no default values, so you have to fill them in.
    - size: Size of this hidden layer.
    - initialScale: Initial scale of the weights. This might be important for training. Try something larger than 0 and smaller than 1.
    - useBias: Whether to use bias. Usually true.
    - activationFunction: Which activation function to use. Usually Relu.
- `actorOutputLayerInitialScale`/`criticOutputLayerInitialScale`/`visualEncoderInitialScale`: Initial scale of the weights of the output layers.
- `actorOutputLayerBias`/`criticOutputLayerBias`/`visualEncoderBias`: Whether to use bias.

## Training with Heuristics
If you already know some policy that is better than a random policy, you can give it to PPO as a hint to help the training a bit.

1. Implement [AgentDependentDecision](AgentDependentDeicision.md) for your policy and attach it to the agents that should occasionally use this policy.
2. In your trainer parameters, set `useHeuristicChance` to a value larger than 0.
3. Use [TrainerParamOverride](TrainerParamOverride.md) to decrease `useHeuristicChance` over time during the training.

Note that your AgentDependentDecision is only used in training mode. The chance of using it at each step, for an agent with the script attached, depends on `useHeuristicChance`.

README.md

Lines changed: 9 additions & 1 deletion
@@ -26,5 +26,13 @@ Android does not support any type of gradient/training. IOS is not tested a all.
## Future Plan:
We plan to keep this repo updated with the latest game-related machine learning technologies for the course every year.

Possible future plans/contributions:
* Updating [KerasSharp](https://github.com/tcmxx/keras-sharp).
* More benchmark environments.
* Better API for in-game usage.
* More algorithms, including Deep Q-Learning, DeepMimic, Evolved Policy Gradient, Genetic Algorithms and so on.
* Improving the logging tool or using TensorBoard in C#.
* A graphical editor for neural network architectures.

## License
-[MIT](LICENSE)
[MIT](LICENSE).
