PPO-Q: Proximal Policy Optimization with Parametrized Quantum Policies or Values
Setup
# Ensure that the Python version is 3.10
pip install --editable ./third_party/torchquantum
pip install quarkstudio==7.0.5
pip install "gymnasium[box2d]==0.29.1"
Usage
We offer a user-friendly Python script and accompanying configuration files to facilitate training hybrid quantum-classical models in diverse reinforcement learning environments.
python main.py <config_file_name>
Replace <config_file_name> with the configuration file for the desired environment from the ./config directory, or with a custom configuration of your own.
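As a rough illustration of what a custom configuration has to specify, the sketch below collects the example values from the parameter table in this README into a plain Python dict and derives a few rollout sizes from them. The on-disk format of the files in ./config is not shown here, so the dict is only a stand-in. `rollout_sizes` is a hypothetical helper, and it assumes that `max_train_steps` counts environment steps summed over all parallel environments.

```python
# Hypothetical stand-in for a config file. Keys and values are the examples
# documented in this README; the dict layout itself is an assumption.
config = {
    "env_name": "LunarLander-v2",
    "n_steps": 1024,
    "mini_batch_size": 64,
    "max_train_steps": 1_750_000,
    "lr_a": 0.003,
    "lr_c": 0.0003,
    "gamma": 0.999,
    "lamda": 0.98,
    "epsilon": 0.2,
    "K_epochs": 4,
    "entropy_coef": 0.01,
    "num_envs": 16,
    "n_blocks": 1,
    "n_wires": 4,
    "use_quafu": True,
    "key": " ",  # placeholder, as in the README; supply your own token
}


def rollout_sizes(cfg):
    """Derive per-update sizes from a config dict (hypothetical helper)."""
    # Transitions collected per update across all parallel environments.
    batch_size = cfg["n_steps"] * cfg["num_envs"]
    # Mini-batches drawn from that batch in one PPO epoch.
    n_minibatches = batch_size // cfg["mini_batch_size"]
    # Gradient steps per update when every epoch sweeps the whole batch.
    grad_steps = n_minibatches * cfg["K_epochs"]
    # Number of updates, assuming max_train_steps is summed over environments.
    n_updates = cfg["max_train_steps"] // batch_size
    return batch_size, n_minibatches, grad_steps, n_updates


print(rollout_sizes(config))  # (16384, 256, 1024, 106)
```

Under these assumptions, one update collects 16,384 transitions, so changing `num_envs` or `n_steps` also changes how many updates fit into `max_train_steps`.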
Description of Configuration Parameters
| Parameter | Description | Example Value |
| --- | --- | --- |
| env_name | Name of the reinforcement learning environment. | LunarLander-v2 |
| n_steps | Number of steps per environment per update. | 1024 |
| mini_batch_size | Size of the mini-batch. | 64 |
| max_train_steps | Maximum number of training steps. | 1,750,000 |
| lr_a | Learning rate for the actor network. | 0.003 |
| lr_c | Learning rate for the critic network. | 0.0003 |
| gamma | Discount factor. | 0.999 |
| lamda | GAE parameter. | 0.98 |
| epsilon | PPO clip parameter. | 0.2 |
| K_epochs | Number of PPO epochs. | 4 |
| entropy_coef | Entropy coefficient. | 0.01 |
| num_envs | Number of environments to run in parallel. | 16 |
| n_blocks | Number of blocks in the quantum reinforcement learning network. | 1 |
| n_wires | Number of qubits in the quantum circuit. | 4 |
| use_quafu | Specify whether to use Quafu quantum hardware. | True |
| key | Token required for accessing Quafu cloud quantum hardware. | ' ' |
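To make the roles of `gamma`, `lamda`, and `epsilon` concrete, here is a minimal pure-Python sketch of the standard generalized advantage estimation (GAE) recursion and the standard PPO clipped surrogate term. It follows the textbook definitions rather than this repository's code, and the function names and argument layout are hypothetical.

```python
import math


def gae_advantages(rewards, values, last_value, dones, gamma=0.999, lamda=0.98):
    """Generalized advantage estimation over one trajectory segment.

    values[t] is the critic's estimate at step t, last_value is the estimate
    for the state after the final step, and dones[t] is True if step t ended
    an episode.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error, with bootstrapping cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, decayed by gamma * lamda.
        gae = delta + gamma * lamda * not_done * gae
        advantages[t] = gae
    return advantages


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """PPO clipped objective for a single sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to push the ratio outside
    # the interval [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `epsilon = 0.2`, a sample whose probability ratio has grown to 2.0 and whose advantage is positive contributes as if the ratio were 1.2, which limits how far a single update can move the policy.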
Training results can be visualized using TensorBoard:
tensorboard --logdir=./runs
Results
Benchmark reinforcement learning environments have been successfully solved using PPO-Q, as illustrated in the following table and figures.
| Environment | State Space Dimension | Action Space Dimension |
| --- | --- | --- |
| CartPole | 4 | 2 |
| MountainCar | 2 | 3 |
| Acrobot | 6 | 3 |
| LunarLander | 8 | 4 |
| MountainCar(C) | 2 | 1 |
| Pendulum | 3 | 1 |
| LunarLander(C) | 8 | 2 |
| BipedalWalker | 24 | 4 |
Figures: CartPole, Acrobot, LunarLander, MountainCarC, Pendulum, BipedalWalker.
Citation
An arXiv preprint is coming soon.