2,214 changes: 2,214 additions & 0 deletions docs/superpowers/plans/2026-03-19-phase-d-rl-integration.md

Large diffs are not rendered by default.

320 changes: 320 additions & 0 deletions docs/superpowers/specs/2026-03-19-phase-d-rl-integration-design.md
# OpenSMC Phase D: RL Integration — Design Spec

**Date:** 2026-03-19
**Status:** Approved
**Scope:** Port autosmc RL infrastructure into `opensmc` Python package + full benchmark suite

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Goal | Port infrastructure + run benchmark | Both needed for Paper 3 |
| Plants | 5 existing Gym envs | DoubleInt, Crane, Quadrotor, InvPendulum, PMSM — strong cross-section |
| RL formulation | Surface discovery (agent outputs sigma) | Core novelty; enables fingerprinting |
| Benchmark tables | Fixed law + best-matched controller | Two stories: surface quality + practical performance |
| RL algorithms | PPO + SAC | On-policy + off-policy proves algorithm-independence |
| Training mode | Independent per plant | Fingerprinting answers cross-plant question without curriculum |

## Module Structure

```
opensmc/rl/
├── __init__.py # public API (expanded)
├── rl_surface.py # EXISTS — fix: SAC loading + 4D obs padding
├── fingerprinting.py # EXISTS — no changes
├── discovery_env.py # NEW — SurfaceDiscoveryEnv (agent outputs sigma)
├── trainer.py # NEW — PPO/SAC training with VecNormalize
├── benchmark.py # NEW — RL vs 17 controllers x 5 plants
└── visualize.py # NEW — heatmaps, radar, contours, bar charts
```

## Component 1: SurfaceDiscoveryEnv

Gymnasium environment where the RL agent discovers sliding surfaces. Ported from `autosmc/envs/discovery_env.py`, adapted to wrap OpenSMC `Plant` classes.

**Interface:**
```python
env = SurfaceDiscoveryEnv(
    plant="double_integrator",     # string or Plant instance
    disturbance="sinusoidal",      # none | constant | sinusoidal | step
    disturbance_amplitude=1.0,
    dt=0.01,
    max_steps=500,                 # 5 seconds
    sigma_max=20.0,
    control_gains={"K": 5, "lam": 3, "phi": 0.05},
)
```

**String-to-plant mapping:** When `plant` is a string, the env resolves it via:
```python
PLANT_REGISTRY = {
    "double_integrator": DoubleIntegrator,
    "inverted_pendulum": InvertedPendulum,
    "crane": Crane,
    "quadrotor": Quadrotor,
    "pmsm": PMSM,
}
```
When `plant` is a `Plant` instance, it is used directly. The env calls `plant.dynamics(t, x, u)` for RK4 integration.

**Spaces:**
- Observation: `Box(4)` — `[e, edot, sigma_prev, |u_prev|]`
- low: `[-10, -10, -50, 0]`, high: `[10, 10, 50, 100]`
- Action: `Box(1)` — `[-1, 1]` scaled to `[-sigma_max, sigma_max]`

**Control law:** `u = -K * sat(sigma / phi) - lam * sigma`

This is a self-contained control law internal to the env, NOT composed with an OpenSMC `Controller` instance. This is intentional: the env must remain a standard Gymnasium environment whose `step(action)` returns the usual `(obs, reward, terminated, truncated, info)` tuple. The RL agent discovers the surface; the fixed control law converts it into a control input. This matches the autosmc formulation and ensures the training reward reflects surface quality, not controller design.
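A minimal sketch of that fixed law with the default gains, assuming `sat` is the usual unit saturation (implemented here with `np.clip`; function names are illustrative):

```python
import numpy as np

def sat(z):
    # Unit saturation: replaces sign() inside the boundary layer to soften chattering.
    return float(np.clip(z, -1.0, 1.0))

def control_law(sigma, K=5.0, lam=3.0, phi=0.05):
    # u = -K * sat(sigma / phi) - lam * sigma, the fixed law described above.
    return -K * sat(sigma / phi) - lam * sigma

print(control_law(1.0))  # -8.0: deep in saturation, u = -K - lam
```

With `phi = 0.05` the law is linear in `sigma` only inside a narrow boundary layer; outside it the switching term contributes a constant `-K * sign(sigma)`.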

**Reward:**
```
r = -1.0 * e^2 - 0.01 * u^2 - 0.005 * du^2
+ 5.0 * dt (when |e| < 0.02 and |edot| < 0.05)
- 50 (on truncation: |x[0]| > 10 or |x[1]| > 20)
```
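The reward terms above translate directly into code; a sketch (the `step_reward` name and the `out_of_bounds` flag are illustrative):

```python
def step_reward(e, edot, u, du, dt=0.01, out_of_bounds=False):
    # Quadratic penalties on tracking error, control effort, and control slew.
    r = -1.0 * e**2 - 0.01 * u**2 - 0.005 * du**2
    # In-band bonus, scaled by dt so it integrates to time-spent-in-band.
    if abs(e) < 0.02 and abs(edot) < 0.05:
        r += 5.0 * dt
    # One-off penalty when the state leaves the safe box and the episode truncates.
    if out_of_bounds:
        r -= 50.0
    return r

print(step_reward(0.0, 0.0, 0.0, 0.0))  # 0.05
```

Scaling the bonus by `dt` keeps the reward magnitude consistent if the step size changes.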

**Plant-to-error mapping:**

| Plant | e | edot | How |
|-------|---|------|-----|
| DoubleIntegrator | `x[0] - ref` | `x[1]` | direct state |
| InvertedPendulum | `x[2]` (theta) | `x[3]` (theta_dot) | direct state |
| Crane | `x[0] - ref` (trolley) | `x[1]` (trolley_dot) | direct state |
| Quadrotor | `x[2] - ref` (z) | `x[8]` (vz) | direct state |
| PMSM | `x[2] - ref` (omega) | `(1.5*pp*psi_f*x[1] - B*x[2] - TL) / J` | computed from dynamics |

PMSM `edot` is computed from the motor dynamics: `domega/dt = (Te - B*omega - TL) / J` where `Te = 1.5 * pp * psi_f * i_q` (non-salient pole simplification). The env accesses `i_q = x[1]` from the PMSM state vector `[i_d, i_q, omega, theta]`. Load torque `TL` is separate from the disturbance `d` (TL is constant mechanical load, d is added to the dynamics as external perturbation).
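The PMSM `edot` computation can be sketched as below; the default parameter values (`pp`, `psi_f`, `B`, `J`, `TL`) are placeholders for illustration, not the plant's actual constants:

```python
def pmsm_edot(x, pp=4, psi_f=0.1, B=1e-3, J=1e-3, TL=0.0):
    # State vector x = [i_d, i_q, omega, theta].
    # Te = 1.5 * pp * psi_f * i_q (non-salient pole simplification).
    i_q, omega = x[1], x[2]
    Te = 1.5 * pp * psi_f * i_q
    # domega/dt = (Te - B*omega - TL) / J
    return (Te - B * omega - TL) / J

# With i_q = 1, omega = 0, no load: edot = 1.5*4*0.1 / 1e-3 = 600 rad/s^2
print(pmsm_edot([0.0, 1.0, 0.0, 0.0]))
```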

**Integration:** RK4 with 4 substeps per dt.
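The RK4-with-substeps scheme can be sketched as a plain stepping function (the `rk4_step` name is illustrative; `f(t, x, u)` matches the `plant.dynamics` signature above, with `u` held constant across the control interval):

```python
import numpy as np

def rk4_step(f, t, x, u, dt, substeps=4):
    # Classic RK4, applied `substeps` times per control interval dt.
    h = dt / substeps
    for _ in range(substeps):
        k1 = f(t, x, u)
        k2 = f(t + h / 2, x + h / 2 * k1, u)
        k3 = f(t + h / 2, x + h / 2 * k2, u)
        k4 = f(t + h, x + h * k3, u)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

# Sanity check on xdot = -x, whose exact solution is exp(-t):
x1 = rk4_step(lambda t, x, u: -x, 0.0, np.array([1.0]), 0.0, 0.1)
print(abs(x1[0] - np.exp(-0.1)) < 1e-9)  # True
```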

**Registration:** `OpenSMC/SurfaceDiscovery-v0` with plant passed via `env_kwargs`.

## Component 2: Trainer

Wraps stable-baselines3 PPO/SAC with OpenSMC-validated defaults.

**Interface:**
```python
result = train_surface(
    plant="double_integrator",
    algorithm="PPO",              # "PPO" | "SAC"
    disturbance="sinusoidal",
    total_timesteps=500_000,
    n_envs=4,
    net_arch=[64, 64],
    seed=42,
    output_dir="trained_models/",
    eval_freq=10_000,
    save_best=True,
    normalize_obs=True,
    normalize_reward=True,
    clip_obs=10.0,
    env_kwargs=None,
)
```

**Algorithm defaults:**

| Parameter | PPO | SAC |
|-----------|-----|-----|
| learning_rate | 3e-4 | 3e-4 |
| n_steps / buffer_size | 512 | 100_000 |
| batch_size | 64 | 256 |
| n_epochs / train_freq | 10 | 1 |
| gamma | 0.99 | 0.99 |
| gae_lambda / tau | 0.95 | 0.005 |
| clip_range | 0.2 | n/a |
| net_arch | [64, 64] | [64, 64] |

No learning rate scheduling — constant 3e-4.
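The defaults table maps onto stable-baselines3 keyword arguments; a sketch of how the trainer might hold them (the `ALGO_DEFAULTS` name is hypothetical):

```python
# Per-algorithm SB3 keyword defaults mirroring the table above.
ALGO_DEFAULTS = {
    "PPO": dict(
        learning_rate=3e-4, n_steps=512, batch_size=64, n_epochs=10,
        gamma=0.99, gae_lambda=0.95, clip_range=0.2,
        policy_kwargs=dict(net_arch=[64, 64]),
    ),
    "SAC": dict(
        learning_rate=3e-4, buffer_size=100_000, batch_size=256, train_freq=1,
        gamma=0.99, tau=0.005,
        policy_kwargs=dict(net_arch=[64, 64]),
    ),
}

print(sorted(ALGO_DEFAULTS))  # ['PPO', 'SAC']
```

Keeping these in one dict lets `train_surface` do `PPO(..., **ALGO_DEFAULTS["PPO"])` and accept per-call overrides without branching.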

**VecNormalize setup:**
```python
def _make_env(plant, disturbance, env_kwargs):
    def _init():
        return SurfaceDiscoveryEnv(plant=plant, disturbance=disturbance, **(env_kwargs or {}))
    return _init

env = DummyVecEnv([_make_env(plant, disturbance, env_kwargs) for _ in range(n_envs)])
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
```

**Monitoring:** Custom `TrainingMonitor` callback that evaluates on an **unnormalized** eval env every `eval_freq` steps, tracking both cumulative reward and ISE. When `save_best=True`, also uses SB3's `EvalCallback` to auto-save the best model by mean reward.

**Output:**
```python
@dataclass
class TrainingResult:
    model_path: str               # .zip (final model)
    vecnorm_path: str             # .pkl (normalization stats)
    best_model_path: str | None   # best model path (None only when save_best=False)
    reward_history: list[float]   # per-eval rewards
    ise_history: list[float]      # per-eval ISE values
    final_reward: float
    final_ise: float
    training_time: float
    algorithm: str
    plant: str
    disturbance: str
```

**File naming:** `{algo}_{plant}_{disturbance}.zip` / `{algo}_{plant}_{disturbance}_vecnorm.pkl`

**Batch helper:**
```python
results = train_all_surfaces(
    plants=[...], algorithms=[...], disturbances=[...],
    total_timesteps=500_000, output_dir="trained_models/",
    base_seed=42,  # each run gets base_seed + index for reproducibility
)
# 5 plants x 2 algos x 4 disturbances = 40 models
```

Seed propagation: `train_all_surfaces` assigns `seed = base_seed + i` where `i` is the flat index across the (plant, algorithm, disturbance) grid. Seeds are logged in each `TrainingResult` for reproducibility.
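The flat-index seed assignment might look like the sketch below. Assumptions: `itertools.product` iteration order with plants outermost and disturbances innermost, and the `seed_grid` helper name is illustrative:

```python
from itertools import product

def seed_grid(plants, algorithms, disturbances, base_seed=42):
    # Flat index i across the (plant, algorithm, disturbance) grid -> base_seed + i.
    return {
        combo: base_seed + i
        for i, combo in enumerate(product(plants, algorithms, disturbances))
    }

seeds = seed_grid(["double_integrator", "crane"], ["PPO", "SAC"], ["none"])
print(seeds[("crane", "SAC", "none")])  # 45 (base 42 + flat index 3)
```

Deriving every seed from one `base_seed` means a single integer in the paper's appendix reproduces all 40 training runs.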

## Component 3: Benchmark

Orchestrates RL vs all controllers across all plants.

**Interface:**
```python
results = run_benchmark(
    plants=["double_integrator", "inverted_pendulum", "crane", "quadrotor", "pmsm"],
    trained_models_dir="trained_models/",
    algorithms=["PPO", "SAC"],
    disturbances=["none", "constant", "sinusoidal", "step"],
    T=5.0, dt=0.01,
    fixed_gains={"K": 5, "lam": 3, "phi": 0.05},
    include_matched=True,
    output_dir="benchmark_results/",
)
```

### Table 1 — Fixed switching law (240 simulations)

12 surfaces x 5 plants x 4 disturbances. All use `u = -K*sat(s/phi) - lam*s` with K=5, lam=3, phi=0.05. Isolates surface quality independent of controller design.

**The 12 surfaces:**
1. `LinearSurface`
2. `TerminalSurface`
3. `NonsingularTerminalSurface`
4. `FastTerminalSurface`
5. `IntegralSlidingSurface`
6. `IntegralTerminalSurface`
7. `HierarchicalSurface`
8. `PIDSurface`
9. `GlobalSurface`
10. `PredefinedTimeSurface`
11. `NonlinearDampingSurface`
12. `RLDiscoveredSurface` (loaded from trained model)

### Table 2 — Best-matched controller (340 simulations)

17 controllers x 5 plants x 4 disturbances. Each controller uses its best-matched surface via OpenSMC's composition API.

**The 17 controller configurations:**

OpenSMC uses composition: most controllers accept a `surface=` parameter. The benchmark constructs each pairing explicitly via a factory.

| # | Controller class | Surface | Composition |
|---|-----------------|---------|-------------|
| 1 | `ClassicalSMC` | `LinearSurface` | `ClassicalSMC(surface=LinearSurface(c=10))` |
| 2 | `AdaptiveSMC` | `NonlinearDampingSurface` | `AdaptiveSMC(surface=NonlinearDampingSurface(...))` |
| 3 | `DynamicSMC` | `LinearSurface` | `DynamicSMC(surface=LinearSurface(c=10))` |
| 4 | `ITSMC` | `IntegralTerminalSurface` | `ITSMC(surface=IntegralTerminalSurface(...))` |
| 5 | `NFTSMC` | `NonsingularTerminalSurface` | `NFTSMC(surface=NonsingularTerminalSurface(...))` |
| 6 | `FixedTimeSMC` | `PredefinedTimeSurface` | `FixedTimeSMC(surface=PredefinedTimeSurface(...))` |
| 7 | `FuzzySMC` | `LinearSurface` | `FuzzySMC(surface=LinearSurface(c=10))` |
| 8 | `DiscreteSMC` | `LinearSurface` | `DiscreteSMC(surface=LinearSurface(c=10))` |
| 9 | `CombiningHSMC` | `HierarchicalSurface` | `CombiningHSMC(surface=HierarchicalSurface(...))` |
| 10 | `AggregatedHSMC` | `HierarchicalSurface` | `AggregatedHSMC(surface=HierarchicalSurface(...))` |
| 11 | `IncrementalHSMC` | `HierarchicalSurface` | `IncrementalHSMC(surface=HierarchicalSurface(...))` |
| 12 | `TwistingSMC` | `LinearSurface` | `TwistingSMC(surface=LinearSurface(c=10))` |
| 13 | `QuasiContinuous2SMC` | `LinearSurface` | `QuasiContinuous2SMC(surface=LinearSurface(c=10))` |
| 14 | `NestedHOSMC` | n/a | `NestedHOSMC(order=2)` (standalone) |
| 15 | `QuasiContinuousHOSMC` | n/a | `QuasiContinuousHOSMC(order=2)` (standalone) |
| 16 | `PID` | n/a | `PID(Kp=10, Kd=5, Ki=0.5)` (standalone) |
| 17 | `LQR` | n/a | `LQR(A, B)` (standalone, plant-specific A/B) |
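The pairing factory can be sketched as below. The controller and surface classes here are minimal stand-ins for the OpenSMC classes named in the table, and `CONTROLLER_REGISTRY` / `make_matched` are hypothetical helper names:

```python
class LinearSurface:
    def __init__(self, c=10):
        self.c = c

class ClassicalSMC:
    def __init__(self, surface):
        self.surface = surface

class PID:
    def __init__(self, Kp=10, Kd=5, Ki=0.5):
        self.gains = (Kp, Kd, Ki)

# name -> (controller class, best-matched surface factory or None for standalone)
CONTROLLER_REGISTRY = {
    "ClassicalSMC": (ClassicalSMC, lambda: LinearSurface(c=10)),
    "PID": (PID, None),  # standalone baselines take no surface
}

def make_matched(name):
    cls, surface_factory = CONTROLLER_REGISTRY[name]
    if surface_factory is None:
        return cls()
    return cls(surface=surface_factory())

print(make_matched("ClassicalSMC").surface.c)  # 10
```

Surface factories (rather than shared instances) ensure each simulation gets a fresh surface with no carried-over integrator state.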

**Missing model handling:** If a trained model file is not found for a given (algorithm, plant, disturbance) combination, that RL entry is skipped with a warning logged. The benchmark still runs all classical controllers. This allows partial benchmarks (e.g., after training only PPO models).
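A sketch of the lookup-and-skip behavior, reusing the file naming convention from Component 2 (the `find_model` helper name is illustrative):

```python
import logging
import os

log = logging.getLogger("opensmc.rl.benchmark")

def find_model(models_dir, algo, plant, disturbance):
    # Follows the {algo}_{plant}_{disturbance}.zip convention from Component 2.
    path = os.path.join(models_dir, f"{algo}_{plant}_{disturbance}.zip")
    if not os.path.isfile(path):
        log.warning("No trained model at %s; skipping this RL entry", path)
        return None
    return path
```

The caller treats `None` as "omit this RL row", so a partially trained model directory still yields a complete classical-controller benchmark.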

**Metrics per simulation:** ISE, ITAE, settling time, overshoot, chattering index, steady-state error.
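The two integral metrics can be computed with a trapezoidal rule over the simulation grid; a sketch (function names are illustrative, written without `np.trapz` to stay NumPy-2 friendly):

```python
import numpy as np

def ise(t, e):
    # Integral of squared error: trapezoidal rule over the time grid.
    t, e = np.asarray(t, float), np.asarray(e, float)
    y = e ** 2
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def itae(t, e):
    # Integral of time-weighted absolute error.
    t, e = np.asarray(t, float), np.asarray(e, float)
    y = t * np.abs(e)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

t = np.linspace(0.0, 1.0, 1001)
print(round(ise(t, np.ones_like(t)), 6))   # 1.0
print(round(itae(t, np.ones_like(t)), 6))  # 0.5
```

ITAE's time weighting penalizes errors that persist late in the run, which is why it complements ISE in the tables.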

**Output:**
```python
@dataclass
class BenchmarkResults:
    table1: pd.DataFrame   # index: (surface, plant, disturbance) → metrics
    table2: pd.DataFrame   # index: (controller, plant, disturbance) → metrics
    fingerprints: dict     # {(algo, plant, disturbance): fingerprint_vector}
    rankings: dict         # {metric: ranked list of (surface/controller, score)}

    def to_latex(self) -> str: ...     # Paper-ready LaTeX tables
    def to_json(self, path: str): ...  # Machine-readable export
    def summary(self) -> str: ...      # Console-friendly summary
```

Total: 580 simulations, ~2 minutes wall time.

## Component 4: Visualize

Pure matplotlib, no GUI dependencies, publication-quality defaults.

**Functions:**
1. `surface_heatmap(model_path, plant, ...)` — sigma(e, edot) grid
2. `contour_overlay(model_path, reference_surfaces, ...)` — RL vs classical s=0 contours
3. `radar_chart(fingerprint_result)` — similarity to 10 known fingerprinting defaults
4. `cross_plant_radar(fingerprints_by_plant)` — side-by-side per plant
5. `benchmark_bars(benchmark_results, metric, plant)` — bar chart for all 12 surfaces (Table 1) or 17 controllers (Table 2)
6. `training_curve(training_result)` — reward + ISE over steps
7. `time_domain(benchmark_results, plant, top_n, disturbance)` — state/control trajectories

The radar chart uses the 10 fingerprinting defaults from `fingerprinting.get_default_known_surfaces()`. The benchmark bar charts show all 12 surfaces (Table 1) or all 17 controllers (Table 2). These are different views — fingerprinting answers "what did RL learn?", benchmarking answers "how well does it perform?".

All return `matplotlib.Figure`. Optional `save_path` for PNG/PDF export.

## Changes to Existing Files

1. **`opensmc/rl/rl_surface.py`:**
- `_load_sb3`: try `PPO.load()`, then `SAC.load()`, raise if neither works
- `_make_obs`: detect 4D models (trained on SurfaceDiscoveryEnv) and pad `[e, edot, 0.0, 0.0]` — zeros for sigma_prev and |u_prev| following autosmc's static extraction pattern

2. **`opensmc/rl/__init__.py`:** Expand public API with new imports.

3. **`pyproject.toml`:** Add `pandas>=1.5` to `[rl]` optional dependencies. Pandas is used by `benchmark.py` for results DataFrames. Users who only want training without benchmarking still get pandas — acceptable trade-off for a flat dependency surface.

## Testing

| Test file | Count | Coverage |
|-----------|-------|---------|
| `test_discovery_env.py` | ~15 | Gym API, all 5 plants, rewards, truncation, disturbances, PMSM edot |
| `test_trainer.py` | ~10 | PPO/SAC (5K steps), model save/load, VecNormalize, seed reproducibility |
| `test_benchmark.py` | ~10 | Fixed-law, matched-controller factory, metrics, missing model skip, LaTeX |
| `test_visualize.py` | ~7 | Each function returns Figure, save_path works |

Training tests use 5K steps (fast). Benchmark tests use 2 surfaces x 1 plant. All run without GPU.

## New Examples

| File | Purpose |
|------|---------|
| `examples/train_and_fingerprint.py` | Train PPO -> fingerprint -> radar chart |
| `examples/full_benchmark.py` | Full benchmark -> LaTeX tables |

## LOC Estimate

| Component | LOC |
|-----------|-----|
| discovery_env.py | ~250 |
| trainer.py | ~200 |
| benchmark.py | ~350 |
| visualize.py | ~250 |
| rl_surface.py edits | ~20 |
| __init__.py + pyproject.toml | ~12 |
| Tests (4 files) | ~300 |
| Examples (2 files) | ~100 |
| **Total** | **~1,480** |

## Source Lineage

Code is ported from `autosmc/` (D:/Ali_Kufa_University/Journals Paper/RL-Discovered-Sliding-Surfaces/autosmc/) and adapted to OpenSMC's class hierarchy. No code is imported at runtime — OpenSMC remains self-contained.
25 changes: 25 additions & 0 deletions python/examples/full_benchmark.py
"""Example: Run full benchmark — RL vs all controllers.

Usage:
    cd D:/OpenSMC/python
    python examples/full_benchmark.py
"""

from opensmc.rl import run_benchmark

print("Running benchmark (classical controllers only)...")
results = run_benchmark(
    plants=["double_integrator", "inverted_pendulum"],
    trained_models_dir=None,
    algorithms=[],
    disturbances=["none", "sinusoidal"],
    T=5.0,
    dt=0.01,
    include_matched=True,
    output_dir="benchmark_results",
)

print("\n" + results.summary())
print("\nLaTeX (first 20 lines):")
print("\n".join(results.to_latex().split("\n")[:20]))
print("\nResults saved to benchmark_results/")
41 changes: 41 additions & 0 deletions python/examples/train_and_fingerprint.py
"""Example: Train PPO on DoubleIntegrator, fingerprint, plot radar chart.

Usage:
    cd D:/OpenSMC/python
    python examples/train_and_fingerprint.py
"""

from opensmc.rl import train_surface, RLDiscoveredSurface
from opensmc.rl import fingerprinting, visualize

print("Training PPO on DoubleIntegrator...")
result = train_surface(
    plant="double_integrator",
    algorithm="PPO",
    disturbance="sinusoidal",
    total_timesteps=50_000,
    n_envs=4,
    output_dir="demo_models",
    seed=42,
)
print(f"Training complete in {result.training_time:.1f}s")
print(f"Final reward: {result.final_reward:.2f}, ISE: {result.final_ise:.4f}")

surface = RLDiscoveredSurface(result.model_path, use_4d_obs=True)

fp = fingerprinting.fingerprint(surface)
print("\nFingerprint scores:")
for name, score in sorted(fp.items(), key=lambda x: -x[1]):
    print(f"  {name:25s} {score:.3f}")

visualize.training_curve(result, save_path="demo_models/training_curve.png")
visualize.surface_heatmap(
    result.model_path,
    plant="double_integrator",
    save_path="demo_models/surface_heatmap.png",
)
visualize.radar_chart(fp, save_path="demo_models/radar.png")
print("\nFigures saved to demo_models/")