2,214 changes: 2,214 additions & 0 deletions docs/superpowers/plans/2026-03-19-phase-d-rl-integration.md

Large diffs are not rendered by default.

320 changes: 320 additions & 0 deletions docs/superpowers/specs/2026-03-19-phase-d-rl-integration-design.md
# OpenSMC Phase D: RL Integration — Design Spec

**Date:** 2026-03-19
**Status:** Approved
**Scope:** Port autosmc RL infrastructure into `opensmc` Python package + full benchmark suite

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Goal | Port infrastructure + run benchmark | Both needed for Paper 3 |
| Plants | 5 existing Gym envs | DoubleInt, Crane, Quadrotor, InvPendulum, PMSM — strong cross-section |
| RL formulation | Surface discovery (agent outputs sigma) | Core novelty; enables fingerprinting |
| Benchmark tables | Fixed law + best-matched controller | Two stories: surface quality + practical performance |
| RL algorithms | PPO + SAC | On-policy + off-policy proves algorithm-independence |
| Training mode | Independent per plant | Fingerprinting answers cross-plant question without curriculum |

## Module Structure

```
opensmc/rl/
├── __init__.py # public API (expanded)
├── rl_surface.py # EXISTS — fix: SAC loading + 4D obs padding
├── fingerprinting.py # EXISTS — no changes
├── discovery_env.py # NEW — SurfaceDiscoveryEnv (agent outputs sigma)
├── trainer.py # NEW — PPO/SAC training with VecNormalize
├── benchmark.py # NEW — RL vs 17 controllers x 5 plants
└── visualize.py # NEW — heatmaps, radar, contours, bar charts
```

## Component 1: SurfaceDiscoveryEnv

Gymnasium environment where the RL agent discovers sliding surfaces. Ported from `autosmc/envs/discovery_env.py`, adapted to wrap OpenSMC `Plant` classes.

**Interface:**
```python
env = SurfaceDiscoveryEnv(
    plant="double_integrator",     # string or Plant instance
    disturbance="sinusoidal",      # none | constant | sinusoidal | step
    disturbance_amplitude=1.0,
    dt=0.01,
    max_steps=500,                 # 5 seconds
    sigma_max=20.0,
    control_gains={"K": 5, "lam": 3, "phi": 0.05},
)
```

**String-to-plant mapping:** When `plant` is a string, the env resolves it via:
```python
PLANT_REGISTRY = {
    "double_integrator": DoubleIntegrator,
    "inverted_pendulum": InvertedPendulum,
    "crane": Crane,
    "quadrotor": Quadrotor,
    "pmsm": PMSM,
}
```
When `plant` is a `Plant` instance, it is used directly. The env calls `plant.dynamics(t, x, u)` for RK4 integration.

**Spaces:**
- Observation: `Box(4)` — `[e, edot, sigma_prev, |u_prev|]`
- low: `[-10, -10, -50, 0]`, high: `[10, 10, 50, 100]`
- Action: `Box(1)` — `[-1, 1]` scaled to `[-sigma_max, sigma_max]`

**Control law:** `u = -K * sat(sigma / phi) - lam * sigma`

This is a self-contained control law internal to the env, NOT composed with an OpenSMC `Controller` instance. This is intentional: the env must remain a standard Gymnasium environment whose `step(action)` returns the usual `(obs, reward, terminated, truncated, info)` tuple. The RL agent discovers the surface; the fixed control law converts it into a control input. This matches the autosmc formulation and ensures the training reward reflects surface quality, not controller design.
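A minimal sketch of that fixed law with the default gains, assuming `sat` is the usual unit saturation (implemented here with `np.clip`; function names are illustrative):

```python
import numpy as np

def sat(z):
    # Unit saturation: replaces sign() inside the boundary layer to soften chattering.
    return float(np.clip(z, -1.0, 1.0))

def control_law(sigma, K=5.0, lam=3.0, phi=0.05):
    # u = -K * sat(sigma / phi) - lam * sigma, the fixed law described above.
    return -K * sat(sigma / phi) - lam * sigma

print(control_law(1.0))  # -8.0: deep in saturation, u = -K - lam
```

With `phi = 0.05` the law is linear in `sigma` only inside a narrow boundary layer; outside it the switching term contributes a constant `-K * sign(sigma)`.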

**Reward:**
```
r = -1.0 * e^2 - 0.01 * u^2 - 0.005 * du^2
+ 5.0 * dt (when |e| < 0.02 and |edot| < 0.05)
- 50 (on truncation: |x[0]| > 10 or |x[1]| > 20)
```
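The reward terms above translate directly into code; a sketch (the `step_reward` name and the `out_of_bounds` flag are illustrative):

```python
def step_reward(e, edot, u, du, dt=0.01, out_of_bounds=False):
    # Quadratic penalties on tracking error, control effort, and control slew.
    r = -1.0 * e**2 - 0.01 * u**2 - 0.005 * du**2
    # In-band bonus, scaled by dt so it integrates to time-spent-in-band.
    if abs(e) < 0.02 and abs(edot) < 0.05:
        r += 5.0 * dt
    # One-off penalty when the state leaves the safe box and the episode truncates.
    if out_of_bounds:
        r -= 50.0
    return r

print(step_reward(0.0, 0.0, 0.0, 0.0))  # 0.05
```

Scaling the bonus by `dt` keeps the reward magnitude consistent if the step size changes.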

**Plant-to-error mapping:**

| Plant | e | edot | How |
|-------|---|------|-----|
| DoubleIntegrator | `x[0] - ref` | `x[1]` | direct state |
| InvertedPendulum | `x[2]` (theta) | `x[3]` (theta_dot) | direct state |
| Crane | `x[0] - ref` (trolley) | `x[1]` (trolley_dot) | direct state |
| Quadrotor | `x[2] - ref` (z) | `x[8]` (vz) | direct state |
| PMSM | `x[2] - ref` (omega) | `(1.5*pp*psi_f*x[1] - B*x[2] - TL) / J` | computed from dynamics |

PMSM `edot` is computed from the motor dynamics: `domega/dt = (Te - B*omega - TL) / J` where `Te = 1.5 * pp * psi_f * i_q` (non-salient pole simplification). The env accesses `i_q = x[1]` from the PMSM state vector `[i_d, i_q, omega, theta]`. Load torque `TL` is separate from the disturbance `d` (TL is constant mechanical load, d is added to the dynamics as external perturbation).
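The PMSM `edot` computation can be sketched as below; the default parameter values (`pp`, `psi_f`, `B`, `J`, `TL`) are placeholders for illustration, not the plant's actual constants:

```python
def pmsm_edot(x, pp=4, psi_f=0.1, B=1e-3, J=1e-3, TL=0.0):
    # State vector x = [i_d, i_q, omega, theta].
    # Te = 1.5 * pp * psi_f * i_q (non-salient pole simplification).
    i_q, omega = x[1], x[2]
    Te = 1.5 * pp * psi_f * i_q
    # domega/dt = (Te - B*omega - TL) / J
    return (Te - B * omega - TL) / J

# With i_q = 1, omega = 0, no load: edot = 1.5*4*0.1 / 1e-3 = 600 rad/s^2
print(pmsm_edot([0.0, 1.0, 0.0, 0.0]))
```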

**Integration:** RK4 with 4 substeps per dt.
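The RK4-with-substeps scheme can be sketched as a plain stepping function (the `rk4_step` name is illustrative; `f(t, x, u)` matches the `plant.dynamics` signature above, with `u` held constant across the control interval):

```python
import numpy as np

def rk4_step(f, t, x, u, dt, substeps=4):
    # Classic RK4, applied `substeps` times per control interval dt.
    h = dt / substeps
    for _ in range(substeps):
        k1 = f(t, x, u)
        k2 = f(t + h / 2, x + h / 2 * k1, u)
        k3 = f(t + h / 2, x + h / 2 * k2, u)
        k4 = f(t + h, x + h * k3, u)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

# Sanity check on xdot = -x, whose exact solution is exp(-t):
x1 = rk4_step(lambda t, x, u: -x, 0.0, np.array([1.0]), 0.0, 0.1)
print(abs(x1[0] - np.exp(-0.1)) < 1e-9)  # True
```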

**Registration:** `OpenSMC/SurfaceDiscovery-v0` with plant passed via `env_kwargs`.

## Component 2: Trainer

Wraps stable-baselines3 PPO/SAC with OpenSMC-validated defaults.

**Interface:**
```python
result = train_surface(
    plant="double_integrator",
    algorithm="PPO",              # "PPO" | "SAC"
    disturbance="sinusoidal",
    total_timesteps=500_000,
    n_envs=4,
    net_arch=[64, 64],
    seed=42,
    output_dir="trained_models/",
    eval_freq=10_000,
    save_best=True,
    normalize_obs=True,
    normalize_reward=True,
    clip_obs=10.0,
    env_kwargs=None,
)
```

**Algorithm defaults:**

| Parameter | PPO | SAC |
|-----------|-----|-----|
| learning_rate | 3e-4 | 3e-4 |
| n_steps / buffer_size | 512 | 100_000 |
| batch_size | 64 | 256 |
| n_epochs / train_freq | 10 | 1 |
| gamma | 0.99 | 0.99 |
| gae_lambda / tau | 0.95 | 0.005 |
| clip_range | 0.2 | n/a |
| net_arch | [64, 64] | [64, 64] |

No learning rate scheduling — constant 3e-4.
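The defaults table maps onto stable-baselines3 keyword arguments; a sketch of how the trainer might hold them (the `ALGO_DEFAULTS` name is hypothetical):

```python
# Per-algorithm SB3 keyword defaults mirroring the table above.
ALGO_DEFAULTS = {
    "PPO": dict(
        learning_rate=3e-4, n_steps=512, batch_size=64, n_epochs=10,
        gamma=0.99, gae_lambda=0.95, clip_range=0.2,
        policy_kwargs=dict(net_arch=[64, 64]),
    ),
    "SAC": dict(
        learning_rate=3e-4, buffer_size=100_000, batch_size=256, train_freq=1,
        gamma=0.99, tau=0.005,
        policy_kwargs=dict(net_arch=[64, 64]),
    ),
}

print(sorted(ALGO_DEFAULTS))  # ['PPO', 'SAC']
```

Keeping these in one dict lets `train_surface` do `PPO(..., **ALGO_DEFAULTS["PPO"])` and accept per-call overrides without branching.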

**VecNormalize setup:**
```python
def _make_env(plant, disturbance, env_kwargs):
    def _init():
        return SurfaceDiscoveryEnv(plant=plant, disturbance=disturbance, **(env_kwargs or {}))
    return _init

env = DummyVecEnv([_make_env(plant, disturbance, env_kwargs) for _ in range(n_envs)])
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
```

**Monitoring:** Custom `TrainingMonitor` callback that evaluates on an **unnormalized** eval env every `eval_freq` steps, tracking both cumulative reward and ISE. When `save_best=True`, also uses SB3's `EvalCallback` to auto-save the best model by mean reward.

**Output:**
```python
@dataclass
class TrainingResult:
    model_path: str               # .zip (final model)
    vecnorm_path: str             # .pkl (normalization stats)
    best_model_path: str | None   # best model path (None only when save_best=False)
    reward_history: list[float]   # per-eval rewards
    ise_history: list[float]      # per-eval ISE values
    final_reward: float
    final_ise: float
    training_time: float
    algorithm: str
    plant: str
    disturbance: str
```

**File naming:** `{algo}_{plant}_{disturbance}.zip` / `{algo}_{plant}_{disturbance}_vecnorm.pkl`

**Batch helper:**
```python
results = train_all_surfaces(
    plants=[...], algorithms=[...], disturbances=[...],
    total_timesteps=500_000, output_dir="trained_models/",
    base_seed=42,  # each run gets base_seed + index for reproducibility
)
# 5 plants x 2 algos x 4 disturbances = 40 models
```

Seed propagation: `train_all_surfaces` assigns `seed = base_seed + i` where `i` is the flat index across the (plant, algorithm, disturbance) grid. Seeds are logged in each `TrainingResult` for reproducibility.
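The flat-index seed assignment might look like the sketch below. Assumptions: `itertools.product` iteration order with plants outermost and disturbances innermost, and the `seed_grid` helper name is illustrative:

```python
from itertools import product

def seed_grid(plants, algorithms, disturbances, base_seed=42):
    # Flat index i across the (plant, algorithm, disturbance) grid -> base_seed + i.
    return {
        combo: base_seed + i
        for i, combo in enumerate(product(plants, algorithms, disturbances))
    }

seeds = seed_grid(["double_integrator", "crane"], ["PPO", "SAC"], ["none"])
print(seeds[("crane", "SAC", "none")])  # 45 (base 42 + flat index 3)
```

Deriving every seed from one `base_seed` means a single integer in the paper's appendix reproduces all 40 training runs.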

## Component 3: Benchmark

Orchestrates RL vs all controllers across all plants.

**Interface:**
```python
results = run_benchmark(
    plants=["double_integrator", "inverted_pendulum", "crane", "quadrotor", "pmsm"],
    trained_models_dir="trained_models/",
    algorithms=["PPO", "SAC"],
    disturbances=["none", "constant", "sinusoidal", "step"],
    T=5.0, dt=0.01,
    fixed_gains={"K": 5, "lam": 3, "phi": 0.05},
    include_matched=True,
    output_dir="benchmark_results/",
)
```

### Table 1 — Fixed switching law (240 simulations)

12 surfaces x 5 plants x 4 disturbances. All use `u = -K*sat(s/phi) - lam*s` with K=5, lam=3, phi=0.05. Isolates surface quality independent of controller design.

**The 12 surfaces:**
1. `LinearSurface`
2. `TerminalSurface`
3. `NonsingularTerminalSurface`
4. `FastTerminalSurface`
5. `IntegralSlidingSurface`
6. `IntegralTerminalSurface`
7. `HierarchicalSurface`
8. `PIDSurface`
9. `GlobalSurface`
10. `PredefinedTimeSurface`
11. `NonlinearDampingSurface`
12. `RLDiscoveredSurface` (loaded from trained model)

### Table 2 — Best-matched controller (340 simulations)

17 controllers x 5 plants x 4 disturbances. Each controller uses its best-matched surface via OpenSMC's composition API.

**The 17 controller configurations:**

OpenSMC uses composition: most controllers accept a `surface=` parameter. The benchmark constructs each pairing explicitly via a factory.

| # | Controller class | Surface | Composition |
|---|-----------------|---------|-------------|
| 1 | `ClassicalSMC` | `LinearSurface` | `ClassicalSMC(surface=LinearSurface(c=10))` |
| 2 | `AdaptiveSMC` | `NonlinearDampingSurface` | `AdaptiveSMC(surface=NonlinearDampingSurface(...))` |
| 3 | `DynamicSMC` | `LinearSurface` | `DynamicSMC(surface=LinearSurface(c=10))` |
| 4 | `ITSMC` | `IntegralTerminalSurface` | `ITSMC(surface=IntegralTerminalSurface(...))` |
| 5 | `NFTSMC` | `NonsingularTerminalSurface` | `NFTSMC(surface=NonsingularTerminalSurface(...))` |
| 6 | `FixedTimeSMC` | `PredefinedTimeSurface` | `FixedTimeSMC(surface=PredefinedTimeSurface(...))` |
| 7 | `FuzzySMC` | `LinearSurface` | `FuzzySMC(surface=LinearSurface(c=10))` |
| 8 | `DiscreteSMC` | `LinearSurface` | `DiscreteSMC(surface=LinearSurface(c=10))` |
| 9 | `CombiningHSMC` | `HierarchicalSurface` | `CombiningHSMC(surface=HierarchicalSurface(...))` |
| 10 | `AggregatedHSMC` | `HierarchicalSurface` | `AggregatedHSMC(surface=HierarchicalSurface(...))` |
| 11 | `IncrementalHSMC` | `HierarchicalSurface` | `IncrementalHSMC(surface=HierarchicalSurface(...))` |
| 12 | `TwistingSMC` | `LinearSurface` | `TwistingSMC(surface=LinearSurface(c=10))` |
| 13 | `QuasiContinuous2SMC` | `LinearSurface` | `QuasiContinuous2SMC(surface=LinearSurface(c=10))` |
| 14 | `NestedHOSMC` | n/a | `NestedHOSMC(order=2)` (standalone) |
| 15 | `QuasiContinuousHOSMC` | n/a | `QuasiContinuousHOSMC(order=2)` (standalone) |
| 16 | `PID` | n/a | `PID(Kp=10, Kd=5, Ki=0.5)` (standalone) |
| 17 | `LQR` | n/a | `LQR(A, B)` (standalone, plant-specific A/B) |
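The pairing factory can be sketched as below. The controller and surface classes here are minimal stand-ins for the OpenSMC classes named in the table, and `CONTROLLER_REGISTRY` / `make_matched` are hypothetical helper names:

```python
class LinearSurface:
    def __init__(self, c=10):
        self.c = c

class ClassicalSMC:
    def __init__(self, surface):
        self.surface = surface

class PID:
    def __init__(self, Kp=10, Kd=5, Ki=0.5):
        self.gains = (Kp, Kd, Ki)

# name -> (controller class, best-matched surface factory or None for standalone)
CONTROLLER_REGISTRY = {
    "ClassicalSMC": (ClassicalSMC, lambda: LinearSurface(c=10)),
    "PID": (PID, None),  # standalone baselines take no surface
}

def make_matched(name):
    cls, surface_factory = CONTROLLER_REGISTRY[name]
    if surface_factory is None:
        return cls()
    return cls(surface=surface_factory())

print(make_matched("ClassicalSMC").surface.c)  # 10
```

Surface factories (rather than shared instances) ensure each simulation gets a fresh surface with no carried-over integrator state.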

**Missing model handling:** If a trained model file is not found for a given (algorithm, plant, disturbance) combination, that RL entry is skipped with a warning logged. The benchmark still runs all classical controllers. This allows partial benchmarks (e.g., after training only PPO models).
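A sketch of the lookup-and-skip behavior, reusing the file naming convention from Component 2 (the `find_model` helper name is illustrative):

```python
import logging
import os

log = logging.getLogger("opensmc.rl.benchmark")

def find_model(models_dir, algo, plant, disturbance):
    # Follows the {algo}_{plant}_{disturbance}.zip convention from Component 2.
    path = os.path.join(models_dir, f"{algo}_{plant}_{disturbance}.zip")
    if not os.path.isfile(path):
        log.warning("No trained model at %s; skipping this RL entry", path)
        return None
    return path
```

The caller treats `None` as "omit this RL row", so a partially trained model directory still yields a complete classical-controller benchmark.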

**Metrics per simulation:** ISE, ITAE, settling time, overshoot, chattering index, steady-state error.
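The two integral metrics can be computed with a trapezoidal rule over the simulation grid; a sketch (function names are illustrative, written without `np.trapz` to stay NumPy-2 friendly):

```python
import numpy as np

def ise(t, e):
    # Integral of squared error: trapezoidal rule over the time grid.
    t, e = np.asarray(t, float), np.asarray(e, float)
    y = e ** 2
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def itae(t, e):
    # Integral of time-weighted absolute error.
    t, e = np.asarray(t, float), np.asarray(e, float)
    y = t * np.abs(e)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

t = np.linspace(0.0, 1.0, 1001)
print(round(ise(t, np.ones_like(t)), 6))   # 1.0
print(round(itae(t, np.ones_like(t)), 6))  # 0.5
```

ITAE's time weighting penalizes errors that persist late in the run, which is why it complements ISE in the tables.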

**Output:**
```python
@dataclass
class BenchmarkResults:
    table1: pd.DataFrame   # index: (surface, plant, disturbance) → metrics
    table2: pd.DataFrame   # index: (controller, plant, disturbance) → metrics
    fingerprints: dict     # {(algo, plant, disturbance): fingerprint_vector}
    rankings: dict         # {metric: ranked list of (surface/controller, score)}

    def to_latex(self) -> str: ...     # Paper-ready LaTeX tables
    def to_json(self, path: str): ...  # Machine-readable export
    def summary(self) -> str: ...      # Console-friendly summary
```

Total: 580 simulations, ~2 minutes wall time.

## Component 4: Visualize

Pure matplotlib, no GUI dependencies, publication-quality defaults.

**Functions:**
1. `surface_heatmap(model_path, plant, ...)` — sigma(e, edot) grid
2. `contour_overlay(model_path, reference_surfaces, ...)` — RL vs classical s=0 contours
3. `radar_chart(fingerprint_result)` — similarity to 10 known fingerprinting defaults
4. `cross_plant_radar(fingerprints_by_plant)` — side-by-side per plant
5. `benchmark_bars(benchmark_results, metric, plant)` — bar chart for all 12 surfaces (Table 1) or 17 controllers (Table 2)
6. `training_curve(training_result)` — reward + ISE over steps
7. `time_domain(benchmark_results, plant, top_n, disturbance)` — state/control trajectories

The radar chart uses the 10 fingerprinting defaults from `fingerprinting.get_default_known_surfaces()`. The benchmark bar charts show all 12 surfaces (Table 1) or all 17 controllers (Table 2). These are different views — fingerprinting answers "what did RL learn?", benchmarking answers "how well does it perform?".

All return `matplotlib.Figure`. Optional `save_path` for PNG/PDF export.

## Changes to Existing Files

1. **`opensmc/rl/rl_surface.py`:**
- `_load_sb3`: try `PPO.load()`, then `SAC.load()`, raise if neither works
- `_make_obs`: detect 4D models (trained on SurfaceDiscoveryEnv) and pad `[e, edot, 0.0, 0.0]` — zeros for sigma_prev and |u_prev| following autosmc's static extraction pattern

2. **`opensmc/rl/__init__.py`:** Expand public API with new imports.

3. **`pyproject.toml`:** Add `pandas>=1.5` to `[rl]` optional dependencies. Pandas is used by `benchmark.py` for results DataFrames. Users who only want training without benchmarking still get pandas — acceptable trade-off for a flat dependency surface.

## Testing

| Test file | Count | Coverage |
|-----------|-------|---------|
| `test_discovery_env.py` | ~15 | Gym API, all 5 plants, rewards, truncation, disturbances, PMSM edot |
| `test_trainer.py` | ~10 | PPO/SAC (5K steps), model save/load, VecNormalize, seed reproducibility |
| `test_benchmark.py` | ~10 | Fixed-law, matched-controller factory, metrics, missing model skip, LaTeX |
| `test_visualize.py` | ~7 | Each function returns Figure, save_path works |

Training tests use 5K steps (fast). Benchmark tests use 2 surfaces x 1 plant. All run without GPU.

## New Examples

| File | Purpose |
|------|---------|
| `examples/train_and_fingerprint.py` | Train PPO -> fingerprint -> radar chart |
| `examples/full_benchmark.py` | Full benchmark -> LaTeX tables |

## LOC Estimate

| Component | LOC |
|-----------|-----|
| discovery_env.py | ~250 |
| trainer.py | ~200 |
| benchmark.py | ~350 |
| visualize.py | ~250 |
| rl_surface.py edits | ~20 |
| __init__.py + pyproject.toml | ~12 |
| Tests (4 files) | ~300 |
| Examples (2 files) | ~100 |
| **Total** | **~1,480** |

## Source Lineage

Code is ported from `autosmc/` (D:/Ali_Kufa_University/Journals Paper/RL-Discovered-Sliding-Surfaces/autosmc/) and adapted to OpenSMC's class hierarchy. No code is imported at runtime — OpenSMC remains self-contained.
25 changes: 25 additions & 0 deletions python/examples/full_benchmark.py
"""Example: Run full benchmark — RL vs all controllers.

Usage:
    cd D:/OpenSMC/python
    python examples/full_benchmark.py
"""

from opensmc.rl import run_benchmark

print("Running benchmark (classical controllers only)...")
results = run_benchmark(
    plants=["double_integrator", "inverted_pendulum"],
    trained_models_dir=None,
    algorithms=[],
    disturbances=["none", "sinusoidal"],
    T=5.0,
    dt=0.01,
    include_matched=True,
    output_dir="benchmark_results",
)

print("\n" + results.summary())
print("\nLaTeX (first 20 lines):")
print("\n".join(results.to_latex().split("\n")[:20]))
print("\nResults saved to benchmark_results/")
41 changes: 41 additions & 0 deletions python/examples/train_and_fingerprint.py
"""Example: Train PPO on DoubleIntegrator, fingerprint, plot radar chart.

Usage:
    cd D:/OpenSMC/python
    python examples/train_and_fingerprint.py
"""

from opensmc.rl import train_surface, RLDiscoveredSurface
from opensmc.rl import fingerprinting, visualize

print("Training PPO on DoubleIntegrator...")
result = train_surface(
    plant="double_integrator",
    algorithm="PPO",
    disturbance="sinusoidal",
    total_timesteps=50_000,
    n_envs=4,
    output_dir="demo_models",
    seed=42,
)
print(f"Training complete in {result.training_time:.1f}s")
print(f"Final reward: {result.final_reward:.2f}, ISE: {result.final_ise:.4f}")

surface = RLDiscoveredSurface(result.model_path, use_4d_obs=True)

fp = fingerprinting.fingerprint(surface)
print("\nFingerprint scores:")
for name, score in sorted(fp.items(), key=lambda x: -x[1]):
    print(f"  {name:25s} {score:.3f}")

visualize.training_curve(result, save_path="demo_models/training_curve.png")
visualize.surface_heatmap(
    result.model_path,
    plant="double_integrator",
    save_path="demo_models/surface_heatmap.png",
)
visualize.radar_chart(fp, save_path="demo_models/radar.png")
print("\nFigures saved to demo_models/")