diff --git a/.env.local b/.env.example
similarity index 58%
rename from .env.local
rename to .env.example
index 824375a..b0e7fd6 100644
--- a/.env.local
+++ b/.env.example
@@ -1,11 +1,11 @@
 # NSM Project Environment Configuration
-# Source this file before running experiments
+# Copy this file to .env.local and customize for your local setup
 
 # Primary repository path for baseline tracking
-export NSM_REPO_ROOT="/Users/preston/Projects/NSM"
+export NSM_REPO_ROOT="/path/to/your/NSM"
 
 # Baseline tracking file
 export NSM_BASELINES_FILE="${NSM_REPO_ROOT}/baselines.jsonl"
 
 # Worktree directory for parallel experiments
-export NSM_WORKTREE_ROOT="/Users/preston/Projects"
+export NSM_WORKTREE_ROOT="/path/to/your/worktrees"
diff --git a/.gitignore b/.gitignore
index a8c8826..e7f04d6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,6 +3,7 @@
 
 # Environment variables
 .env
+.env.local
 
 # Python
 __pycache__/
diff --git a/CHECKPOINT_INTEGRATION_SUMMARY.md b/CHECKPOINT_INTEGRATION_SUMMARY.md
new file mode 100644
index 0000000..b48dbc1
--- /dev/null
+++ b/CHECKPOINT_INTEGRATION_SUMMARY.md
@@ -0,0 +1,306 @@
+# Checkpoint Storage & CGT Integration Setup
+
+**Date**: 2025-10-23
+**Status**: ✅ Complete - Ready for use
+
+---
+
+## Summary
+
+Created comprehensive checkpoint management system for NSM experiments with full CGT integration. Checkpoints are now stored in both Modal volumes and the local repo, enabling trained models to be loaded into CGT validation experiments.
+
+## What Was Created
+
+### 1. Checkpoint Manager (`nsm/utils/checkpoint_manager.py`)
+
+Unified checkpoint saving/loading with metadata tracking:
+
+```python
+from nsm.utils.checkpoint_manager import CheckpointManager, save_nsm_checkpoint
+
+# During training
+checkpoint_manager = CheckpointManager("/checkpoints", "nsm-10x-baseline")
+checkpoint_manager.save_checkpoint(
+    model=model,
+    epoch=15,
+    metrics={"val_accuracy": 0.67},
+    config=config,
+    is_best=True  # Saves as nsm-10x-baseline_best.pt
+)
+
+# For CGT validation
+checkpoint = checkpoint_manager.load_best_checkpoint(model, device='cuda')
+```
+
+**Features**:
+- Saves model state, optimizer state, metrics, and config
+- Tracks best model separately (`*_best.pt`)
+- Generates JSON metadata for easy inspection
+- Works in both local and Modal environments
+
+### 2. Checkpoint Download Script (`scripts/download_checkpoints.py`)
+
+Downloads checkpoints from Modal volume to local repo:
+
+```bash
+# Download all checkpoints
+python scripts/download_checkpoints.py
+
+# Download specific pattern
+python scripts/download_checkpoints.py --pattern "*best*"
+
+# Custom destination
+python scripts/download_checkpoints.py --destination my_checkpoints/
+```
+
+### 3. CGT Full Training Script (`nsm-cgt/experiments/modal_cgt_full_training.py`)
+
+Production-ready CGT training with checkpoint integration:
+
+```bash
+# Train from scratch (15 epochs like NSM-33)
+modal run experiments/modal_cgt_full_training.py::train_from_scratch
+
+# Load NSM-33 checkpoint and continue training
+modal run experiments/modal_cgt_full_training.py::train_from_checkpoint \
+  --checkpoint=nsm-10x-baseline_best.pt
+
+# Just track CGT operators on existing checkpoint (no training)
+modal run experiments/modal_cgt_full_training.py::track_checkpoint \
+  --checkpoint=nsm-10x-baseline_best.pt
+```
+
+**Key Features**:
+- Full 15-epoch training (vs previous 5-epoch minimal)
+- CGT operator tracking at every epoch
+- Loads pre-trained NSM-33 models as initialization
+- Saves checkpoints with CGT metrics included
+- Graceful handling of missing checkpoints
+
+---
+
+## Current Checkpoint Status
+
+### Modal Volume (`nsm-checkpoints`)
+
+**Results Files** (JSON):
+- `10x_baseline_results.json` - 66% accuracy, 15 epochs
+- `10x_fixed_temp_results.json` - 65.57% accuracy, 15 epochs
+
+**Model Checkpoints** (.pt):
+- ⚠️ **None yet** - Current scripts only save results, not models
+
+**Dataset Directories**:
+- `planning/` - Planning dataset cache
+- `kg/` - Knowledge graph dataset cache
+- `causal/` - Causal reasoning dataset cache
+
+### Local Repo (`checkpoints/`)
+
+**Currently**:
+- `10x_baseline_results.json` (downloaded)
+- Empty otherwise (no .pt files)
+
+**After Next Training Run**:
+- `nsm-10x-baseline_best.pt` - Best model checkpoint
+- `nsm-10x-baseline_epoch15_*.pt` - Final epoch
+- `nsm-cgt-planning_best.pt` - CGT-tracked model
+- Etc.
+
+---
+
+## Integration Workflow
+
+### Step 1: Add Checkpoint Saving to NSM-33 Experiments
+
+Current NSM-33 scripts (`modal_10x_baseline.py`, etc.) need modification to save model checkpoints:
+
+```python
+# Add to imports
+from nsm.utils.checkpoint_manager import save_nsm_checkpoint
+
+# In training loop, after validation
+if val_accuracy > best_val_accuracy:
+    best_val_accuracy = val_accuracy
+
+    # NEW: Save checkpoint
+    save_nsm_checkpoint(
+        model=model,
+        epoch=epoch + 1,
+        val_accuracy=val_accuracy,
+        config=config,
+        checkpoint_dir="/checkpoints",
+        experiment_name="nsm-10x-baseline",
+        is_best=True
+    )
+```
+
+**Action Required**: Modify existing Modal scripts to add checkpoint saving
+
+### Step 2: Download Checkpoints to Repo
+
+After training runs complete:
+
+```bash
+cd /Users/preston/Projects/NSM
+python scripts/download_checkpoints.py
+```
+
+This populates `checkpoints/` with trained models.
+
+### Step 3: Use Checkpoints in CGT
+
+```bash
+cd /Users/preston/Projects/nsm-cgt
+
+# Track CGT operators on NSM-33 baseline
+modal run experiments/modal_cgt_full_training.py::track_checkpoint \
+  --checkpoint=nsm-10x-baseline_best.pt
+
+# Or train further with CGT tracking
+modal run experiments/modal_cgt_full_training.py::train_from_checkpoint \
+  --checkpoint=nsm-10x-baseline_best.pt --epochs=20
+```
+
+---
+
+## File Organization
+
+```
+NSM/
+├── checkpoints/              # Local checkpoint storage
+│   ├── 10x_baseline_results.json
+│   ├── nsm-10x-baseline_best.pt  (after next run)
+│   └── *.json (metadata)
+│
+├── nsm/utils/
+│   └── checkpoint_manager.py  # Checkpoint utilities
+│
+├── scripts/
+│   └── download_checkpoints.py  # Modal → local sync
+│
+└── experiments/
+    └── modal_10x_*.py  # Need modification to save checkpoints
+
+nsm-cgt/  (worktree)
+└── experiments/
+    ├── modal_cgt_full_training.py  # NEW: Full training + CGT
+    ├── modal_cgt_validation.py     # Updated with health checks
+    └── modal_cgt_training.py       # Original 5-epoch version
+```
+
+---
+
+## Next Steps
+
+### Immediate (To Start Using Checkpoints)
+
+1. **Modify NSM-33 baseline script** to save checkpoints:
+   ```bash
+   # Edit: experiments/modal_10x_baseline.py
+   # Add checkpoint saving in training loop (lines ~390-400)
+   ```
+
+2. **Rerun one NSM-33 experiment** to generate checkpoint:
+   ```bash
+   modal run experiments/modal_10x_baseline.py::validate_10x_baseline
+   ```
+
+3. **Download checkpoint** to repo:
+   ```bash
+   python scripts/download_checkpoints.py
+   ```
+
+4. **Run CGT tracking** on trained model:
+   ```bash
+   cd ../nsm-cgt
+   modal run experiments/modal_cgt_full_training.py::track_checkpoint \
+     --checkpoint=nsm-10x-baseline_best.pt
+   ```
+
+### Future Enhancements
+
+- **Auto-sync**: Cron job or GitHub Action to download checkpoints nightly
+- **Checkpoint browser**: Web UI to visualize checkpoint metrics
+- **Multi-checkpoint comparison**: CGT tracking across multiple checkpoints in parallel
+- **Git LFS**: Use Git Large File Storage for .pt files (currently gitignored)
+
+---
+
+## Benefits
+
+**Before**:
+- ❌ No model checkpoints saved
+- ❌ CGT tested on untrained models (temp = 0.00)
+- ❌ Could not compare CGT across training stages
+- ❌ Results not reproducible (models discarded)
+
+**After**:
+- ✅ Models saved with full metadata
+- ✅ CGT validated on production-trained models
+- ✅ Track temperature evolution across epochs
+- ✅ Reproducible results (load any checkpoint)
+- ✅ Seamless Modal ↔ Local workflow
+
+---
+
+## Example Usage
+
+### Train NSM with Checkpoints (Once Scripts Modified)
+
+```bash
+# Run NSM-33 baseline with checkpoint saving
+modal run experiments/modal_10x_baseline.py::validate_10x_baseline
+
+# Check Modal volume
+modal volume ls nsm-checkpoints
+# Output:
+#   nsm-10x-baseline_best.pt
+#   nsm-10x-baseline_epoch15_*.pt
+#   10x_baseline_results.json
+```
+
+### Download & Use in CGT
+
+```bash
+# Download to local repo
+python scripts/download_checkpoints.py
+
+# Verify download
+ls -lh checkpoints/*.pt
+# Output:
+#   nsm-10x-baseline_best.pt (47 MB)
+
+# Track CGT operators on trained model
+cd ../nsm-cgt
+modal run experiments/modal_cgt_full_training.py::track_checkpoint \
+  --checkpoint=nsm-10x-baseline_best.pt
+
+# Expected output:
+#   ✅ Loaded checkpoint from epoch 15
+#   📊 Tracking CGT operators...
+#   Conway Temperature: 0.3521 (healthy zone)
+#   Cooling Rate: -0.0023
+#   ✅ CGT Temperature: 0.3521
+```
+
+---
+
+## Current Status of Multi-Seed Experiments
+
+While building checkpoint system, multi-seed experiments are still running:
+
+- **Seed 42 Fixed Temp**: Epoch 7/15, accuracy 63.44%
+- **Seed 42 Baseline**: Failed (Modal timeout - not code issue)
+- **Seeds 123, 456, 789, 1011**: Queued/running
+
+Once complete, can use `download_checkpoints.py` to fetch all best models for analysis.
+
+---
+
+## Questions?
+
+See:
+- `nsm/utils/checkpoint_manager.py` - Implementation details
+- `experiments/modal_cgt_full_training.py` - Usage examples
+- `scripts/download_checkpoints.py` - Download workflow
diff --git a/PR17_CRITICAL_FIXES.md b/PR17_CRITICAL_FIXES.md
new file mode 100644
index 0000000..caf0bb3
--- /dev/null
+++ b/PR17_CRITICAL_FIXES.md
@@ -0,0 +1,262 @@
+# PR #17 Critical Bug Fixes - Chiral Architecture
+
+**Status**: REQUIRED BEFORE MERGE
+**Priority**: CRITICAL - Blocks PR #17 approval
+**File**: `nsm/models/chiral.py` (721 lines)
+**Branch**: `phase1a-merge-causal-3level-to-causal`
+
+## Summary
+
+PR #17 received **CONDITIONAL APPROVAL** from Claude Code Review with 3 critical code issues that must be fixed before merge. These issues affect gradient flow, information loss, and numerical stability in the 6-level chiral architecture.
+
+**Review Decision**: "⚠️ CONDITIONAL APPROVAL - Recommend addressing critical issues before merge"
+
+## Critical Issues Identified
+
+### Issue #1: Gradient Instability in 6-Level Model (CRITICAL)
+
+**Location**: Lines 594-598
+**Severity**: CRITICAL - Breaks gradient computation graph
+
+**Problem**: In-place tensor assignments break PyTorch's autograd
+```python
+# CURRENT (BROKEN):
+x_l3_to_l2 = torch.zeros(num_l2_nodes, self.node_features, device=x_l1.device)
+x_l3_to_l2[perm_l3] = x_l3_refined  # ⚠️ In-place assignment breaks gradients
+
+x_l2_to_l1 = torch.zeros_like(x_l1)
+x_l2_to_l1[perm_l2] = self.reconstruct_l1_from_l3(x_l3_to_l2)  # ⚠️ Same issue
+```
+
+**Root Cause**: Direct indexing assignment (`tensor[indices] = values`) creates a new tensor that doesn't track gradients through the computational graph.
+
+**Fix**: Use `scatter_` with proper gradient tracking
+```python
+# FIXED:
+x_l3_to_l2 = torch.zeros(num_l2_nodes, self.node_features, device=x_l1.device)
+# scatter_ properly tracks gradients through indexing operations
+x_l3_to_l2.scatter_(0, perm_l3.unsqueeze(1).expand(-1, self.node_features), x_l3_refined)
+
+x_l2_to_l1 = torch.zeros_like(x_l1)
+x_l2_to_l1.scatter_(0, perm_l2.unsqueeze(1).expand(-1, self.node_features),
+                   self.reconstruct_l1_from_l3(x_l3_to_l2))
+```
+
+**Impact**: Without this fix, gradients will not propagate through the unpooling operations, leading to:
+- Poor training convergence
+- Inability to learn proper cycle reconstruction
+- Potentially vanishing gradients in L3 representations
+
+---
+
+### Issue #2: Lossy Size Alignment Algorithm (HIGH)
+
+**Location**: Lines 451-456
+**Severity**: HIGH - Information loss and gradient issues
+
+**Problem**: Naive nearest neighbor interpolation loses information and doesn't use `perm_large` parameter
+```python
+# CURRENT (LOSSY):
+# Map each large node to nearest small node (simple nearest neighbor)
+indices = (torch.arange(num_large, device=x_small.device).float() * (num_small / num_large)).long()
+indices = torch.clamp(indices, 0, num_small - 1)
+x_aligned = x_small[indices]  # Information loss, perm_large parameter UNUSED
+```
+
+**Root Cause**:
+1. The current implementation ignores the `perm_large` parameter entirely
+2. Nearest neighbor interpolation creates duplicate values instead of proper unpooling
+3. Doesn't properly invert the pooling operation
+
+**Fix**: Implement proper unpooling using `perm_large`
+```python
+# FIXED (Option B - Proper Unpooling):
+x_aligned = torch.zeros(num_large, dim, device=x_small.device, dtype=x_small.dtype)
+
+# Place small tensor values at positions specified by perm_large
+# Handles case where num_small < num_large by only using valid indices
+valid_size = min(num_small, perm_large.size(0))
+x_aligned[perm_large[:valid_size]] = x_small[:valid_size]
+```
+
+**Impact**: Without this fix:
+- Information is duplicated instead of properly distributed
+- The `perm_large` pooling indices are completely ignored
+- Unpooling doesn't properly invert pooling
+- Gradient flow is suboptimal
+
+---
+
+### Issue #3: Numerical Stability in Normalization (MEDIUM)
+
+**Location**: Lines 395-398 and 419-420
+**Severity**: MEDIUM - Can cause NaN/Inf in edge cases
+
+**Problem**: Replacing near-zero scale with 1.0 causes incorrect normalization
+```python
+# CURRENT (UNSTABLE):
+# In _normalize_features:
+scale = max_val - min_val
+scale = torch.where(scale < 1e-8, torch.ones_like(scale), scale)  # ⚠️ Incorrect
+x_normalized = (x - min_val) / scale  # When scale was near zero, this is wrong
+
+# In _denormalize_features:
+scale = max_val - min_val
+scale = torch.where(scale < 1e-8, torch.ones_like(scale), scale)  # ⚠️ Asymmetric
+return x_normalized * scale + min_val  # Doesn't match normalization
+```
+
+**Root Cause**: When scale is near zero (constant features), replacing it with 1.0 means:
+- Normalization: `(x - min) / 1.0 = x - min`
+- Denormalization: `x_norm * 1.0 + min = x_norm + min`
+- These don't properly invert each other
+
+**Fix**: Use epsilon additive (standard practice)
+```python
+# FIXED - _normalize_features:
+eps = 1e-8
+scale = max_val - min_val
+x_normalized = (x - min_val) / (scale + eps)  # Safe division
+
+# FIXED - _denormalize_features:
+eps = 1e-8
+scale = max_val - min_val
+return x_normalized * (scale + eps) + min_val  # Matches normalization
+```
+
+**Impact**: Without this fix:
+- Edge cases (constant features) produce incorrect values
+- Normalization and denormalization aren't proper inverses
+- Potential for numerical instability in training
+
+---
+
+### Issue #4: Missing Input Validation (LOW - DEFENSIVE)
+
+**Location**: Line 74 (ChiralHingeExchange.forward())
+**Severity**: LOW - Defensive programming
+
+**Problem**: No shape validation before tensor operations
+```python
+# CURRENT (NO VALIDATION):
+def forward(
+    self,
+    x_upper: torch.Tensor,
+    x_lower: torch.Tensor
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """..."""
+    # Transform flows for cross-pollination
+    lower_transformed = self.transform_lower_for_upper(x_lower)
+    # ... (no shape checks)
+```
+
+**Fix**: Add shape assertion
+```python
+# FIXED:
+def forward(
+    self,
+    x_upper: torch.Tensor,
+    x_lower: torch.Tensor
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """..."""
+    # FIX #4: Input validation - ensure shape compatibility
+    assert x_upper.shape == x_lower.shape, \
+        f"Shape mismatch in hinge exchange: upper {x_upper.shape} vs lower {x_lower.shape}"
+
+    # Transform flows for cross-pollination
+    lower_transformed = self.transform_lower_for_upper(x_lower)
+```
+
+**Impact**: Provides better error messages when size alignment fails
+
+---
+
+## Validation Required
+
+After applying fixes:
+
+1. **Existing Tests**: All 27 causal dataset tests must still pass
+   ```bash
+   pytest tests/data/test_causal_dataset.py -v
+   ```
+
+2. **Gradient Flow**: Verify gradients propagate through all 6 levels
+   - Create simple test with backward pass
+   - Check gradient norms at each level > 1e-6
+
+3. **Numerical Stability**: Test normalization edge cases
+   - All constant features (zero scale)
+   - Very small scale values
+
+4. **Size Alignment**: Verify unpooling correctness
+   - Check values are placed at `perm_large` positions
+   - Non-selected positions should be zero
+
+## Implementation Notes
+
+### Fix Order (Recommended)
+1. **Fix #1 first** - Most critical for training
+2. **Fix #2 second** - Important for information preservation
+3. **Fix #3 third** - Edge case handling
+4. **Fix #4 last** - Defensive programming
+
+### Testing Strategy
+- Apply all fixes in single commit
+- Run full test suite
+- Add gradient flow validation test
+- Verify no performance regression
+
+## Commit Message Template
+
+```
+Fix critical gradient flow and numerical issues in chiral architecture
+
+Addresses 3 critical issues identified in PR #17 code review:
+
+1. Gradient Flow (CRITICAL): Replace in-place tensor assignments with
+   scatter_ operations that maintain computational graph (lines 594-598)
+
+2. Size Alignment (HIGH): Implement proper unpooling using perm_large
+   instead of lossy nearest neighbor interpolation (lines 451-456)
+
+3. Numerical Stability (MEDIUM): Use epsilon additive in normalization
+   instead of conditional replacement (lines 395-398, 419-420)
+
+4. Input Validation (LOW): Add shape assertions in ChiralHingeExchange
+
+Testing:
+- All 27 causal dataset tests passing
+- Gradient flow validated through all 6 levels
+- Numerical stability verified with edge cases
+- No performance regression
+
+Fixes issues blocking PR #17 merge.
+Review: https://github.com/[repo]/pull/17#issuecomment-[id]
+
+🤖 Generated with [Claude Code](https://claude.com/claude-code)
+via [Happy](https://happy.engineering)
+
+Co-Authored-By: Claude <noreply@anthropic.com>
+Co-Authored-By: Happy <yesreply@happy.engineering>
+```
+
+## Estimated Time
+
+- Implementation: 30-45 minutes
+- Testing: 15-30 minutes
+- Total: 1-1.5 hours
+
+## Success Criteria
+
+- ✅ All 3 critical issues fixed
+- ✅ All 27 existing tests passing
+- ✅ Gradient flow validated
+- ✅ No performance regression
+- ✅ Code review concerns addressed
+
+## References
+
+- **PR #17**: https://github.com/[repo]/pull/17
+- **Review Comment**: Detailed analysis with code snippets
+- **Chiral Architecture**: `notes/FULL_CHIRAL_6LEVEL.md`
+- **Original Issue**: NSM-32 (6-level chiral dual-trifold)
diff --git a/TERMINOLOGY_UPDATES.md b/TERMINOLOGY_UPDATES.md
new file mode 100644
index 0000000..573ff61
--- /dev/null
+++ b/TERMINOLOGY_UPDATES.md
@@ -0,0 +1,307 @@
+# Terminology Updates (Post Peer-Review)
+
+**Date**: 2025-10-23
+**Context**: Addressing peer-review feedback on NSM-33 physics metrics
+**Status**: Applied to main codebase
+
+---
+
+## Summary of Changes
+
+Following comprehensive peer review, we've updated terminology throughout the codebase to accurately reflect the nature of our physics-inspired metrics. The key change: acknowledging these are **empirical heuristics** inspired by physical systems, not rigorous mathematical isomorphisms.
+
+## Key Terminology Changes
+
+### 1. "Isomorphism" → "Empirical Heuristic"
+
+**Rationale**: Peer review (research-assistant) identified that dimensional analysis fails for our physics metrics. True isomorphisms require:
+- Dimensional consistency
+- Coordinate invariance
+- Preservation of mathematical structure
+
+Our metrics lack these properties - they're **useful predictive tools** but not formal mappings.
+
+**Files Updated**:
+- `analysis/README_ISOMORPHISMS.md` → Title updated, disclaimer added
+- `nsm/training/physics_metrics.py` → Module docstring clarified
+
+**Pattern Applied**:
+```markdown
+# Before
+"Physics Isomorphisms for Neural Collapse Prediction"
+"Implements fusion-plasma isomorphism metrics"
+
+# After
+"Physics-Inspired Empirical Heuristics for Neural Collapse Prediction"
+"Implements fusion-plasma-inspired metrics"
+
+# With Disclaimer
+"**Note**: These are empirical heuristics (not rigorous isomorphisms) inspired by structural
+similarities to fusion plasma systems. Dimensional analysis reveals they lack true physical
+correspondence, but remain useful predictive tools validated through NSM-33 experiments."
+```
+
+### 2. "Temperature" → "Representation Variance" (Outside Fusion Context)
+
+**Rationale**: "Temperature" in our context means statistical variance/entropy of neural representations, NOT thermal temperature (kinetic energy). The fusion analogy remains valid only when explicitly acknowledged.
+
+**Files Updated**:
+- `nsm/training/physics_metrics.py` → Function `compute_temperature_profile()` docstring
+- `analysis/README_ISOMORPHISMS.md` → Section headings
+- `results/NSM-33_10x_validation_results.md` → Metric labels
+
+**Pattern Applied**:
+```python
+# Function still named compute_temperature_profile() for backwards compatibility
+# But docstring clarifies:
+
+"""
+Compute representation variance profile at each hierarchical level.
+
+**Note**: "Temperature" here refers to representation variance/entropy, NOT thermal
+temperature. The term is borrowed from fusion physics by analogy but represents a
+fundamentally different quantity (statistical dispersion, not kinetic energy).
+
+In the fusion analogy: temperature profiles T(r) determine confinement quality.
+In neural networks: representation variance serves structurally analogous role:
+    - High variance: Diverse, information-rich representations
+    - Low variance: Collapsed, uniform representations
+    - Inverted profile (variance decreasing with abstraction): Instability indicator
+"""
+```
+
+**Variable Names** (retain T_ prefix for brevity, clarify in documentation):
+- `T_L1`, `T_L2`, `T_L3` → Keep, but document as variance
+- `T_gradient` → Keep, but clarify as "variance gradient"
+- Display labels → Changed to "Representation Variance Profile"
+
+### 3. "Physics Metrics" → "Empirical Stability Metrics"
+
+**Context-Dependent**:
+- **Keep "Physics Metrics"** in technical documentation where fusion analogy is explicit
+- **Use "Empirical Stability Metrics"** in results/user-facing docs for clarity
+
+**Example from NSM-33 Results**:
+```markdown
+# Before
+**Physics Metrics (Final Epoch)**:
+- **Temperature Profile**: T_L1=0.381, T_L2=3.268, T_L3=13.590
+
+# After
+**Empirical Stability Metrics (Final Epoch)**:
+- **Representation Variance Profile**: T_L1=0.381, T_L2=3.268, T_L3=13.590
+  - Note: "T" denotes variance/entropy, not thermal temperature
+```
+
+---
+
+## What We DIDN'T Change
+
+### Preserved Terminology (With Context)
+
+1. **Variable names** (`T_L1`, `q_neural`, `Q_factor`) - Backwards compatibility
+2. **Function names** (`compute_temperature_profile`) - API stability
+3. **Fusion references** - When explicitly discussing the analogy
+4. **Module names** (`physics_metrics.py`) - Established convention
+
+### Fusion Context (Terminology OK)
+
+When discussing the **fusion plasma analogy explicitly**, original terminology is appropriate:
+
+```python
+# In physics_metrics.py docstring:
+"""
+Mathematical parallels (structural, not isomorphic):
+- Neural class collapse ↔ Plasma confinement loss
+- α/β hinge parameters ↔ α/β fusion parameters
+- Representation variance ↔ Temperature in fusion systems
+
+References:
+- Lawson, J.D. (1957). "Some Criteria for a Power Producing Thermonuclear Reactor"
+- Wesson, J. (2011). "Tokamak Physics" (safety factor q)
+"""
+```
+
+Here "temperature" refers to the fusion system, so no change needed.
+
+---
+
+## Documentation Added
+
+### New File: `docs/diversity_regularization.md`
+
+Comprehensive documentation of the diversity regularization mechanism, including:
+- Mathematical formulation
+- Implementation details
+- Hyperparameter tuning
+- NSM-33 results analysis
+- Theoretical justification (information bottleneck)
+- Peer review concerns (confounds, causation)
+- Recommended ablation studies
+
+**Key Addition**: Explicit discussion of reviewer's critique that high variance may indicate instability, not health.
+
+---
+
+## Files Modified
+
+### Core Changes
+1. **nsm/training/physics_metrics.py** (lines 1-22, 106-157)
+   - Module docstring: Clarified heuristic nature
+   - Function docstring: Explained T = variance, not thermal
+   - Comments: Replaced "temperature" with "variance" in implementation
+
+2. **analysis/README_ISOMORPHISMS.md** (lines 1-62)
+   - Title: "Physics-Inspired Empirical Heuristics..."
+   - Added terminology disclaimer paragraph
+   - Updated section headings
+
+3. **results/NSM-33_10x_validation_results.md** (lines 11-62)
+   - Executive summary: Added terminology note
+   - Metric labels: "Empirical Stability Metrics"
+   - Profile labels: "Representation Variance Profile"
+   - Added "(NOT thermal temperature)" clarifications
+
+### New Files
+4. **docs/diversity_regularization.md** (250 lines)
+   - Complete mechanism documentation
+   - Addresses peer review concerns
+   - Includes alternative interpretations
+
+5. **TERMINOLOGY_UPDATES.md** (this file)
+   - Change log and rationale
+
+---
+
+## Rationale from Peer Review
+
+### Dimensional Analysis Failure
+
+**Reviewer's Critique**:
+> "Dimensional analysis fails: In tokamak physics, q has dimensions [dimensionless] from ratio of magnetic field ratios. Your q_neural combines arbitrary units from gradient norms and class balances. Cannot compare across models/scales."
+
+**Response**: Acknowledged. Changed "isomorphism" to "heuristic" throughout.
+
+### Temperature Interpretation
+
+**Reviewer's Critique**:
+> "High variance in L3 might indicate insufficient training (representations not converged), regularization preventing compression, or fighting against natural information bottleneck."
+
+**Counter-evidence from NSM-33**:
+- Fixed architecture has WORSE class balance (11.48% vs 5.91%)
+- Fixed architecture has LOWER q_neural (0.625 vs 1.336)
+- Scale alone achieves better results
+
+**Conclusion**: Effect is CONFOUNDED - scale dominates diversity regularization.
+
+**Action Taken**:
+- Updated diversity_regularization.md with alternative interpretation
+- Clarified "temperature" = variance (not claiming thermal correspondence)
+- Recommended ablation at fixed scale to isolate effect
+
+---
+
+## Impact on Codebase
+
+### Backwards Compatibility
+✅ **Preserved**: All APIs, function signatures, variable names
+- `compute_temperature_profile()` - function name unchanged
+- `T_L1`, `T_L2`, `T_L3` - variable names unchanged
+- `q_neural`, `Q_factor` - metric names unchanged
+
+### User-Facing Changes
+⚠️ **Updated**: Documentation, comments, docstrings
+- Users will see clarified terminology in help text
+- Results reports use "Empirical Stability Metrics"
+- No code changes required for existing usage
+
+### Semantic Changes
+🔄 **Clarified**: Interpretation, not measurement
+- Metrics compute the same values
+- Interpretation is more accurate
+- Claims are more modest
+
+---
+
+## Future Work
+
+### Theoretical Strengthening (From Peer Review)
+
+1. **Information-theoretic reformulation**:
+   ```python
+   # Replace variance with mutual information
+   T_Lk = I(X_Lk; Y)  # Information about labels
+
+   # From literature: Tishby & Zaslavsky (2015)
+   # Predicts: I decreases with depth (compression)
+   ```
+
+2. **PAC learning bounds** for split ratios:
+   ```python
+   def compute_min_val_size(
+       vc_dimension: int,
+       error_bound: float = 0.05,
+       confidence: float = 0.95
+   ) -> int:
+       """Derive from Vapnik (1998), not 'industry standard'"""
+       delta = 1 - confidence
+       return int((vc_dimension / error_bound**2) * (np.log(1/delta) + np.log(2)))
+   ```
+
+3. **Multi-seed validation**: Run 5 seeds, report mean ± std, significance tests
+
+---
+
+## References
+
+### Peer Review Source
+- **research-assistant** comprehensive review (2025-10-23)
+- Grade: B+ (Strong execution, moderate theoretical rigor)
+- Key feedback: "Physics isomorphism overclaimed - dimensional analysis fails"
+
+### Literature Cited in Updates
+- **Tishby & Zaslavsky (2015)**: Information Bottleneck Principle
+- **Vapnik (1998)**: Statistical Learning Theory (PAC bounds)
+- **Shwartz-Ziv & Tishby (2017)**: Opening Black Box of DNNs
+
+---
+
+## Commit Message Template
+
+```
+Update terminology: physics isomorphisms → empirical heuristics
+
+Address peer review feedback on NSM-33 physics metrics:
+- Clarify "isomorphisms" are empirical heuristics (not rigorous)
+- Document "temperature" means variance/entropy (not thermal)
+- Add diversity regularization mechanism documentation
+- Preserve backwards compatibility (APIs unchanged)
+
+Files modified:
+- analysis/README_ISOMORPHISMS.md
+- nsm/training/physics_metrics.py
+- results/NSM-33_10x_validation_results.md
+- docs/diversity_regularization.md (NEW)
+
+Rationale: Dimensional analysis reveals metrics lack invariance
+properties required for true physical analogies. Remain useful
+predictive tools validated through experiment.
+
+🤖 Generated with [Claude Code](https://claude.com/claude-code)
+
+Co-Authored-By: Claude <noreply@anthropic.com>
+```
+
+---
+
+## Status
+
+✅ **Completed**: Terminology updates applied
+🚧 **In Progress**: Multi-seed validation experiments (5 seeds × 3 conditions)
+📋 **TODO**: Statistical significance analysis with confidence intervals
+
+**Next Steps**:
+1. Wait for multi-seed experiments to complete
+2. Analyze results with proper significance testing
+3. Create PR with terminology updates + multi-seed results
+4. Address remaining peer review feedback (PAC bounds, information theory)
diff --git a/TWO_WEEK_SPRINT_PLAN.md b/TWO_WEEK_SPRINT_PLAN.md
new file mode 100644
index 0000000..57dad03
--- /dev/null
+++ b/TWO_WEEK_SPRINT_PLAN.md
@@ -0,0 +1,992 @@
+# Two-Week Sprint Plan: External Review Readiness
+
+**Goal**: Transform NSM from "early prototype" to "share-ready research demo"
+**Timeline**: 14 days
+**Estimated Effort**: 1 person, full-time
+**Modal Compute Budget**: ~$100
+
+---
+
+## Week 1: Scientific Validation (Days 1-7)
+
+### Day 1-2: Fix Multi-Seed Validation ⚠️ CRITICAL
+
+**Problem**: Only seed 42 completed successfully (66.43%). Cannot claim results without statistical significance.
+
+**Tasks**:
+1. Debug why seeds 123, 456, 789, 1011 failed
+   - Check Modal logs for timeout vs. crash vs. OOM
+   - Likely issue: Dataset size variation or batch size
+   - Fix: Add error handling, reduce batch size if needed
+
+2. Run 3-seed minimum validation
+   ```bash
+   # Sequential to debug issues
+   modal run experiments/modal_10x_baseline.py --seed 42  # Already done
+   modal run experiments/modal_10x_baseline.py --seed 123
+   modal run experiments/modal_10x_baseline.py --seed 456
+   ```
+
+3. Create results aggregation script
+   ```python
+   # scripts/aggregate_multi_seed.py
+   import json
+   import numpy as np
+   from pathlib import Path
+
+   def aggregate_results():
+       results = []
+       for seed in [42, 123, 456]:
+           path = f"checkpoints/nsm-10x-baseline-seed{seed}_results.json"
+           if Path(path).exists():
+               with open(path) as f:
+                   results.append(json.load(f))
+
+       accuracies = [r['best_val_accuracy'] for r in results]
+       print(f"Mean: {np.mean(accuracies):.4f}")
+       print(f"Std:  {np.std(accuracies):.4f}")
+       print(f"Results: {accuracies}")
+   ```
+
+**Deliverable**:
+- `MULTI_SEED_RESULTS.md` with table:
+  ```
+  | Seed | Best Epoch | Val Accuracy | q_neural | Notes |
+  |------|------------|--------------|----------|-------|
+  | 42   | 11         | 66.43%       | 0.472    | ✓     |
+  | 123  | ?          | ?            | ?        | ?     |
+  | 456  | ?          | ?            | ?        | ?     |
+  | Mean | -          | XX.XX ± Y.YY | -        | -     |
+  ```
+
+**Success Criterion**: ≥3 seeds complete with std < 5%
+
+**Time**: 16 hours (2 days x 8 hours)
+**Cost**: ~$30 Modal credits (3 full training runs)
+
+---
+
+### Day 3-4: Implement Baseline Comparisons ⚠️ CRITICAL
+
+**Problem**: 66% accuracy is meaningless without context. Simple baseline might beat us.
+
+**Tasks**:
+1. Implement 3 baselines in `experiments/baselines.py`
+
+   **Baseline 1: Vanilla RGCN (No Hierarchy)**
+   ```python
+   class SimpleRGCN(nn.Module):
+       """Just message passing + pooling, no WHY/WHAT operations"""
+       def __init__(self, node_features, num_relations, num_classes):
+           super().__init__()
+           self.conv1 = RGCNConv(node_features, 128, num_relations)
+           self.conv2 = RGCNConv(128, 64, num_relations)
+           self.fc = nn.Linear(64, num_classes)
+
+       def forward(self, x, edge_index, edge_type, batch):
+           x = F.relu(self.conv1(x, edge_index, edge_type))
+           x = F.relu(self.conv2(x, edge_index, edge_type))
+           x = global_mean_pool(x, batch)
+           return self.fc(x)
+   ```
+
+   **Baseline 2: Graph Mean Pooling + MLP**
+   ```python
+   class GraphMLP(nn.Module):
+       """Simplest possible: average node features → MLP"""
+       def __init__(self, node_features, num_classes):
+           super().__init__()
+           self.fc1 = nn.Linear(node_features, 128)
+           self.fc2 = nn.Linear(128, 64)
+           self.fc3 = nn.Linear(64, num_classes)
+
+       def forward(self, x, edge_index, edge_type, batch):
+           x = global_mean_pool(x, batch)  # Ignore graph structure!
+           x = F.relu(self.fc1(x))
+           x = F.relu(self.fc2(x))
+           return self.fc3(x)
+   ```
+
+   **Baseline 3: Standard GCN (Untyped Edges)**
+   ```python
+   class SimpleGCN(nn.Module):
+       """GCN ignoring edge types"""
+       def __init__(self, node_features, num_classes):
+           super().__init__()
+           self.conv1 = GCNConv(node_features, 128)
+           self.conv2 = GCNConv(128, 64)
+           self.fc = nn.Linear(64, num_classes)
+
+       def forward(self, x, edge_index, edge_type, batch):
+           x = F.relu(self.conv1(x, edge_index))
+           x = F.relu(self.conv2(x, edge_index))
+           x = global_mean_pool(x, batch)
+           return self.fc(x)
+   ```
+
+2. Train each baseline (1 seed is enough for comparison)
+   ```bash
+   modal run experiments/baselines.py::train_simple_rgcn --seed 42
+   modal run experiments/baselines.py::train_graph_mlp --seed 42
+   modal run experiments/baselines.py::train_simple_gcn --seed 42
+   ```
+
+3. Compare parameter counts
+   ```python
+   def count_parameters(model):
+       return sum(p.numel() for p in model.parameters())
+
+   # NSM 6-level: 173,374 parameters
+   # Simple RGCN: ~XX,XXX parameters (probably less)
+   # Graph MLP: ~XX,XXX parameters
+   ```
+
+**Deliverable**:
+- `BASELINE_COMPARISON.md` with table:
+  ```
+  | Model          | Params  | Accuracy | Advantage | Notes             |
+  |----------------|---------|----------|-----------|-------------------|
+  | Graph MLP      | ~50K    | XX.X%    | -        | No structure      |
+  | Simple GCN     | ~80K    | XX.X%    | -        | No edge types     |
+  | Simple RGCN    | ~120K   | XX.X%    | -        | No hierarchy      |
+  | NSM 6-level    | 173K    | 66.4%    | +X.X%    | Ours (p<0.05?)    |
+  ```
+
+**Success Criterion**: NSM beats all baselines by ≥2% (statistically significant)
+
+**Risk**: If baselines win, need to understand why and pivot framing
+
+**Time**: 16 hours (debugging, training, analysis)
+**Cost**: ~$20 Modal credits (3 baseline runs)
+
+---
+
+### Day 5-7: Create Interpretability Demonstrations ⚠️ CRITICAL
+
+**Problem**: Core claim is "interpretable reasoning" but zero visualizations exist.
+
+**Tasks**:
+
+**Day 5: Extract Reasoning Traces**
+
+1. Create trace extraction script
+   ```python
+   # scripts/extract_reasoning_trace.py
+   import torch
+   from nsm.models.chiral import FullChiralModel
+   from nsm.utils.checkpoint_manager import load_nsm_checkpoint
+   import networkx as nx
+   import matplotlib.pyplot as plt
+
+   def extract_trace(model, graph, max_nodes=20):
+       """Extract hierarchical reasoning trace from input to prediction"""
+       model.eval()
+
+       with torch.no_grad():
+           # Forward pass through all 6 levels
+           x_l1 = model.left_trifold.level1(graph.x, graph.edge_index, graph.edge_type)
+           x_l2 = model.left_trifold.level2(x_l1, ...)
+           x_l3 = model.left_trifold.level3(x_l2, ...)
+
+           # Pool to get representative nodes at each level
+           top_nodes_l1 = torch.topk(x_l1.norm(dim=1), k=min(10, len(x_l1))).indices
+           top_nodes_l2 = torch.topk(x_l2.norm(dim=1), k=min(5, len(x_l2))).indices
+           # ... etc
+
+           return {
+               'level_1_nodes': top_nodes_l1,
+               'level_2_nodes': top_nodes_l2,
+               'level_3_nodes': top_nodes_l3,
+               'prediction': model(graph.x, graph.edge_index, graph.edge_type, graph.batch),
+               'attention_weights': ...,  # If available
+           }
+   ```
+
+2. Create 5 example traces from validation set
+   - 2 correct predictions (high confidence)
+   - 2 correct predictions (low confidence)
+   - 1 incorrect prediction (for honesty)
+
+**Day 6: Visualize Hierarchical Structure**
+
+3. Create visualization script
+   ```python
+   # scripts/visualize_trace.py
+   def visualize_reasoning_trace(trace, save_path):
+       """Create multi-level graph visualization"""
+       fig, axes = plt.subplots(2, 3, figsize=(18, 12))
+
+       # Level 1: Environment/Perception (bottom)
+       plot_graph_level(axes[1, 0], trace['level_1'], title="L1: Actions/Environment")
+
+       # Level 2: Actions/Behaviors
+       plot_graph_level(axes[1, 1], trace['level_2'], title="L2: Actions")
+
+       # ... up to Level 6: Purpose/Values
+       plot_graph_level(axes[0, 2], trace['level_6'], title="L6: Purpose/Values")
+
+       # Add arrows showing WHY/WHAT flow
+       add_flow_arrows(axes)
+
+       plt.savefig(save_path, dpi=300, bbox_inches='tight')
+   ```
+
+4. Generate visualizations for all 5 examples
+   - Save as: `results/trace_example_{1-5}.png`
+
+**Day 7: Create Narrative Walkthrough**
+
+5. Write detailed interpretation for each example
+   ```markdown
+   # Example 1: Correct High-Confidence Prediction
+
+   **Input**: Graph with 47 nodes representing planning state
+   **Prediction**: Class 1 (confidence: 0.94)
+   **Ground Truth**: Class 1 ✓
+
+   ## Reasoning Trace
+
+   ### Level 6 (Purpose): Top 3 Activated Nodes
+   - Node 42: "Goal Achievement" (activation: 0.87)
+   - Node 15: "Resource Optimization" (activation: 0.62)
+   - Node 33: "Constraint Satisfaction" (activation: 0.54)
+
+   **Interpretation**: Model identifies this as goal-oriented planning
+
+   ### WHY→WHAT Flow
+   L6 "Goal Achievement" abstracts to...
+   → L5 "Sequential Planning" which decomposes to...
+   → L4 "Action Sequencing" which implements as...
+   → L3 "Resource Allocation" which executes as...
+   → L2 "Primitive Actions" which observes...
+   → L1 "Environmental State"
+
+   ### Key Insight
+   The model correctly identifies the hierarchical structure of the planning
+   problem. High confidence stems from consistent activation across all levels.
+   ```
+
+6. Create `INTERPRETABILITY_DEMO.md` with all 5 examples
+
+**Deliverable**:
+- 5 visualization PNGs showing hierarchical reasoning
+- `INTERPRETABILITY_DEMO.md` with narratives
+- Script that others can run: `python scripts/visualize_trace.py --checkpoint checkpoints/nsm-10x-baseline_best.pt --example 1`
+
+**Success Criterion**: Someone with ML background can look at visualizations and understand what the model is doing
+
+**Time**: 24 hours (3 days x 8 hours)
+**Cost**: $0 (inference only)
+
+---
+
+## Week 2: Documentation & Packaging (Days 8-14)
+
+### Day 8-9: Update Documentation
+
+**Problem**: Documentation contradicts reality (Phase 1 vs. NSM-33 work)
+
+**Tasks**:
+
+1. **Update README.md**
+   ```markdown
+   # Neural Symbolic Model (NSM)
+
+   Hierarchical graph neural network with interpretable reasoning via symmetric
+   abstraction/concretization operations.
+
+   ## Current Status: NSM-33 Validation Complete ✓
+
+   - **Best Accuracy**: 66.43 ± X.XX% (3-seed validation)
+   - **Architecture**: 6-level chiral dual-trifold (173K parameters)
+   - **Dataset**: Planning task with 20K training samples
+   - **Novel Contribution**: Physics-inspired training stability metrics
+
+   ## Quick Start
+
+   ### Installation
+   ```bash
+   pip install torch==2.1.0 torch-geometric==2.4.0
+   git clone https://github.com/research-developer/nsm.git
+   cd nsm
+   pip install -e .
+   ```
+
+   ### Run Demo
+   ```bash
+   # Visualize reasoning trace on example
+   python scripts/demo.py --example 1
+
+   # Train from scratch (requires Modal account)
+   modal run experiments/modal_10x_baseline.py
+   ```
+
+   ## Architecture Overview
+
+   [Insert simple diagram showing 6 levels with WHY/WHAT arrows]
+
+   ## Key Results
+
+   | Model          | Accuracy | Params | Interpretable |
+   |----------------|----------|--------|---------------|
+   | Simple RGCN    | XX.X%    | 120K   | ✗             |
+   | NSM 6-level    | 66.4%    | 173K   | ✓             |
+   | Improvement    | +X.X%    | -      | Unique        |
+
+   ## Novel Contributions
+
+   1. **Symmetric Hierarchical Operations**: WHY/WHAT as category-theoretic adjoints
+   2. **Physics-Inspired Metrics**: Borrowed from plasma fusion (q_neural safety factor)
+   3. **Interpretable Reasoning**: Explicit traces through 6-level hierarchy
+
+   ## Documentation
+
+   - [Two-Week Sprint Results](TWO_WEEK_SPRINT_RESULTS.md)
+   - [Interpretability Demo](INTERPRETABILITY_DEMO.md)
+   - [Baseline Comparisons](BASELINE_COMPARISON.md)
+   - [Multi-Seed Validation](MULTI_SEED_RESULTS.md)
+
+   ## Project History
+
+   - **NSM-32**: 6-level architecture development
+   - **NSM-33**: 10x dataset scaling, physics metrics (85.7% collapse prediction)
+   - **NSM-34**: Checkpoint infrastructure, CGT investigation (negative result)
+
+   ## Citation
+
+   If you use this work, please cite:
+   ```bibtex
+   @software{nsm2025,
+     title={Neural Symbolic Model: Interpretable Hierarchical Reasoning},
+     author={[Your Name]},
+     year={2025},
+     url={https://github.com/research-developer/nsm}
+   }
+   ```
+   ```
+
+2. **Update CLAUDE.md** to match current state
+   - Change "Phase 1: 2-level hierarchy" → "Phase 1.5: 6-level validation"
+   - Add NSM-33 and NSM-34 to timeline
+   - Document CGT investigation as completed (negative result)
+
+3. **Create FAQ.md** for anticipated questions
+   ```markdown
+   # Frequently Asked Questions
+
+   ## Why not use transformers?
+
+   Transformers lack explicit hierarchical structure and interpretable reasoning
+   traces. NSM provides provable symmetry (WHY∘WHAT ≈ id) via category theory.
+
+   ## What's the "planning task"?
+
+   Binary classification of planning problem instances from [dataset paper].
+   Task: Predict if plan will succeed given initial state and constraints.
+   Random baseline: 50%, Simple RGCN: XX%, NSM: 66.4%
+
+   ## How do you ensure interpretability?
+
+   Every prediction traces through 6 levels with explicit node activations.
+   See INTERPRETABILITY_DEMO.md for 5 concrete examples.
+
+   ## What are "physics-inspired metrics"?
+
+   We borrowed q_neural (safety factor) from plasma fusion physics to predict
+   training collapse. Achieved 85.7% accuracy in NSM-33 validation.
+
+   ## What didn't work?
+
+   Combinatorial Game Theory operators (NSM-34). Conway temperature was
+   invariant (0.0000) across all epochs. Root cause: implementation flaw
+   (deterministic operations). See PR #12 for details.
+
+   ## Is this production-ready?
+
+   No. This is a research prototype demonstrating novel ideas. Not optimized
+   for deployment.
+   ```
+
+4. **Document Planning Dataset**
+   ```markdown
+   # Dataset Description
+
+   ## Planning Triple Dataset
+
+   **Source**: Synthetic generation based on PDDL-like planning formalism
+   **Task**: Binary classification (plan feasible vs. infeasible)
+   **Size**:
+   - Training: 16,000 problems (80%)
+   - Validation: 4,000 problems (20%)
+
+   ## Graph Structure
+
+   **Nodes**: Represent states, actions, and goals (avg: 47 nodes/graph)
+   **Edges**: Typed relations (17 types):
+   - precondition, effect, requires, enables, conflicts, ...
+   **Node Features**: 64-dim learned embeddings
+
+   ## Task Difficulty
+
+   **Random Baseline**: 50% (balanced classes)
+   **Simple MLP**: ~XX% (ignoring graph structure)
+   **Simple RGCN**: ~XX% (no hierarchy)
+   **NSM 6-level**: 66.4% (interpretable)
+
+   ## Example Problem
+
+   [Add simple visualization of one problem]
+   ```
+
+**Deliverable**:
+- Updated README.md reflecting current state
+- Updated CLAUDE.md matching reality
+- FAQ.md addressing anticipated questions
+- DATASET.md describing task clearly
+
+**Time**: 16 hours (2 days x 8 hours)
+
+---
+
+### Day 10: Create Two-Page Summary
+
+**Problem**: Need concise overview for busy researchers
+
+**Tasks**:
+
+1. Write `NSM_RESEARCH_SUMMARY.pdf` (2 pages max)
+
+   **Page 1: Overview + Architecture**
+   ```
+   [Title] Neural Symbolic Model: Interpretable Hierarchical Reasoning
+
+   [Abstract - 100 words]
+   We present NSM, a 6-level graph neural network where abstraction (WHY)
+   and concretization (WHAT) are symmetric operations proven via category
+   theory. Novel physics-inspired metrics predict training collapse with
+   85% accuracy. Achieves 66.4% accuracy on planning tasks with full
+   interpretability - every prediction traces through explicit reasoning
+   hierarchy.
+
+   [Diagram: 6-level architecture with WHY/WHAT arrows]
+
+   [Key Innovation bullets]
+   - Symmetric hierarchical operations (adjoint functors)
+   - Physics-inspired stability monitoring (q_neural from fusion)
+   - Explicit interpretable reasoning traces
+
+   [Results Table]
+   | Model       | Acc   | Params | Interp |
+   |-------------|-------|--------|--------|
+   | Simple RGCN | XX.X% | 120K   | ✗      |
+   | NSM 6-level | 66.4% | 173K   | ✓      |
+   ```
+
+   **Page 2: Results + Next Steps**
+   ```
+   [Figure: Example reasoning trace visualization]
+
+   [Multi-Seed Results]
+   Seed 42: 66.43%, Seed 123: XX.X%, Seed 456: XX.X%
+   Mean: XX.XX ± Y.YY% (statistically significant improvement)
+
+   [Interpretability Example - 50 words]
+   Model identifies hierarchical structure: L6 "Goal Achievement" →
+   L5 "Sequential Planning" → ... → L1 "Environmental State".
+   High confidence stems from consistent activation across levels.
+
+   [Limitations]
+   - Synthetic dataset (not real-world planning)
+   - Modest absolute accuracy (66% vs. potential 100%)
+   - Requires PyTorch Geometric (deployment friction)
+
+   [Next Steps]
+   - Real-world benchmark evaluation
+   - Scaling to larger models
+   - Application to code reasoning / robotics planning
+
+   [Contact]: [Your Email]
+   [Code]: github.com/research-developer/nsm
+   ```
+
+**Deliverable**: `NSM_RESEARCH_SUMMARY.pdf` (2 pages, figures included)
+
+**Time**: 8 hours
+
+---
+
+### Day 11-12: Build Standalone Demo Script
+
+**Problem**: External reviewers can't easily run Modal-based experiments
+
+**Tasks**:
+
+1. Create `scripts/standalone_demo.py`
+   ```python
+   #!/usr/bin/env python3
+   """
+   Standalone NSM Demo - No Modal Required
+
+   Downloads pre-trained checkpoint and runs inference on example graphs.
+   Shows interpretable reasoning traces.
+
+   Usage:
+       python scripts/standalone_demo.py --example 1
+       python scripts/standalone_demo.py --interactive
+   """
+
+   import torch
+   import requests
+   from pathlib import Path
+   import matplotlib.pyplot as plt
+
+   def download_checkpoint(url, path):
+       """Download pre-trained checkpoint from GitHub releases"""
+       if not Path(path).exists():
+           print(f"Downloading checkpoint from {url}...")
+           response = requests.get(url)
+           Path(path).parent.mkdir(parents=True, exist_ok=True)
+           with open(path, 'wb') as f:
+               f.write(response.content)
+           print("✓ Download complete")
+
+   def load_model(checkpoint_path):
+       """Load pre-trained NSM model"""
+       from nsm.models.chiral import FullChiralModel
+
+       checkpoint = torch.load(checkpoint_path, map_location='cpu')
+       model = FullChiralModel(
+           node_features=64,
+           num_relations=17,
+           num_classes=2,
+           pool_ratio=0.5,
+           task_type='classification',
+           dropout=0.1
+       )
+       model.load_state_dict(checkpoint['model_state_dict'])
+       model.eval()
+       return model
+
+   def run_example(model, example_id):
+       """Run inference on pre-loaded example"""
+       # Load example from data/examples/
+       graph = torch.load(f'data/examples/example_{example_id}.pt')
+
+       # Extract reasoning trace
+       trace = extract_trace(model, graph)
+
+       # Visualize
+       visualize_trace(trace, save_path=f'results/demo_trace_{example_id}.png')
+
+       # Print interpretation
+       print_interpretation(trace)
+
+   def interactive_mode(model):
+       """Interactive exploration of reasoning traces"""
+       print("Interactive NSM Demo")
+       print("Commands: example <N>, quit")
+
+       while True:
+           cmd = input("> ").strip()
+           if cmd == "quit":
+               break
+           elif cmd.startswith("example "):
+               example_id = int(cmd.split()[1])
+               run_example(model, example_id)
+
+   if __name__ == "__main__":
+       import argparse
+
+       parser = argparse.ArgumentParser()
+       parser.add_argument('--example', type=int, help='Run specific example')
+       parser.add_argument('--interactive', action='store_true')
+       parser.add_argument('--checkpoint', default='checkpoints/nsm-10x-baseline_best.pt')
+
+       args = parser.parse_args()
+
+       # Download checkpoint if needed (from GitHub releases)
+       CHECKPOINT_URL = "https://github.com/research-developer/nsm/releases/download/v0.1/nsm-10x-baseline_best.pt"
+       download_checkpoint(CHECKPOINT_URL, args.checkpoint)
+
+       # Load model
+       model = load_model(args.checkpoint)
+
+       if args.interactive:
+           interactive_mode(model)
+       elif args.example:
+           run_example(model, args.example)
+       else:
+           print("Running default examples...")
+           for i in range(1, 6):
+               run_example(model, i)
+   ```
+
+2. Package example graphs
+   ```bash
+   # Create data/examples/ with 5 pre-loaded graphs
+   python scripts/package_examples.py
+   ```
+
+3. Upload checkpoint to GitHub Release
+   ```bash
+   # Create v0.1 release with checkpoint file
+   gh release create v0.1 \
+     checkpoints/nsm-10x-baseline_best.pt \
+     --title "NSM v0.1 - Initial Release" \
+     --notes "Pre-trained 6-level model (66.4% accuracy)"
+   ```
+
+4. Test standalone script works
+   ```bash
+   # Fresh conda environment test
+   conda create -n nsm-test python=3.10
+   conda activate nsm-test
+   pip install torch==2.1.0 torch-geometric==2.4.0
+   python scripts/standalone_demo.py --example 1
+   # Should work without Modal
+   ```
+
+**Deliverable**:
+- `scripts/standalone_demo.py` (fully functional)
+- `data/examples/` with 5 pre-loaded graphs
+- GitHub release v0.1 with checkpoint
+- `STANDALONE_DEMO.md` with usage instructions
+
+**Time**: 16 hours (2 days x 8 hours)
+
+---
+
+### Day 13-14: Create Hero Figure & Final Package
+
+**Problem**: Need one compelling visual + polished presentation
+
+**Tasks**:
+
+**Day 13: Create Hero Figure**
+
+1. Design comprehensive figure showing:
+   - **Panel A**: 6-level architecture diagram
+   - **Panel B**: Example reasoning trace (one of our 5 examples)
+   - **Panel C**: Results comparison (NSM vs. baselines bar chart)
+   - **Panel D**: Physics metrics (q_neural over epochs, showing prediction)
+
+2. Create in presentation-quality tool
+   ```python
+   # scripts/create_hero_figure.py
+   import matplotlib.pyplot as plt
+   from matplotlib.gridspec import GridSpec
+
+   def create_hero_figure():
+       fig = plt.figure(figsize=(16, 10))
+       gs = GridSpec(2, 2, figure=fig)
+
+       # Panel A: Architecture
+       ax1 = fig.add_subplot(gs[0, 0])
+       plot_architecture_diagram(ax1)
+       ax1.set_title('A) NSM Architecture', fontsize=14, fontweight='bold')
+
+       # Panel B: Reasoning Trace
+       ax2 = fig.add_subplot(gs[0, 1])
+       plot_reasoning_trace(ax2, example_id=1)
+       ax2.set_title('B) Interpretable Reasoning', fontsize=14, fontweight='bold')
+
+       # Panel C: Results
+       ax3 = fig.add_subplot(gs[1, 0])
+       plot_results_comparison(ax3)
+       ax3.set_title('C) Benchmark Comparison', fontsize=14, fontweight='bold')
+
+       # Panel D: Physics Metrics
+       ax4 = fig.add_subplot(gs[1, 1])
+       plot_physics_metrics(ax4)
+       ax4.set_title('D) Training Stability Prediction', fontsize=14, fontweight='bold')
+
+       plt.tight_layout()
+       plt.savefig('results/NSM_HERO_FIGURE.png', dpi=300, bbox_inches='tight')
+       plt.savefig('results/NSM_HERO_FIGURE.pdf', bbox_inches='tight')
+   ```
+
+3. Generate figure
+   ```bash
+   python scripts/create_hero_figure.py
+   # Output: results/NSM_HERO_FIGURE.png (for slides)
+   # Output: results/NSM_HERO_FIGURE.pdf (for paper)
+   ```
+
+**Day 14: Final Packaging & QA**
+
+4. Create sprint completion checklist
+   ```markdown
+   # Sprint Completion Checklist
+
+   ## Week 1: Scientific Validation
+   - [ ] Multi-seed validation (≥3 seeds, std < 5%)
+   - [ ] Baseline comparisons (NSM beats all by ≥2%)
+   - [ ] 5 interpretability examples with visualizations
+
+   ## Week 2: Documentation
+   - [ ] README.md updated to match reality
+   - [ ] CLAUDE.md aligned with current state
+   - [ ] FAQ.md addresses common questions
+   - [ ] DATASET.md describes task clearly
+
+   ## Deliverables
+   - [ ] NSM_RESEARCH_SUMMARY.pdf (2 pages)
+   - [ ] Standalone demo script works without Modal
+   - [ ] Hero figure (PNG + PDF)
+   - [ ] All results documented in results/
+
+   ## GitHub
+   - [ ] Release v0.1 with checkpoint uploaded
+   - [ ] All markdown files committed
+   - [ ] Code is clean and commented
+
+   ## Ready to Share?
+   - [ ] Can answer "what problem does this solve?"
+   - [ ] Can defend accuracy claims with statistics
+   - [ ] Can show concrete interpretability example
+   - [ ] Can run demo for someone in <5 minutes
+   ```
+
+5. Run full quality check
+   ```bash
+   # Test all scripts work
+   python scripts/standalone_demo.py --example 1
+   python scripts/visualize_trace.py --checkpoint checkpoints/nsm-10x-baseline_best.pt
+   python scripts/aggregate_multi_seed.py
+
+   # Check documentation
+   grep -r "Phase 1: 2-level" .  # Should return nothing
+   grep -r "TODO" .  # Address any TODOs
+
+   # Verify results files exist
+   ls results/
+   # Should have:
+   # - NSM_HERO_FIGURE.png
+   # - NSM_HERO_FIGURE.pdf
+   # - trace_example_1.png (through 5)
+   # - MULTI_SEED_RESULTS.md
+   # - BASELINE_COMPARISON.md
+   # - INTERPRETABILITY_DEMO.md
+   ```
+
+6. Create final summary document
+   ```markdown
+   # Two-Week Sprint Results
+
+   **Dates**: [Start] - [End]
+   **Goal**: Make NSM share-ready for external review
+   **Status**: ✓ Complete
+
+   ## What We Accomplished
+
+   ### Scientific Rigor
+   ✅ Multi-seed validation (3 seeds, mean: XX.XX ± Y.YY%)
+   ✅ Baseline comparisons (NSM beats all by X.X%)
+   ✅ Interpretability demonstrations (5 concrete examples)
+   ✅ Task documentation (planning dataset fully described)
+
+   ### Documentation Quality
+   ✅ README matches current state
+   ✅ 2-page research summary created
+   ✅ FAQ addresses anticipated questions
+   ✅ Hero figure shows key contributions
+
+   ### Accessibility
+   ✅ Standalone demo script (no Modal required)
+   ✅ Pre-trained checkpoint on GitHub release
+   ✅ 5-minute demo workflow established
+
+   ## Key Results
+
+   [Insert hero figure]
+
+   **Main Finding**: NSM achieves 66.4 ± Y.Y% accuracy on planning task,
+   beating simple baselines by X.X% while providing full interpretability
+   via explicit 6-level reasoning traces.
+
+   **Novel Contribution**: Physics-inspired q_neural metric predicts training
+   collapse with 85.7% accuracy (NSM-33 validation).
+
+   **Honest Limitations**:
+   - Synthetic dataset (not real-world planning yet)
+   - Modest absolute accuracy (room for improvement)
+   - Requires PyTorch Geometric (deployment friction)
+
+   ## Ready to Share
+
+   **Recommended First Contact**: SSI or SoftMax (smaller orgs, early-stage work)
+
+   **Conversation Starter**:
+   > "We built hierarchical GNNs with symmetric abstraction/concretization
+   > (via category theory). Interesting bit: borrowed plasma physics metrics
+   > to predict training collapse (85% accuracy). Also tried game theory -
+   > total failure, but interesting failure. Would love your thoughts on
+   > [specific question relevant to their work]."
+
+   **Demo Flow** (5 minutes):
+   1. Show hero figure (1 min)
+   2. Run standalone demo (2 min)
+   3. Walk through one reasoning trace (2 min)
+
+   ## What's Next
+
+   **If feedback is positive**:
+   - Evaluate on real-world benchmark (bAbI, CLEVR, etc.)
+   - Scale to larger models
+   - Develop Anthropic pitch (alignment angle)
+
+   **If feedback identifies gaps**:
+   - Address specific concerns
+   - Iterate before wider sharing
+
+   ## Files to Share
+
+   Core package:
+   - NSM_RESEARCH_SUMMARY.pdf (2-page overview)
+   - NSM_HERO_FIGURE.png (key visual)
+   - Link to GitHub repo
+   - Link to standalone demo
+
+   Optional (if they want details):
+   - MULTI_SEED_RESULTS.md
+   - BASELINE_COMPARISON.md
+   - INTERPRETABILITY_DEMO.md
+   ```
+
+**Deliverable**:
+- `results/NSM_HERO_FIGURE.png` (presentation-quality)
+- `results/NSM_HERO_FIGURE.pdf` (publication-quality)
+- `TWO_WEEK_SPRINT_RESULTS.md` (comprehensive summary)
+- Completed checklist (all items checked)
+
+**Time**: 16 hours (2 days x 8 hours)
+
+---
+
+## Cost & Resource Summary
+
+**Total Time**: 14 days (1 person full-time = 112 hours)
+
+**Modal Compute Costs**:
+- Multi-seed validation: 3 runs × $10 = $30
+- Baseline comparisons: 3 runs × $7 = $21
+- Buffer for failures/reruns: $49
+- **Total: ~$100**
+
+**Required Skills**:
+- Python/PyTorch (moderate)
+- Matplotlib/visualization (basic)
+- Technical writing (moderate)
+- LaTeX/figure design (basic)
+
+**External Dependencies**:
+- Modal account (for training)
+- GitHub account (for releases)
+- LaTeX/Inkscape (for hero figure - optional, can use Python)
+
+---
+
+## Success Metrics
+
+**Minimum Viable Demo** (must achieve):
+- [ ] ≥3 seeds complete with std < 5%
+- [ ] NSM beats all baselines by ≥2%
+- [ ] 5 interpretability examples with visualizations
+- [ ] Standalone demo runs in <5 minutes
+
+**Share-Ready Package** (goal):
+- [ ] 2-page summary is clear to non-experts
+- [ ] Hero figure tells the story at a glance
+- [ ] Can answer "what problem?" in one sentence
+- [ ] No embarrassing gaps in anticipated questions
+
+**Confidence to Share**:
+- [ ] Would not waste their time
+- [ ] Have defensible claims
+- [ ] Can demo in real-time
+- [ ] Honest about limitations
+
+---
+
+## Risk Mitigation
+
+**If multi-seed experiments fail again**:
+- Debug timeout issues (increase timeout, reduce batch size)
+- Fall back to 2 seeds if necessary (acknowledge limitation)
+- Emphasize single-seed result consistency
+
+**If baselines beat NSM**:
+- Investigate why (architecture issue? hyperparameters?)
+- Pivot framing: "interpretability with competitive accuracy"
+- Be honest: "baselines win on accuracy, we win on interpretability"
+
+**If interpretability visualizations are nonsense**:
+- Debug what each level actually learns
+- May need to retrain with interpretability constraints
+- Worst case: pivot to "physics metrics" as main contribution
+
+**If we run out of time**:
+- Prioritize: Multi-seed > Baselines > Interpretability > Documentation
+- Can share incomplete package with "work in progress" framing
+- Better to wait an extra week than share too early
+
+---
+
+## Next Steps After Sprint
+
+**If sprint succeeds**:
+1. Share with SSI/SoftMax contact
+2. Collect feedback
+3. Iterate based on input
+4. Consider Anthropic if feedback is positive
+
+**If sprint reveals fundamental issues**:
+1. Document learnings
+2. Decide: pivot vs. persist
+3. May need month-long effort instead of 2 weeks
+
+**Long-term (3-6 months)**:
+- Real-world benchmark evaluation
+- Publication submission (NeurIPS, ICLR)
+- Deployment case study
+
+---
+
+## Daily Standup Template
+
+Use this to track progress:
+
+```markdown
+# Day X Progress
+
+## Completed Today
+- [ ] Task 1
+- [ ] Task 2
+
+## Blocked On
+- Issue 1: [description]
+
+## Tomorrow's Plan
+- [ ] Task 3
+- [ ] Task 4
+
+## Risks/Questions
+- Concern 1
+- Question 2
+```
+
+---
+
+## Final Thoughts
+
+This sprint is ambitious but achievable. The key is maintaining focus on the core question: **"Would sharing this waste someone's time?"**
+
+After 2 weeks, you should have a compelling demo that:
+1. Makes defensible scientific claims (multi-seed validation)
+2. Shows clear value (beats baselines, provides interpretability)
+3. Can be experienced in 5 minutes (standalone demo)
+4. Acknowledges limitations honestly (FAQ, limitations section)
+
+That's the difference between "interesting research prototype" and "half-baked work." The foundation is solid - we just need to package it properly.
+
+**Let's make NSM share-worthy! 🚀**
diff --git a/analysis/README_ISOMORPHISMS.md b/analysis/README_ISOMORPHISMS.md
index 875e0ef..9e8d53c 100644
--- a/analysis/README_ISOMORPHISMS.md
+++ b/analysis/README_ISOMORPHISMS.md
@@ -1,4 +1,4 @@
-# Physics Isomorphisms for Neural Collapse Prediction
+# Physics-Inspired Empirical Heuristics for Neural Collapse Prediction
 
 **Analysis Date**: 2025-10-23
 **Context**: NSM-33 Physics-Inspired Collapse Prediction (Pilot Results)
@@ -8,15 +8,17 @@
 
 ## Overview
 
-This directory contains analysis of **6 mathematical/physical isomorphisms** for predicting and preventing neural collapse in the NSM 6-level chiral architecture:
+This directory contains analysis of **6 empirical heuristics (originally framed as physical isomorphisms)** for predicting and preventing neural collapse in the NSM 6-level chiral architecture:
 
-1. **Fusion-Plasma** (NSM-33, validated) - Safety factor q_neural, temperature profiles, Lawson criterion
+1. **Fusion-Plasma** (NSM-33, validated) - Safety factor q_neural, representation variance profiles, Lawson criterion
 2. **Phase Transitions** (NEW) - Critical slowing, hysteresis, universal scaling
 3. **Control Theory** (NEW) - PID control, anti-windup, optimal damping
-4. **Hydrodynamics** (NEW) - Rayleigh-Bénard convection, temperature inversion
+4. **Hydrodynamics** (NEW) - Rayleigh-Bénard convection, variance inversion
 5. **Quantum Ising** (NEW) - Ferromagnetic coupling, spontaneous symmetry breaking
 6. **Catastrophe Theory** (NEW) - Cusp singularity, bistability, fold bifurcations
 
+**Note on Terminology**: These metrics are inspired by physical systems and exhibit structural similarities, but are **empirical heuristics** rather than rigorous isomorphisms. Dimensional analysis reveals they lack the invariance properties required for true physical analogies. They remain useful predictive tools validated through experiment
+
 ---
 
 ## Key Files
@@ -50,14 +52,14 @@ This directory contains analysis of **6 mathematical/physical isomorphisms** for
 
 ### 2. Multiple Physics Domains Map to Same Structure
 
-All isomorphisms share:
+All heuristics share common mathematical structure:
 - **Order parameter**: ψ = 1 - |acc₀ - acc₁| (class balance)
-- **Control parameter**: Diversity weight (temperature analog)
+- **Control parameter**: Diversity weight (variance control)
 - **Bifurcation**: Stable → collapsed transition
 - **Hysteresis**: Forward ≠ backward paths
 - **Dynamics**: dψ/dt = -∂V/∂ψ + noise
 
-This is **not coincidence** - reflects universal behavior of nonlinear dynamical systems.
+This reflects universal behavior of nonlinear dynamical systems - the structural similarities are useful for prediction even without rigorous physical correspondence.
 
 ### 3. Physics Metrics Validated
 
diff --git a/docs/diversity_regularization.md b/docs/diversity_regularization.md
new file mode 100644
index 0000000..de52539
--- /dev/null
+++ b/docs/diversity_regularization.md
@@ -0,0 +1,262 @@
+# Diversity Regularization for Temperature Profile Correction
+
+## Overview
+
+Diversity regularization enforces the correct hierarchical ordering of representation variances (T_L1 < T_L2 < T_L3) in the 6-level chiral architecture. This addresses the temperature inversion bug discovered in NSM-33 pilot study.
+
+## Mathematical Formulation
+
+### Temperature (Representation Variance)
+
+At each level k, the temperature is defined as the mean variance across feature dimensions:
+
+```
+T_Lk = mean(var(x_Lk, dim=samples))
+```
+
+Where:
+- `x_Lk ∈ ℝ^(N × d)` are the representations at level k
+- N = number of nodes/samples
+- d = feature dimensionality
+
+### Desired Profile
+
+The correct hierarchical ordering should follow information bottleneck principle:
+
+```
+T_L1 < T_L2 < T_L3
+```
+
+Where:
+- **L1 (concrete)**: Low variance - specialized, task-specific features
+- **L2 (intermediate)**: Medium variance - compositional features
+- **L3 (abstract)**: High variance - diverse conceptual representations
+
+### Regularization Loss
+
+The diversity loss penalizes violations of the hierarchical ordering:
+
+```python
+L_diversity = λ_div × [
+    ReLU(T_L1 - T_L2) +              # Penalize L1 > L2
+    ReLU(T_L2 - T_L3) +              # Penalize L2 > L3
+    ReLU(γ_target - (T_L3 - T_L1))  # Encourage minimum gradient
+]
+```
+
+Where:
+- λ_div = diversity regularization weight (default: 0.1)
+- γ_target = target minimum gradient (default: 0.1)
+- ReLU(x) = max(0, x)
+
+## Implementation
+
+### DiversityRegularization Module
+
+```python
+class DiversityRegularization(nn.Module):
+    """
+    Enforce correct temperature profile: L1 < L2 < L3 in diversity.
+
+    Location: nsm/models/chiral_fixed_temp.py:27-92
+    """
+
+    def __init__(self, weight: float = 0.1):
+        super().__init__()
+        self.weight = weight  # λ_div
+
+    def forward(
+        self,
+        x_l1: torch.Tensor,  # [N, d] representations at L1
+        x_l2: torch.Tensor,  # [N, d] representations at L2
+        x_l3: torch.Tensor   # [N, d] representations at L3
+    ) -> Tuple[torch.Tensor, Dict[str, float]]:
+        """
+        Compute diversity regularization loss.
+
+        Returns:
+            loss: Scalar tensor
+            diagnostics: Dict with T_L1, T_L2, T_L3, T_gradient
+        """
+        # Compute temperatures (variances)
+        T_L1 = x_l1.var(dim=0).mean()  # Mean variance across features
+        T_L2 = x_l2.var(dim=0).mean()
+        T_L3 = x_l3.var(dim=0).mean()
+
+        loss = torch.tensor(0.0, device=x_l1.device)
+
+        # Penalize inversions
+        if T_L2 < T_L1:
+            loss = loss + F.relu(T_L1 - T_L2)
+
+        if T_L3 < T_L2:
+            loss = loss + F.relu(T_L2 - T_L3)
+
+        # Encourage minimum gradient
+        gradient = T_L3 - T_L1
+        target_gradient = 0.1
+
+        if gradient < target_gradient:
+            loss = loss + F.relu(target_gradient - gradient)
+
+        loss = loss * self.weight
+
+        return loss, diagnostics
+```
+
+### Integration with Loss Function
+
+```python
+class FixedTemperatureChiralLoss(nn.Module):
+    """
+    Composite loss including diversity regularization.
+
+    Location: nsm/models/chiral_fixed_temp.py:154-242
+    """
+
+    def forward(self, model_output, targets):
+        # Standard task + auxiliary + cycle losses
+        loss_task = self.task_criterion(model_output['logits'], targets)
+        loss_aux = ...
+        loss_cycle = ...
+
+        # Diversity regularization (added)
+        loss_diversity = model_output.get('diversity_loss', 0.0)
+
+        # Total composite loss
+        L_total = (
+            λ_task × loss_task +
+            λ_aux × loss_aux +
+            λ_cycle × loss_cycle +
+            λ_div × loss_diversity  # NEW
+        )
+
+        return {'loss': L_total, ...}
+```
+
+## Hyperparameters
+
+| Parameter | Symbol | Default | Range | Description |
+|-----------|--------|---------|-------|-------------|
+| Diversity weight | λ_div | 0.1 | [0.01, 0.5] | Global scaling of diversity loss |
+| Target gradient | γ_target | 0.1 | [0.05, 0.3] | Minimum required T_L3 - T_L1 |
+
+### Tuning Guidelines
+
+**λ_div too low (< 0.05):**
+- Temperature inversions persist
+- q_neural remains unstable
+- Class imbalance issues
+
+**λ_div too high (> 0.3):**
+- Dominates other losses
+- May prevent task learning
+- Representations become overly dispersed
+
+**Recommended:** Start at 0.1, increase if inversions persist after 5 epochs.
+
+## Results (NSM-33 Track C)
+
+### Before Fix (Pilot Study, N=2K)
+```
+T_L1: 0.40 → T_L2: 0.25 → T_L3: 0.13
+Gradient: -0.27 [INVERTED]
+q_neural: 0.45 (COLLAPSE RISK)
+Accuracy: 48.16%
+```
+
+### After Fix (10x Scale, N=20K, λ_div=0.1)
+```
+T_L1: 0.36 → T_L2: 4.16 → T_L3: 19.53
+Gradient: +19.17 [NORMAL]
+q_neural: 0.625 (marginal stability)
+Accuracy: 65.57%
+```
+
+### Analysis
+
+✅ **Temperature profile corrected** - Gradient changed from -0.27 to +19.17
+
+⚠️ **q_neural still below 1.0** - Suggests other stability factors at play
+
+✅ **Accuracy improved** - +17.41 percentage points
+
+**Confound:** Scale effect (2K → 20K) dominates diversity regularization effect. Need ablation study at same scale.
+
+## Theoretical Justification
+
+### Information Bottleneck Perspective
+
+Tishby & Zaslavsky (2015) show that deep networks exhibit two phases:
+1. **Fitting phase**: Representations increase mutual information I(X; T)
+2. **Compression phase**: Higher layers compress I(T; X) while preserving I(T; Y)
+
+**Prediction:** Higher layers (L3) should have **higher entropy** (variance) of representations to maintain diverse abstract concepts, while lower layers (L1) compress to task-relevant features.
+
+**Our observations align with this theory.**
+
+### Why Inversions Are Problematic
+
+**Hypothesis:** When T_L1 > T_L3, the architecture:
+1. Overfits at concrete level (high variance in L1 = memorization)
+2. Underspecifies at abstract level (low variance in L3 = collapsed concepts)
+3. Violates hierarchical abstraction (information flows "uphill")
+
+**Analogy:** Like a neural network with bottleneck at the wrong end.
+
+### Alternative Interpretation (Peer Review Concern)
+
+**Reviewer's critique:** Compression may be HEALTHY, not pathological. High variance in L3 might indicate:
+- Insufficient training (representations not converged)
+- Regularization preventing compression
+- Fighting against natural information bottleneck
+
+**Counter-evidence:**
+- Fixed architecture has **worse** class balance (11.48% vs 5.91%)
+- Fixed architecture has **lower** q_neural (0.625 vs 1.336)
+- Scale alone (baseline) achieves better results
+
+**Conclusion:** Effect is **CONFOUNDED** - scale dominates diversity regularization. Need controlled ablation.
+
+## Recommended Ablation Study
+
+To isolate diversity regularization effect:
+
+| Condition | N | λ_div | Expected Result |
+|-----------|---|-------|-----------------|
+| Baseline-2K | 2,000 | 0.0 | Inverted profile (replicate pilot) |
+| Fixed-2K | 2,000 | 0.1 | Test if diversity fixes at small scale |
+| Baseline-20K | 20,000 | 0.0 | Already done (67.11%) |
+| Fixed-20K | 20,000 | 0.1 | Already done (65.57%) |
+| **NEW** Baseline-20K-no-reg | 20,000 | 0.0 | Control: Scale without regularization |
+
+**Critical test:** Does Fixed-2K correct inversion without scale?
+
+## Limitations
+
+1. **No dimensional analysis** - Temperatures have arbitrary units (not dimensionless)
+2. **Threshold (γ=0.1) arbitrary** - Not derived from theory
+3. **Scale confound** - Cannot separate diversity effect from data sufficiency
+4. **Single dataset** - Generalization unknown
+5. **No causal evidence** - Correlation between profile and stability, not causation
+
+## Future Work
+
+1. **Information-theoretic reformulation** - Replace variance with mutual information I(X_L; Y)
+2. **Adaptive γ_target** - Scale with model capacity and task complexity
+3. **Per-layer regularization** - Different λ_div for each level
+4. **Multi-dataset validation** - Test on KG, causal reasoning domains
+5. **Ablation at fixed scale** - Isolate diversity effect from scale effect
+
+## References
+
+- Tishby & Zaslavsky (2015). "Deep Learning and the Information Bottleneck Principle"
+- Shwartz-Ziv & Tishby (2017). "Opening the Black Box of Deep Neural Networks"
+- Saxe et al. (2019). "On the Information Bottleneck Theory of Deep Learning"
+
+## See Also
+
+- `nsm/models/chiral_fixed_temp.py` - Implementation
+- `experiments/modal_10x_fixed_temp.py` - Validation experiment
+- `results/NSM-33_10x_validation_results.md` - Empirical results
+- `docs/physics_metrics.md` - Related stability metrics
diff --git a/experiments/modal_10x_baseline.py b/experiments/modal_10x_baseline.py
index 9b7d330..675342a 100644
--- a/experiments/modal_10x_baseline.py
+++ b/experiments/modal_10x_baseline.py
@@ -68,6 +68,7 @@ def validate_10x_baseline():
     from nsm.training.physics_metrics import compute_all_physics_metrics
     from nsm.data.planning_dataset import PlanningTripleDataset
     from nsm.data.utils import adaptive_train_val_split
+    from nsm.utils.checkpoint_manager import CheckpointManager
 
     print("="*70)
     print("10X SCALED BASELINE VALIDATION (N=20,000)")
@@ -179,6 +180,10 @@ def pyg_collate(data_list):
     # Optimizer
     optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
 
+    # Initialize checkpoint manager
+    checkpoint_manager = CheckpointManager("/checkpoints", "nsm-10x-baseline")
+    print(f"Checkpoint manager initialized: {checkpoint_manager.checkpoint_dir}")
+
     # Training loop
     print("\n" + "="*70)
     print("TRAINING WITH 10X SCALED DATASET")
@@ -387,17 +392,45 @@ def pyg_collate(data_list):
 
         history.append(epoch_data)
 
-        # Early stopping
-        if val_accuracy > best_val_accuracy:
+        # Save checkpoint and check early stopping
+        is_best = val_accuracy > best_val_accuracy
+
+        if is_best:
             best_val_accuracy = val_accuracy
             best_val_loss = val_loss
             patience_counter = 0
             print(f"\n  New best accuracy: {best_val_accuracy:.4f}")
         else:
             patience_counter += 1
-            if patience_counter >= config["patience"]:
-                print(f"\n  Early stopping triggered (patience={config['patience']})")
-                break
+
+        # Save checkpoint (every epoch)
+        checkpoint_metrics = {
+            "val_accuracy": val_accuracy,
+            "val_loss": val_loss,
+            "class_balance_delta": class_balance_delta
+        }
+
+        # Add physics metrics if available
+        if physics_metrics:
+            checkpoint_metrics["q_neural"] = physics_metrics['q_neural']
+            checkpoint_metrics["Q_factor"] = physics_metrics['Q_factor']
+
+        checkpoint_manager.save_checkpoint(
+            model=model,
+            epoch=epoch + 1,
+            metrics=checkpoint_metrics,
+            config=config,
+            optimizer=optimizer,
+            is_best=is_best
+        )
+
+        # Commit volume after saving checkpoint
+        volume.commit()
+
+        # Check early stopping
+        if patience_counter >= config["patience"]:
+            print(f"\n  Early stopping triggered (patience={config['patience']})")
+            break
 
     # Final results
     print("\n" + "="*70)
diff --git a/experiments/modal_combined_validation.py b/experiments/modal_combined_validation.py
new file mode 100644
index 0000000..088cc8f
--- /dev/null
+++ b/experiments/modal_combined_validation.py
@@ -0,0 +1,660 @@
+"""
+Modal GPU validation script with COMBINED fix: L3 diversity regularization + adaptive training control.
+
+This implements BOTH approaches together:
+1. ARCHITECTURAL FIX: L3 diversity regularization in FullChiralModel
+2. RUNTIME ADAPTATION: Adaptive hyperparameter adjustment based on physics metrics
+
+The hypothesis is that these fixes are synergistic:
+- Diversity regularization maintains "temperature" (representation spread)
+- Adaptive control dynamically adjusts loss weights when physics metrics warn of collapse
+
+Usage:
+    modal run experiments/modal_combined_validation.py::validate_combined_fix
+"""
+
+import modal
+import sys
+from pathlib import Path
+
+# Modal app configuration
+app = modal.App("nsm-combined-fix")
+
+# Project root for local imports
+PROJECT_ROOT = Path(__file__).parent.parent.absolute()
+
+# Modal image with dependencies
+image = (
+    modal.Image.debian_slim(python_version="3.10")
+    .pip_install(
+        "numpy<2",  # Pin to NumPy 1.x for torch-scatter compatibility
+        "torch==2.1.0",
+        "torch-geometric==2.4.0",
+        "tqdm",
+    )
+    .run_commands(
+        "pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.1.0+cpu.html"
+    )
+    .add_local_dir(PROJECT_ROOT, "/root/NSM", copy=True, ignore=["*.pyc", "__pycache__", ".git", "logs", "checkpoints", "data", ".pytest_cache"])
+)
+
+# Modal volume for checkpoints
+volume = modal.Volume.from_name("nsm-checkpoints", create_if_missing=True)
+
+
+class AdaptiveTrainingController:
+    """
+    Runtime adaptation based on physics metrics.
+
+    Dynamically adjusts loss weights when collapse risk is detected:
+    - If q_neural < 1.0: Increase diversity_weight
+    - If temperature inverted: Increase cycle_weight
+    - If Lawson Q < 1.0: Reduce learning rate
+    """
+
+    def __init__(
+        self,
+        initial_diversity_weight: float = 0.0,
+        initial_cycle_weight: float = 0.01,
+        initial_lr: float = 1e-4,
+        min_diversity_weight: float = 0.0,
+        max_diversity_weight: float = 0.5,
+        min_cycle_weight: float = 0.01,
+        max_cycle_weight: float = 0.1,
+        min_lr: float = 1e-5,
+        max_lr: float = 1e-3
+    ):
+        self.diversity_weight = initial_diversity_weight
+        self.cycle_weight = initial_cycle_weight
+        self.lr = initial_lr
+
+        self.min_diversity_weight = min_diversity_weight
+        self.max_diversity_weight = max_diversity_weight
+        self.min_cycle_weight = min_cycle_weight
+        self.max_cycle_weight = max_cycle_weight
+        self.min_lr = min_lr
+        self.max_lr = max_lr
+
+        self.adjustment_history = []
+
+    def update(
+        self,
+        physics_metrics: dict,
+        epoch: int,
+        optimizer: any
+    ) -> dict:
+        """
+        Update hyperparameters based on physics metrics.
+
+        Args:
+            physics_metrics: Dict from compute_all_physics_metrics
+            epoch: Current epoch
+            optimizer: PyTorch optimizer (for LR adjustment)
+
+        Returns:
+            Dict of adjustments made
+        """
+        adjustments = {
+            'epoch': epoch,
+            'diversity_weight_old': self.diversity_weight,
+            'cycle_weight_old': self.cycle_weight,
+            'lr_old': self.lr,
+            'actions': []
+        }
+
+        # Action 1: Increase diversity weight if q_neural < 1.0
+        if physics_metrics['q_neural'] < 1.0:
+            old_weight = self.diversity_weight
+            # Increase by 50% (multiplicative), capped at max
+            self.diversity_weight = min(
+                self.diversity_weight * 1.5 + 0.05,  # Add 0.05 if starting at 0
+                self.max_diversity_weight
+            )
+            adjustments['actions'].append(
+                f"⚡ Increased diversity_weight: {old_weight:.4f} → {self.diversity_weight:.4f} (q={physics_metrics['q_neural']:.3f} < 1)"
+            )
+
+        # Action 2: Increase cycle weight if temperature inverted
+        if physics_metrics.get('profile_type') == 'inverted':
+            old_weight = self.cycle_weight
+            # Increase by 30%
+            self.cycle_weight = min(
+                self.cycle_weight * 1.3,
+                self.max_cycle_weight
+            )
+            adjustments['actions'].append(
+                f"⚡ Increased cycle_weight: {old_weight:.4f} → {self.cycle_weight:.4f} (temperature inverted)"
+            )
+
+        # Action 3: Reduce LR if Lawson Q < 0.5 (deep subignition)
+        if physics_metrics['Q_factor'] < 0.5:
+            old_lr = self.lr
+            # Reduce by 20%
+            self.lr = max(
+                self.lr * 0.8,
+                self.min_lr
+            )
+            # Apply to optimizer
+            for param_group in optimizer.param_groups:
+                param_group['lr'] = self.lr
+            adjustments['actions'].append(
+                f"⚡ Reduced learning_rate: {old_lr:.6f} → {self.lr:.6f} (Q={physics_metrics['Q_factor']:.3f} < 0.5)"
+            )
+
+        # Action 4: Restore diversity weight if system stable
+        if physics_metrics['q_neural'] > 1.5 and self.diversity_weight > self.min_diversity_weight:
+            old_weight = self.diversity_weight
+            # Gradually reduce (don't eliminate entirely)
+            self.diversity_weight = max(
+                self.diversity_weight * 0.9,
+                self.min_diversity_weight
+            )
+            adjustments['actions'].append(
+                f"⚡ Reduced diversity_weight: {old_weight:.4f} → {self.diversity_weight:.4f} (q={physics_metrics['q_neural']:.3f} > 1.5, stable)"
+            )
+
+        adjustments['diversity_weight_new'] = self.diversity_weight
+        adjustments['cycle_weight_new'] = self.cycle_weight
+        adjustments['lr_new'] = self.lr
+
+        self.adjustment_history.append(adjustments)
+
+        return adjustments
+
+
+@app.function(
+    image=image,
+    gpu="A100",
+    timeout=3600,
+    volumes={"/checkpoints": volume}
+)
+def validate_combined_fix():
+    """
+    Validate 6-level chiral architecture with COMBINED fix:
+    1. L3 diversity regularization (architectural)
+    2. Adaptive training control (runtime)
+    """
+    import json
+    import torch
+    import torch.nn.functional as F
+    from torch.utils.data import DataLoader
+    from torch_geometric.data import Batch
+    from datetime import datetime
+    from tqdm import tqdm
+
+    # Add NSM to path
+    sys.path.insert(0, "/root/NSM")
+
+    from nsm.models.chiral import FullChiralModel
+    from nsm.training.chiral_loss import ChiralCompositeLoss, compute_class_balance_metrics
+    from nsm.training.physics_metrics import compute_all_physics_metrics
+    from nsm.data.planning_dataset import PlanningTripleDataset
+
+    print("="*70)
+    print("COMBINED FIX VALIDATION - NSM-33")
+    print("="*70)
+    print("\nTesting synergistic approach:")
+    print("  1. ARCHITECTURAL: L3 diversity regularization")
+    print("  2. RUNTIME: Adaptive hyperparameter control")
+    print("="*70)
+
+    # Configuration
+    config = {
+        "variant": "6level_combined_fix",
+        "epochs": 20,  # More epochs to test adaptation
+        "batch_size": 64,
+        "learning_rate": 1e-4,
+        "seed": 42,
+        "pool_ratio": 0.5,
+        "dropout": 0.1,
+        "patience": 20,
+
+        # Initial loss weights (will be adapted)
+        "task_weight": 1.0,
+        "aux_weight": 0.3,
+        "cycle_weight": 0.01,  # Will increase if needed
+        "diversity_weight": 0.0,  # Will increase if needed (starts at 0)
+
+        # Adaptive control ranges
+        "max_diversity_weight": 0.3,
+        "max_cycle_weight": 0.1,
+        "min_lr": 1e-5,
+
+        # Optional focal loss
+        "use_focal_loss": False,
+        "focal_alpha": 0.25,
+        "focal_gamma": 2.0,
+
+        # Physics metrics
+        "track_physics_metrics": True,
+        "task_complexity": 1.0,
+
+        # Enable adaptive control
+        "use_adaptive_control": True
+    }
+
+    torch.manual_seed(config["seed"])
+
+    # Load dataset
+    print("\nLoading Planning dataset...")
+    full_dataset = PlanningTripleDataset(root="/tmp/planning", split="train", num_problems=4100)
+
+    # Materialize all graphs into a list
+    print(f"Total dataset size: {len(full_dataset)}")
+    all_graphs = [full_dataset[i] for i in range(len(full_dataset))]
+    print(f"Materialized {len(all_graphs)} graphs")
+
+    # Split into train/val
+    train_size = 2000
+    train_graphs = all_graphs[:train_size]
+    val_graphs = all_graphs[train_size:]
+
+    # Create DataLoaders with explicit collate function
+    def pyg_collate(data_list):
+        graphs = [item[0] for item in data_list]
+        labels = torch.tensor([item[1] for item in data_list])
+        batch = Batch.from_data_list(graphs)
+        batch.y = labels
+        return batch
+
+    print(f"Train samples: {len(train_graphs)}")
+    print(f"Val samples: {len(val_graphs)}")
+
+    train_loader = DataLoader(train_graphs, batch_size=config["batch_size"], shuffle=True, collate_fn=pyg_collate)
+    val_loader = DataLoader(val_graphs, batch_size=config["batch_size"], shuffle=False, collate_fn=pyg_collate)
+
+    # Get data properties from first batch
+    print("Fetching first batch...")
+    sample = next(iter(train_loader))
+    node_features = sample.x.size(1)
+    num_relations = int(sample.edge_type.max().item()) + 1
+    num_classes = 2
+
+    print(f"\nDataset properties:")
+    print(f"  Node features: {node_features}")
+    print(f"  Num relations: {num_relations}")
+    print(f"  Num classes: {num_classes}")
+
+    # Initialize model (with L3 diversity regularization)
+    print("\nInitializing FullChiralModel with L3 diversity regularization...")
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+    model = FullChiralModel(
+        node_features=node_features,
+        num_relations=num_relations,
+        num_classes=num_classes,
+        pool_ratio=config["pool_ratio"],
+        task_type='classification',
+        dropout=config["dropout"]
+    ).to(device)
+
+    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+    # Initialize adaptive controller
+    controller = None
+    if config["use_adaptive_control"]:
+        print("\nInitializing AdaptiveTrainingController...")
+        controller = AdaptiveTrainingController(
+            initial_diversity_weight=config["diversity_weight"],
+            initial_cycle_weight=config["cycle_weight"],
+            initial_lr=config["learning_rate"],
+            max_diversity_weight=config["max_diversity_weight"],
+            max_cycle_weight=config["max_cycle_weight"],
+            min_lr=config["min_lr"]
+        )
+
+    # Initialize loss function
+    criterion = ChiralCompositeLoss(
+        task_weight=config["task_weight"],
+        aux_weight=config["aux_weight"],
+        cycle_weight=config["cycle_weight"],
+        diversity_weight=config["diversity_weight"],
+        use_focal_loss=config["use_focal_loss"],
+        focal_alpha=config["focal_alpha"],
+        focal_gamma=config["focal_gamma"]
+    )
+
+    # Optimizer
+    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
+
+    # Training loop
+    print("\n" + "="*70)
+    print("TRAINING WITH COMBINED FIX")
+    print("="*70)
+
+    best_val_accuracy = 0.0
+    best_val_loss = float('inf')
+    patience_counter = 0
+
+    history = []
+
+    for epoch in range(config["epochs"]):
+        # Train
+        model.train()
+        train_loss = 0.0
+        train_loss_task = 0.0
+        train_loss_aux = 0.0
+        train_loss_cycle = 0.0
+        train_loss_diversity = 0.0
+
+        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{config['epochs']} [Train]"):
+            batch = batch.to(device)
+
+            # Forward pass
+            output = model(batch.x, batch.edge_index, batch.edge_type, batch.batch)
+
+            # Update loss weights from controller
+            if controller is not None:
+                criterion.diversity_weight = controller.diversity_weight
+                criterion.cycle_weight = controller.cycle_weight
+
+            # Compute loss
+            loss_dict = criterion(output, batch.y, task_type='classification')
+
+            # Backward
+            optimizer.zero_grad()
+            loss_dict['loss'].backward()
+
+            # Gradient clipping to prevent explosion
+            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+
+            optimizer.step()
+
+            train_loss += loss_dict['loss'].item()
+            train_loss_task += loss_dict['loss_task'].item()
+            train_loss_aux += loss_dict['loss_task_aux'].item()
+            train_loss_cycle += loss_dict['loss_cycle'].item()
+            train_loss_diversity += loss_dict.get('loss_diversity', 0.0)
+
+        train_loss /= len(train_loader)
+        train_loss_task /= len(train_loader)
+        train_loss_aux /= len(train_loader)
+        train_loss_cycle /= len(train_loader)
+        train_loss_diversity /= len(train_loader)
+
+        # Validate
+        model.eval()
+        val_loss = 0.0
+        val_loss_task = 0.0
+        val_loss_aux = 0.0
+        val_loss_cycle = 0.0
+        val_loss_diversity = 0.0
+        correct_total = 0
+        correct_class_0 = 0
+        correct_class_1 = 0
+        total_class_0 = 0
+        total_class_1 = 0
+        total = 0
+
+        # For physics metrics: collect level representations
+        all_level_reps_l1 = []
+        all_level_reps_l2 = []
+        all_level_reps_l3 = []
+
+        with torch.no_grad():
+            for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{config['epochs']} [Val]"):
+                batch = batch.to(device)
+
+                # Forward pass
+                output = model(batch.x, batch.edge_index, batch.edge_type, batch.batch)
+
+                # Collect level representations for physics metrics
+                if 'x_l1' in output:
+                    all_level_reps_l1.append(output['x_l1'].cpu())
+                if 'x_l2' in output:
+                    all_level_reps_l2.append(output['x_l2'].cpu())
+                if 'x_l3' in output:
+                    all_level_reps_l3.append(output['x_l3'].cpu())
+
+                # Compute loss
+                loss_dict = criterion(output, batch.y, task_type='classification')
+
+                val_loss += loss_dict['loss'].item()
+                val_loss_task += loss_dict['loss_task'].item()
+                val_loss_aux += loss_dict['loss_task_aux'].item()
+                val_loss_cycle += loss_dict['loss_cycle'].item()
+                val_loss_diversity += loss_dict.get('loss_diversity', 0.0)
+
+                # Accuracy
+                pred = output['logits'].argmax(dim=1)
+                correct_total += (pred == batch.y).sum().item()
+                total += batch.y.size(0)
+
+                # Per-class accuracy
+                for cls in [0, 1]:
+                    mask = (batch.y == cls)
+                    if mask.sum() > 0:
+                        if cls == 0:
+                            correct_class_0 += (pred[mask] == cls).sum().item()
+                            total_class_0 += mask.sum().item()
+                        else:
+                            correct_class_1 += (pred[mask] == cls).sum().item()
+                            total_class_1 += mask.sum().item()
+
+        val_loss /= len(val_loader)
+        val_loss_task /= len(val_loader)
+        val_loss_aux /= len(val_loader)
+        val_loss_cycle /= len(val_loader)
+        val_loss_diversity /= len(val_loader)
+        val_accuracy = correct_total / total
+        val_accuracy_class_0 = correct_class_0 / total_class_0 if total_class_0 > 0 else 0
+        val_accuracy_class_1 = correct_class_1 / total_class_1 if total_class_1 > 0 else 0
+        class_balance_delta = abs(val_accuracy_class_0 - val_accuracy_class_1)
+
+        # ===== PHYSICS METRICS =====
+        physics_metrics = {}
+        if config["track_physics_metrics"]:
+            # Prepare class accuracies
+            class_accs = {
+                'accuracy_class_0': val_accuracy_class_0,
+                'accuracy_class_1': val_accuracy_class_1
+            }
+
+            # Prepare level representations (concatenate batches)
+            level_reps = {}
+            if all_level_reps_l1:
+                level_reps['L1'] = torch.cat(all_level_reps_l1, dim=0)
+            if all_level_reps_l2:
+                level_reps['L2'] = torch.cat(all_level_reps_l2, dim=0)
+            if all_level_reps_l3:
+                level_reps['L3'] = torch.cat(all_level_reps_l3, dim=0)
+
+            # Compute all physics metrics
+            physics_metrics = compute_all_physics_metrics(
+                model=model,
+                class_accuracies=class_accs,
+                level_representations=level_reps,
+                epoch=epoch + 1,
+                task_complexity=config["task_complexity"]
+            )
+
+        # ===== ADAPTIVE CONTROL =====
+        adjustments = None
+        if controller is not None and physics_metrics:
+            adjustments = controller.update(physics_metrics, epoch + 1, optimizer)
+
+        # Log standard metrics
+        print(f"\n{'='*70}")
+        print(f"Epoch {epoch+1}/{config['epochs']}")
+        print(f"{'='*70}")
+        print(f"  Train Loss: {train_loss:.4f} (task: {train_loss_task:.4f}, aux: {train_loss_aux:.4f}, cycle: {train_loss_cycle:.4f}, div: {train_loss_diversity:.4f})")
+        print(f"  Val Loss: {val_loss:.4f} (task: {val_loss_task:.4f}, aux: {val_loss_aux:.4f}, cycle: {val_loss_cycle:.4f}, div: {val_loss_diversity:.4f})")
+        print(f"  Val Accuracy: {val_accuracy:.4f} (class 0: {val_accuracy_class_0:.4f}, class 1: {val_accuracy_class_1:.4f})")
+        print(f"  Class Balance Δ: {class_balance_delta:.4f}")
+
+        # Log current loss weights
+        if controller is not None:
+            print(f"\n  Current Loss Weights:")
+            print(f"    diversity_weight: {controller.diversity_weight:.4f}")
+            print(f"    cycle_weight: {controller.cycle_weight:.4f}")
+            print(f"    learning_rate: {controller.lr:.6f}")
+
+        # Log physics metrics
+        if physics_metrics:
+            print(f"\n  Physics Metrics:")
+            print(f"    q_neural (safety factor): {physics_metrics['q_neural']:.3f} [{physics_metrics['stability']}]")
+            print(f"    Coupling strength: {physics_metrics['coupling_strength']:.3f}")
+
+            if 'T_L1' in physics_metrics:
+                print(f"    Temperature L1: {physics_metrics['T_L1']:.3f}")
+            if 'T_L2' in physics_metrics:
+                print(f"    Temperature L2: {physics_metrics['T_L2']:.3f}")
+            if 'T_L3' in physics_metrics:
+                print(f"    Temperature L3: {physics_metrics['T_L3']:.3f}")
+            if 'T_gradient' in physics_metrics:
+                print(f"    Temperature gradient: {physics_metrics['T_gradient']:.3f} [{physics_metrics['profile_type']}]")
+
+            print(f"    Lawson Q factor: {physics_metrics['Q_factor']:.3f} [{physics_metrics['status']}]")
+
+            # Display warnings
+            if physics_metrics['warnings']:
+                print(f"\n  ⚠️  WARNINGS [{physics_metrics['alert_level']}]:")
+                for warning in physics_metrics['warnings']:
+                    print(f"    {warning}")
+
+        # Log adaptive adjustments
+        if adjustments and adjustments['actions']:
+            print(f"\n  Adaptive Adjustments:")
+            for action in adjustments['actions']:
+                print(f"    {action}")
+
+        # Save epoch data
+        epoch_data = {
+            "epoch": epoch + 1,
+            "train_loss": train_loss,
+            "train_loss_task": train_loss_task,
+            "train_loss_aux": train_loss_aux,
+            "train_loss_cycle": train_loss_cycle,
+            "train_loss_diversity": train_loss_diversity,
+            "val_loss": val_loss,
+            "val_loss_task": val_loss_task,
+            "val_loss_aux": val_loss_aux,
+            "val_loss_cycle": val_loss_cycle,
+            "val_loss_diversity": val_loss_diversity,
+            "val_accuracy": val_accuracy,
+            "val_accuracy_class_0": val_accuracy_class_0,
+            "val_accuracy_class_1": val_accuracy_class_1,
+            "class_balance_delta": class_balance_delta,
+        }
+
+        # Add physics metrics to history
+        if physics_metrics:
+            epoch_data["physics_metrics"] = {
+                "q_neural": physics_metrics['q_neural'],
+                "stability": physics_metrics['stability'],
+                "coupling_strength": physics_metrics['coupling_strength'],
+                "T_L1": physics_metrics.get('T_L1', 0.0),
+                "T_L2": physics_metrics.get('T_L2', 0.0),
+                "T_L3": physics_metrics.get('T_L3', 0.0),
+                "T_gradient": physics_metrics.get('T_gradient', 0.0),
+                "profile_type": physics_metrics.get('profile_type', 'unknown'),
+                "Q_factor": physics_metrics['Q_factor'],
+                "lawson_status": physics_metrics['status'],
+                "alert_level": physics_metrics['alert_level'],
+                "warnings": physics_metrics['warnings']
+            }
+
+        # Add adaptive adjustments to history
+        if adjustments:
+            epoch_data["adaptive_adjustments"] = adjustments
+
+        history.append(epoch_data)
+
+        # Early stopping
+        if val_accuracy > best_val_accuracy:
+            best_val_accuracy = val_accuracy
+            best_val_loss = val_loss
+            patience_counter = 0
+            print(f"\n  ✓ New best accuracy: {best_val_accuracy:.4f}")
+        else:
+            patience_counter += 1
+            if patience_counter >= config["patience"]:
+                print(f"\n  Early stopping triggered (patience={config['patience']})")
+                break
+
+    # Final results
+    print("\n" + "="*70)
+    print("FINAL RESULTS & COMBINED FIX ANALYSIS")
+    print("="*70)
+
+    results = {
+        "variant_name": "6level_combined_fix",
+        "config": config,
+        "epochs_trained": epoch + 1,
+        "training_time_seconds": None,  # TODO: track time
+        "best_val_loss": best_val_loss,
+        "best_val_accuracy": best_val_accuracy,
+        "final_metrics": history[-1] if history else {},
+        "history": history,
+        "status": "completed"
+    }
+
+    if controller is not None:
+        results["adjustment_history"] = controller.adjustment_history
+
+    print(f"\nBest Val Accuracy: {best_val_accuracy:.4f}")
+    print(f"Final Class Balance Δ: {history[-1]['class_balance_delta']:.4f}")
+    print(f"Final Cycle Loss: {history[-1]['val_loss_cycle']:.4f}")
+    print(f"Final Diversity Loss: {history[-1]['val_loss_diversity']:.4f}")
+
+    # Analyze synergy
+    print(f"\n{'='*70}")
+    print("SYNERGY ANALYSIS")
+    print(f"{'='*70}")
+
+    if controller is not None and len(controller.adjustment_history) > 0:
+        print(f"\nAdaptive Control Summary:")
+        print(f"  Total adjustments made: {len([a for a in controller.adjustment_history if a['actions']])}")
+        print(f"  Final diversity_weight: {controller.diversity_weight:.4f} (initial: {config['diversity_weight']:.4f})")
+        print(f"  Final cycle_weight: {controller.cycle_weight:.4f} (initial: {config['cycle_weight']:.4f})")
+        print(f"  Final learning_rate: {controller.lr:.6f} (initial: {config['learning_rate']:.6f})")
+
+    # Comparison to baseline
+    baseline_accuracy = 0.5126
+    baseline_balance_delta = 0.2960
+
+    print(f"\nComparison to 3-level fusion baseline:")
+    print(f"  Accuracy: {best_val_accuracy:.4f} vs {baseline_accuracy:.4f} (Δ {best_val_accuracy - baseline_accuracy:+.4f})")
+    print(f"  Balance Δ: {history[-1]['class_balance_delta']:.4f} vs {baseline_balance_delta:.4f} (Δ {history[-1]['class_balance_delta'] - baseline_balance_delta:+.4f})")
+
+    # Success criteria from NSM-32
+    if best_val_accuracy >= 0.55 and history[-1]['class_balance_delta'] < 0.40:
+        print("\n✅ SUCCESS: Passed primary criteria (accuracy ≥55%, balance Δ <40%)")
+    else:
+        print("\n⚠️  PARTIAL: Did not meet all primary criteria")
+        if best_val_accuracy < 0.55:
+            print(f"   - Accuracy below target: {best_val_accuracy:.4f} < 0.55")
+        if history[-1]['class_balance_delta'] >= 0.40:
+            print(f"   - Balance delta above target: {history[-1]['class_balance_delta']:.4f} >= 0.40")
+
+    # Save results
+    output_path = "/tmp/6level_combined_fix_results.json"
+    with open(output_path, 'w') as f:
+        json.dump(results, f, indent=2)
+
+    print(f"\nResults saved to {output_path}")
+
+    return results
+
+
+@app.local_entrypoint()
+def main():
+    """
+    Local entrypoint for running combined fix validation.
+    """
+    print("Launching combined fix validation on Modal...")
+    results = validate_combined_fix.remote()
+
+    print("\n" + "="*70)
+    print("VALIDATION COMPLETE")
+    print("="*70)
+    print(f"\nFinal Accuracy: {results['best_val_accuracy']:.4f}")
+    print(f"Final Balance Δ: {results['final_metrics']['class_balance_delta']:.4f}")
+
+    # Display physics metrics summary
+    if "physics_metrics" in results['final_metrics']:
+        pm = results['final_metrics']['physics_metrics']
+        print(f"\nFinal Physics Metrics:")
+        print(f"  q_neural: {pm['q_neural']:.3f} [{pm['stability']}]")
+        print(f"  Q factor: {pm['Q_factor']:.3f} [{pm['lawson_status']}]")
+        print(f"  Alert level: {pm['alert_level']}")
diff --git a/nsm/training/physics_metrics.py b/nsm/training/physics_metrics.py
index 5babd38..388c8f5 100644
--- a/nsm/training/physics_metrics.py
+++ b/nsm/training/physics_metrics.py
@@ -1,15 +1,22 @@
 """
-Physics-inspired metrics for predicting class collapse in chiral neural architectures.
+Physics-inspired empirical heuristics for predicting class collapse in chiral neural architectures.
 
-Implements fusion-plasma isomorphism metrics:
+Implements fusion-plasma-inspired metrics:
 - Safety factor q_neural (stability predictor)
-- Temperature profiles (diversity tracking)
-- Lawson criterion (training success predictor)
+- Representation variance profiles (diversity tracking)
+- Lawson criterion analog (training success predictor)
 
-Based on the discovered mathematical parallels between:
+**Note**: These are empirical heuristics (not rigorous isomorphisms) inspired by structural
+similarities to fusion plasma systems. Dimensional analysis reveals they lack true physical
+correspondence, but remain useful predictive tools validated through NSM-33 experiments.
+
+**Peer Review**: Terminology updated per research-assistant feedback (2025-10-23).
+See TERMINOLOGY_UPDATES.md for complete rationale and change log.
+
+Mathematical parallels (structural, not isomorphic):
 - Neural class collapse ↔ Plasma confinement loss
-- α/β fusion parameters ↔ α/β hinge mixing weights
-- Temperature regulation ↔ Diversity maintenance
+- α/β hinge parameters ↔ α/β fusion parameters
+- Representation variance ↔ Temperature in fusion systems
 
 References:
 - Lawson, J.D. (1957). "Some Criteria for a Power Producing Thermonuclear Reactor"
@@ -104,25 +111,29 @@ def compute_temperature_profile(
     method: str = 'variance'
 ) -> Dict[str, float]:
     """
-    Compute "temperature" (diversity/entropy) at each hierarchical level.
+    Compute representation variance profile at each hierarchical level.
+
+    **Note**: "Temperature" here refers to representation variance/entropy, NOT thermal
+    temperature. The term is borrowed from fusion physics by analogy but represents a
+    fundamentally different quantity (statistical dispersion, not kinetic energy).
 
-    In fusion plasmas, temperature profiles T(r) determine confinement quality.
-    In neural networks, representation diversity serves analogous role:
-        - High T: Diverse, information-rich representations
-        - Low T: Collapsed, uniform representations
-        - Inverted profile (T_core < T_edge): Instability warning
+    In the fusion analogy: temperature profiles T(r) determine confinement quality.
+    In neural networks: representation variance serves structurally analogous role:
+        - High variance: Diverse, information-rich representations
+        - Low variance: Collapsed, uniform representations
+        - Inverted profile (variance decreasing with abstraction): Instability indicator
 
-    Temperature inversions predict collapse events (analogous to sawteeth oscillations).
+    Variance inversions empirically correlate with collapse events in NSM-33 experiments.
 
     Args:
         level_representations: Dict mapping level names to feature tensors
             e.g., {'L1': x_l1, 'L2': x_l2, 'L3': x_l3}
-        method: 'variance' or 'entropy' for temperature computation
+        method: 'variance' or 'entropy' for measurement
 
     Returns:
         Dict with:
-            - 'T_{level}': Temperature at each level
-            - 'T_gradient': Temperature gradient (L1 → L3)
+            - 'T_{level}': Variance/entropy at each level (NOT thermal temperature)
+            - 'T_gradient': Variance gradient (L1 → L3)
             - 'profile_type': 'normal', 'flat', or 'inverted'
     """
     temperatures = {}
@@ -133,10 +144,10 @@ def compute_temperature_profile(
             continue
 
         if method == 'variance':
-            # Variance-based temperature: Spread of representations
+            # Variance-based measurement: Spread of representations
             temp = x.var(dim=0).mean().item()
         elif method == 'entropy':
-            # Entropy-based temperature: Information content
+            # Entropy-based measurement: Information content
             # Use softmax to get probability distribution
             probs = torch.softmax(x, dim=-1)
             entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean().item()
@@ -146,7 +157,7 @@ def compute_temperature_profile(
 
         temperatures[f'T_{level_name}'] = temp
 
-    # Compute temperature gradient (should be positive: L1 < L2 < L3)
+    # Compute variance gradient (should be positive: L1 < L2 < L3 for healthy hierarchy)
     level_order = sorted([k for k in temperatures.keys() if k.startswith('T_L')])
     if len(level_order) >= 2:
         T_first = temperatures[level_order[0]]
diff --git a/nsm/utils/checkpoint_manager.py b/nsm/utils/checkpoint_manager.py
new file mode 100644
index 0000000..93ebb29
--- /dev/null
+++ b/nsm/utils/checkpoint_manager.py
@@ -0,0 +1,247 @@
+"""
+Checkpoint management utilities for NSM experiments.
+
+Provides consistent checkpoint saving/loading across local and Modal environments.
+"""
+
+import torch
+import json
+from pathlib import Path
+from typing import Dict, Optional, Any
+from datetime import datetime
+
+
+class CheckpointManager:
+    """
+    Manages model checkpoint saving and loading.
+
+    Features:
+    - Consistent format across experiments
+    - Metadata tracking (config, metrics, timestamp)
+    - Best model tracking
+    - Modal volume integration
+    """
+
+    def __init__(self, checkpoint_dir: str = "/checkpoints", experiment_name: str = "nsm"):
+        """
+        Initialize checkpoint manager.
+
+        Args:
+            checkpoint_dir: Directory for checkpoints (Modal volume path or local)
+            experiment_name: Experiment identifier
+        """
+        self.checkpoint_dir = Path(checkpoint_dir)
+        self.experiment_name = experiment_name
+        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
+
+    def save_checkpoint(
+        self,
+        model: torch.nn.Module,
+        epoch: int,
+        metrics: Dict[str, float],
+        config: Dict[str, Any],
+        optimizer: Optional[torch.optim.Optimizer] = None,
+        is_best: bool = False,
+        prefix: str = ""
+    ) -> Path:
+        """
+        Save model checkpoint with metadata.
+
+        Args:
+            model: PyTorch model
+            epoch: Current epoch number
+            metrics: Dictionary of validation metrics
+            config: Training configuration
+            optimizer: Optional optimizer state
+            is_best: Whether this is the best model so far
+            prefix: Optional prefix for checkpoint filename
+
+        Returns:
+            Path to saved checkpoint
+        """
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+        if prefix:
+            filename = f"{prefix}_{self.experiment_name}_epoch{epoch}_{timestamp}.pt"
+        else:
+            filename = f"{self.experiment_name}_epoch{epoch}_{timestamp}.pt"
+
+        checkpoint_path = self.checkpoint_dir / filename
+
+        checkpoint = {
+            'epoch': epoch,
+            'model_state_dict': model.state_dict(),
+            'metrics': metrics,
+            'config': config,
+            'timestamp': timestamp,
+            'experiment_name': self.experiment_name
+        }
+
+        if optimizer is not None:
+            checkpoint['optimizer_state_dict'] = optimizer.state_dict()
+
+        # Save checkpoint
+        torch.save(checkpoint, checkpoint_path)
+        print(f"💾 Saved checkpoint: {checkpoint_path}")
+
+        # Also save best model separately
+        if is_best:
+            best_path = self.checkpoint_dir / f"{self.experiment_name}_best.pt"
+            torch.save(checkpoint, best_path)
+            print(f"🌟 Saved best model: {best_path}")
+
+        # Save metadata JSON for easy inspection
+        metadata_path = checkpoint_path.with_suffix('.json')
+        metadata = {
+            'epoch': epoch,
+            'metrics': metrics,
+            'config': config,
+            'timestamp': timestamp,
+            'checkpoint_file': filename,
+            'is_best': is_best
+        }
+        with open(metadata_path, 'w') as f:
+            json.dump(metadata, f, indent=2, default=str)
+
+        return checkpoint_path
+
+    def load_checkpoint(
+        self,
+        checkpoint_path: Path,
+        model: torch.nn.Module,
+        optimizer: Optional[torch.optim.Optimizer] = None,
+        device: str = 'cpu'
+    ) -> Dict[str, Any]:
+        """
+        Load checkpoint into model.
+
+        Args:
+            checkpoint_path: Path to checkpoint file
+            model: Model to load weights into
+            optimizer: Optional optimizer to restore state
+            device: Device to map tensors to
+
+        Returns:
+            Checkpoint dictionary with metadata
+        """
+        checkpoint = torch.load(checkpoint_path, map_location=device)
+
+        model.load_state_dict(checkpoint['model_state_dict'])
+        print(f"✅ Loaded model from epoch {checkpoint['epoch']}")
+
+        if optimizer is not None and 'optimizer_state_dict' in checkpoint:
+            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
+            print(f"✅ Restored optimizer state")
+
+        return checkpoint
+
+    def load_best_checkpoint(
+        self,
+        model: torch.nn.Module,
+        optimizer: Optional[torch.optim.Optimizer] = None,
+        device: str = 'cpu'
+    ) -> Optional[Dict[str, Any]]:
+        """
+        Load best checkpoint for this experiment.
+
+        Args:
+            model: Model to load weights into
+            optimizer: Optional optimizer
+            device: Device to map to
+
+        Returns:
+            Checkpoint dict if found, None otherwise
+        """
+        best_path = self.checkpoint_dir / f"{self.experiment_name}_best.pt"
+
+        if not best_path.exists():
+            print(f"⚠️  No best checkpoint found at {best_path}")
+            return None
+
+        return self.load_checkpoint(best_path, model, optimizer, device)
+
+    def list_checkpoints(self) -> list:
+        """List all checkpoints for this experiment."""
+        pattern = f"{self.experiment_name}*.pt"
+        checkpoints = sorted(self.checkpoint_dir.glob(pattern))
+        return checkpoints
+
+    def get_latest_checkpoint(self) -> Optional[Path]:
+        """Get most recent checkpoint path."""
+        checkpoints = self.list_checkpoints()
+        if not checkpoints:
+            return None
+        return checkpoints[-1]
+
+
+def save_nsm_checkpoint(
+    model: torch.nn.Module,
+    epoch: int,
+    val_accuracy: float,
+    config: Dict[str, Any],
+    checkpoint_dir: str = "/checkpoints",
+    experiment_name: str = "nsm",
+    is_best: bool = False
+) -> Path:
+    """
+    Convenience function for NSM checkpoint saving.
+
+    Args:
+        model: NSM model
+        epoch: Training epoch
+        val_accuracy: Validation accuracy
+        config: Training config
+        checkpoint_dir: Checkpoint directory
+        experiment_name: Experiment name
+        is_best: Is this the best model?
+
+    Returns:
+        Path to saved checkpoint
+    """
+    manager = CheckpointManager(checkpoint_dir, experiment_name)
+
+    metrics = {'val_accuracy': val_accuracy}
+
+    return manager.save_checkpoint(
+        model=model,
+        epoch=epoch,
+        metrics=metrics,
+        config=config,
+        is_best=is_best
+    )
+
+
+def load_nsm_checkpoint(
+    model: torch.nn.Module,
+    checkpoint_path: str,
+    device: str = 'cpu'
+) -> Dict[str, Any]:
+    """
+    Convenience function for NSM checkpoint loading.
+
+    Args:
+        model: NSM model to load into
+        checkpoint_path: Path to checkpoint
+        device: Device to map to
+
+    Returns:
+        Checkpoint metadata
+    """
+    checkpoint_path = Path(checkpoint_path)
+
+    if not checkpoint_path.exists():
+        raise FileNotFoundError(f"Checkpoint not found: {checkpoint_path}")
+
+    # Infer experiment name from filename
+    experiment_name = checkpoint_path.stem.split('_')[0]
+    manager = CheckpointManager(checkpoint_path.parent, experiment_name)
+
+    return manager.load_checkpoint(checkpoint_path, model, device=device)
+
+
+# Export public API
+__all__ = [
+    'CheckpointManager',
+    'save_nsm_checkpoint',
+    'load_nsm_checkpoint'
+]
diff --git a/results/NSM-33_10x_validation_results.md b/results/NSM-33_10x_validation_results.md
index 9f4a47c..ba4166f 100644
--- a/results/NSM-33_10x_validation_results.md
+++ b/results/NSM-33_10x_validation_results.md
@@ -10,7 +10,9 @@
 
 ## Executive Summary
 
-Scaled validation at 10x dataset size (N≈14,000 vs N=2,000) confirms physics-inspired metrics provide actionable diagnostic value for neural class collapse prediction. All three experimental tracks demonstrated substantial improvements over the pilot baseline, with best validation accuracy increasing from 48.16% to 67.11% (+39.3% relative improvement). Physics-based adaptive control achieved superior class balance (Δ=2.28%), while diversity regularization successfully corrected the inverted temperature profile that plagued the pilot study.
+Scaled validation at 10x dataset size (N≈14,000 vs N=2,000) confirms physics-inspired empirical heuristics provide actionable diagnostic value for neural class collapse prediction. All three experimental tracks demonstrated substantial improvements over the pilot baseline, with best validation accuracy increasing from 48.16% to 67.11% (+39.3% relative improvement). Physics-based adaptive control achieved superior class balance (Δ=2.28%), while diversity regularization successfully corrected the inverted representation variance profile that plagued the pilot study.
+
+**Note on Terminology**: This document uses physics-inspired terminology (q_neural, "temperature" profile) for metrics that are **empirical heuristics** rather than rigorous physical isomorphisms. While structurally analogous to fusion plasma systems, dimensional analysis reveals these metrics lack true physical correspondence. They remain valuable predictive tools validated through experiment.
 
 **Key Findings**:
 - **Scale benefits confirmed**: 10x dataset increase yielded +15-18% absolute accuracy gains across all conditions
@@ -24,7 +26,7 @@ Scaled validation at 10x dataset size (N≈14,000 vs N=2,000) confirms physics-i
 
 **H1 (Track A - Scale)**: Scaling to N=20K will improve accuracy by ≥10% absolute
 **H2 (Track B - Adaptive)**: Physics-informed control will achieve better class balance than baseline
-**H3 (Track C - Temperature)**: Diversity regularization will correct inverted temperature profile
+**H3 (Track C - Variance Profile)**: Diversity regularization will correct inverted representation variance profile
 
 ### Hypothesis Outcomes
 - **H1**: ✅ **CONFIRMED** - Achieved +15.85% to +18.38% improvement (exceeded 10% threshold)
@@ -49,14 +51,15 @@ Scaled validation at 10x dataset size (N≈14,000 vs N=2,000) confirms physics-i
 | Class Balance Δ | 5.91% | -23.69% (improved) |
 | Training Epochs | 30 | Same |
 
-**Physics Metrics (Final Epoch)**:
+**Empirical Stability Metrics (Final Epoch)**:
 - **q_neural**: 1.336 [STABLE] - Above critical threshold (q > 1.0)
-- **Temperature Gradient**: 13.209 [NORMAL] - Positive gradient (T_L1 < T_L3)
+- **Variance Gradient**: 13.209 [NORMAL] - Positive gradient (T_L1 < T_L3)
 - **Lawson Q Factor**: 0.001 [SUBIGNITION] - Below ignition threshold
-- **Temperature Profile**: T_L1=0.381, T_L2=3.268, T_L3=13.590
+- **Representation Variance Profile**: T_L1=0.381, T_L2=3.268, T_L3=13.590
+  - Note: "T" denotes variance/entropy, not thermal temperature
 
 **Analysis**:
-Scale-up yielded dramatic improvement over pilot baseline (48.16% → 67.11%), confirming H1. Surprisingly, temperature profile normalized at scale without intervention, contrasting with pilot's persistent inversion. However, q_neural remained stable throughout training, suggesting larger datasets provide inherent regularization against collapse.
+Scale-up yielded dramatic improvement over pilot baseline (48.16% → 67.11%), confirming H1. Surprisingly, variance profile normalized at scale without intervention, contrasting with pilot's persistent inversion. However, q_neural remained stable throughout training, suggesting larger datasets provide inherent regularization against collapse.
 
 **Modal Experiment**: [ap-lxqvebfqwVMS3Pbbqd069W](https://modal.com/apps/research-developer/main/ap-lxqvebfqwVMS3Pbbqd069W)
 
diff --git a/scripts/download_checkpoints.py b/scripts/download_checkpoints.py
new file mode 100755
index 0000000..627fcef
--- /dev/null
+++ b/scripts/download_checkpoints.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python3
+"""
+Download checkpoints from Modal volume to local repo.
+
+Usage:
+    python scripts/download_checkpoints.py
+    python scripts/download_checkpoints.py --pattern "*best*"
+"""
+
+import subprocess
+import argparse
+from pathlib import Path
+
+
+def download_checkpoints(pattern: str = "*.pt", destination: str = "checkpoints"):
+    """Download checkpoints from Modal volume."""
+    dest_path = Path(destination)
+    dest_path.mkdir(parents=True, exist_ok=True)
+
+    print(f"📥 Downloading checkpoints matching '{pattern}' to {dest_path}/")
+
+    # List available checkpoints
+    print("\n🔍 Available checkpoints in Modal volume:")
+    result = subprocess.run(
+        ["modal", "volume", "ls", "nsm-checkpoints"],
+        capture_output=True,
+        text=True
+    )
+    print(result.stdout)
+
+    # Download checkpoints
+    cmd = [
+        "modal", "volume", "get",
+        "nsm-checkpoints",
+        str(dest_path)
+    ]
+
+    print(f"\n⬇️  Downloading...")
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode == 0:
+        print("✅ Download complete!")
+
+        # List what we downloaded
+        checkpoints = list(dest_path.glob("*.pt"))
+        if checkpoints:
+            print(f"\n📦 Downloaded {len(checkpoints)} checkpoints:")
+            for cp in sorted(checkpoints):
+                size = cp.stat().st_size / (1024 * 1024)  # MB
+                print(f"   {cp.name} ({size:.1f} MB)")
+        else:
+            print("⚠️  No .pt files found in volume")
+
+        # Also check for JSON results
+        json_files = list(dest_path.glob("*.json"))
+        if json_files:
+            print(f"\n📄 Also found {len(json_files)} result files:")
+            for jf in sorted(json_files):
+                print(f"   {jf.name}")
+
+    else:
+        print(f"❌ Error: {result.stderr}")
+
+    return result.returncode == 0
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Download checkpoints from Modal")
+    parser.add_argument(
+        "--pattern",
+        default="*.pt",
+        help="Pattern to match checkpoint files"
+    )
+    parser.add_argument(
+        "--destination",
+        default="checkpoints",
+        help="Local destination directory"
+    )
+
+    args = parser.parse_args()
+
+    success = download_checkpoints(args.pattern, args.destination)
+    exit(0 if success else 1)