Skip to content

Latest commit

 

History

History
1042 lines (850 loc) · 38.9 KB

File metadata and controls

1042 lines (850 loc) · 38.9 KB

SIMPLE Design Document

Strategic Vision: Python-First API Design

CORE PRINCIPLE: Python serves as performance-first API prototype that demonstrates clean design patterns for subsequent Fortran API implementation, with eventual API convergence.

Four-Phase Strategic Roadmap

Phase 1: Python API Prototype (simple Module) - BATCH-ORIENTED HPC DESIGN

CORE ARCHITECTURE: Expose existing SoA performance through batch-oriented APIs

1.1 Module Structure and Class Hierarchy
simple/
├── __init__.py          # Public API: ParticleBatch, BatchResults, trace_orbits
├── core/
│   ├── __init__.py
│   ├── particles.py     # ParticleBatch class - wraps zstart(5, n_particles)
│   ├── results.py       # BatchResults class - wraps result arrays
│   ├── config.py        # Config class - replaces namelist globals
│   └── simulation.py    # High-level Simulation wrapper
├── backends/
│   ├── __init__.py
│   ├── fortran.py       # f90wrap interface to existing pysimple
│   └── native.py        # Future: Direct memory interface
├── samplers/
│   ├── __init__.py
│   ├── surface.py       # Surface sampling - wraps existing samplers.f90
│   ├── volume.py        # Volume sampling
│   └── file.py          # File-based initialization
└── utils/
    ├── __init__.py
    ├── memory.py        # Memory-efficient streaming utilities
    └── visualization.py # Batch visualization tools
1.2 Core Class Design - Zero-Copy SoA Wrappers

ParticleBatch Class: Direct wrapper around existing Fortran SoA arrays

class ParticleBatch:
    """Batch container for particle data using existing SoA layout"""
    
    def __init__(self, n_particles: int):
        """Initialize batch with existing zstart(5, n_particles) layout"""
        self.n_particles = n_particles
        # Direct allocation using existing Fortran allocation patterns
        self._backend = simple._backend.allocate_particle_arrays(n_particles)
        
    @property
    def positions(self) -> np.ndarray:
        """Zero-copy view of zstart array - Shape: (5, n_particles)"""
        # Direct memory view - no copying, existing SoA layout
        return self._backend.get_zstart_view()
    
    @property 
    def coordinates(self) -> Coordinates:
        """Structured access to particle coordinates"""
        pos = self.positions
        return Coordinates(
            s=pos[0, :],      # Flux surface coordinate
            theta=pos[1, :],  # Poloidal angle  
            phi=pos[2, :],    # Toroidal angle
            v_par=pos[3, :],  # Parallel velocity
            mu=pos[4, :]      # Magnetic moment
        )
    
    def initialize_surface(self, vmec_file: str, s: float, **kwargs):
        """Initialize particles on flux surface using existing samplers"""
        # Calls existing samplers.f90 functions
        simple._backend.sample_flux_surface(
            self._backend.arrays, vmec_file, s, self.n_particles, **kwargs
        )
    
    def initialize_volume(self, vmec_file: str, s_min: float, s_max: float):
        """Volume sampling using existing implementation"""
        simple._backend.sample_volume(
            self._backend.arrays, vmec_file, s_min, s_max, self.n_particles
        )
    
    def to_dict(self) -> Dict[str, np.ndarray]:
        """Export data for serialization/analysis"""
        return {
            'coordinates': self.positions.copy(),
            'n_particles': self.n_particles,
            'metadata': self._backend.get_metadata()
        }
    
    @classmethod
    def from_dict(cls, data: Dict) -> 'ParticleBatch':
        """Reconstruct from serialized data"""
        batch = cls(data['n_particles'])
        batch.positions[:] = data['coordinates']
        return batch

BatchResults Class: Wrapper around existing result arrays

class BatchResults:
    """Results container wrapping existing Fortran output arrays"""
    
    def __init__(self, backend_results):
        self._backend = backend_results
        self.n_particles = backend_results.n_particles
    
    @property
    def loss_times(self) -> np.ndarray:
        """Access to existing times_lost array - Shape: (n_particles,)"""
        return self._backend.get_times_lost_view()  # Zero-copy view
    
    @property
    def final_positions(self) -> np.ndarray:
        """Access to existing zend array - Shape: (5, n_particles)"""
        return self._backend.get_zend_view()  # Zero-copy view
    
    @property
    def trap_parameter(self) -> np.ndarray:
        """Access to existing trap_par array"""
        return self._backend.get_trap_par_view()
    
    @property
    def perpendicular_invariant(self) -> np.ndarray:
        """Access to existing perp_inv array"""
        return self._backend.get_perp_inv_view()
    
    def confinement_statistics(self) -> ConfinementStats:
        """Vectorized analysis of confinement using existing arrays"""
        lost_mask = self.loss_times < np.inf
        confined_mask = ~lost_mask
        
        return ConfinementStats(
            n_total=self.n_particles,
            n_confined=confined_mask.sum(),
            n_lost=lost_mask.sum(),
            confined_fraction=confined_mask.sum() / self.n_particles,
            mean_loss_time=self.loss_times[lost_mask].mean() if lost_mask.any() else np.inf,
            loss_time_distribution=np.histogram(self.loss_times[lost_mask], bins=50)
        )
    
    def save_hdf5(self, filename: str, append: bool = False):
        """Efficient HDF5 export of all result arrays"""
        import h5py
        mode = 'a' if append else 'w'
        with h5py.File(filename, mode) as f:
            grp = f.create_group(f'batch_{len(f.keys())}' if append else 'results')
            grp.create_dataset('loss_times', data=self.loss_times, compression='gzip')
            grp.create_dataset('final_positions', data=self.final_positions, compression='gzip')
            grp.create_dataset('trap_parameter', data=self.trap_parameter, compression='gzip')
            grp.create_dataset('perpendicular_invariant', data=self.perpendicular_invariant, compression='gzip')
1.3 High-Performance Batch Processing Function
def trace_orbits(
    particles: ParticleBatch,
    tmax: float,
    integrator: str = 'symplectic',
    openmp_threads: Optional[int] = None,
    memory_efficient: bool = False
) -> BatchResults:
    """
    Batch orbit tracing using existing OpenMP parallelized implementation
    
    Performance: <5% overhead vs direct Fortran execution
    Memory: Zero-copy access to existing SoA arrays
    Scaling: Leverages existing OpenMP thread-per-particle pattern
    """
    # Configure OpenMP threads (uses existing OpenMP infrastructure)
    if openmp_threads is not None:
        os.environ['OMP_NUM_THREADS'] = str(openmp_threads)
    
    # Set up configuration using existing parameter system
    config = {
        'tmax': tmax,
        'integmode': _integrator_map[integrator],
        'ntestpart': particles.n_particles,
        'memory_efficient': memory_efficient
    }
    
    # Call existing optimized Fortran implementation
    # Maps directly to existing simple_main.f90 structure
    result_arrays = simple._backend.run_simulation(
        particles._backend.arrays,  # Uses existing zstart allocation
        config
    )
    
    return BatchResults(result_arrays)
1.4 Memory-Efficient Streaming for Large Datasets
class ParticleBatchStream:
    """Memory-efficient streaming for processing millions of particles"""
    
    def __init__(self, total_particles: int, batch_size: int = 100_000):
        self.total_particles = total_particles
        self.batch_size = batch_size
        self.current_batch = 0
    
    def __iter__(self):
        return self
    
    def __next__(self) -> ParticleBatch:
        start_idx = self.current_batch * self.batch_size
        if start_idx >= self.total_particles:
            raise StopIteration
        
        end_idx = min(start_idx + self.batch_size, self.total_particles)
        actual_size = end_idx - start_idx
        
        # Create batch using existing allocation patterns
        batch = ParticleBatch(actual_size)
        self.current_batch += 1
        return batch

def process_large_simulation(
    vmec_file: str,
    n_total: int,
    tmax: float,
    batch_size: int = 100_000,
    output_file: str = 'results.h5'
) -> StreamResults:
    """
    Process millions of particles in memory-efficient batches
    
    Memory Usage: Constant - independent of total particle count
    Performance: Linear scaling with existing OpenMP optimization
    Output: Streaming HDF5 for large dataset handling
    """
    stream = ParticleBatchStream(n_total, batch_size)
    
    for i, batch in enumerate(stream):
        # Initialize using existing sampling
        batch.initialize_surface(vmec_file, s=0.9)
        
        # Process using existing optimized algorithms
        results = trace_orbits(batch, tmax=tmax)
        
        # Stream results to disk
        results.save_hdf5(output_file, append=(i > 0))
        
        # Clean memory between batches
        del batch, results
    
    return StreamResults(output_file, n_total)
1.5 GPU-Ready Architecture Preparation
class GPUParticleBatch(ParticleBatch):
    """GPU-optimized batch using existing SoA layout for future CUDA/OpenCL"""
    
    def __init__(self, n_particles: int, device: str = 'cuda:0'):
        super().__init__(n_particles)
        self.device = device
        # Future: GPU memory allocation preserving SoA layout
    
    def to_gpu(self) -> 'GPUParticleBatch':
        """Transfer to GPU maintaining SoA memory layout"""
        # Existing SoA layout is optimal for coalesced GPU memory access
        try:
            import cupy as cp
            gpu_positions = cp.asarray(self.positions)  # Efficient transfer
            return GPUParticleBatch._from_gpu_array(gpu_positions)
        except ImportError:
            raise RuntimeError("CuPy required for GPU operations")
    
    def from_gpu(self) -> ParticleBatch:
        """Efficient GPU->CPU transfer maintaining SoA layout"""
        cpu_positions = cupy.asnumpy(self._gpu_positions)
        batch = ParticleBatch(self.n_particles)
        batch.positions[:] = cpu_positions
        return batch

# Future GPU kernel interface design
def trace_orbits_gpu(particles: GPUParticleBatch, **kwargs) -> 'GPUBatchResults':
    """
    Future GPU implementation leveraging existing SoA layout
    
    Memory Pattern: positions[5, n_particles] - optimal for coalesced access
    Thread Mapping: One thread per particle - matches existing OpenMP pattern
    Algorithm: Identical to existing symplectic integrators
    """
    # GPU kernels will operate on existing SoA memory layout
    # positions[coordinate, particle] allows coalesced memory access
    # Each thread processes one particle - natural mapping from OpenMP
    pass
1.6 Integration with Existing f90wrap Interface
# backends/fortran.py - Wrapper around existing pysimple module
class FortranBackend:
    """Integration layer with existing f90wrap pysimple module"""
    
    def __init__(self):
        try:
            import pysimple  # Existing f90wrap module
            self.pysimple = pysimple
        except ImportError:
            raise RuntimeError("pysimple module not available - build with f90wrap")
    
    def allocate_particle_arrays(self, n_particles: int):
        """Use existing Fortran allocation for zstart arrays"""
        # Calls existing params_mod allocation routines
        self.pysimple.params_mod.ntestpart = n_particles
        self.pysimple.alloc_arrays()  # Existing allocation routine
        
        return FortranArrayWrapper(self.pysimple, n_particles)
    
    def run_simulation(self, arrays, config):
        """Execute simulation using existing simple_main structure"""
        # Set parameters using existing namelist system
        for key, value in config.items():
            setattr(self.pysimple.params_mod, key, value)
        
        # Call existing main simulation loop
        self.pysimple.run_simple()  # Existing OpenMP parallelized execution
        
        return FortranResultWrapper(self.pysimple)

class FortranArrayWrapper:
    """Zero-copy wrapper around existing Fortran arrays"""
    
    def __init__(self, pysimple_module, n_particles):
        self.pysimple = pysimple_module
        self.n_particles = n_particles
    
    def get_zstart_view(self) -> np.ndarray:
        """Direct view of existing zstart array"""
        # f90wrap provides direct memory access to Fortran arrays
        return self.pysimple.params_mod.zstart  # Shape: (5, ntestpart)
    
    def get_zend_view(self) -> np.ndarray:
        """Direct view of existing zend array"""
        return self.pysimple.params_mod.zend
    
    def get_times_lost_view(self) -> np.ndarray:
        """Direct view of existing times_lost array"""
        return self.pysimple.params_mod.times_lost
1.7 Performance Validation Framework
class PerformanceValidator:
    """Validate that Python API preserves existing HPC performance"""
    
    def validate_soa_layout(self, batch: ParticleBatch):
        """Verify SoA memory layout efficiency"""
        positions = batch.positions
        
        # Test 1: Contiguous memory layout
        assert positions.flags.c_contiguous, "SoA layout must be C-contiguous"
        
        # Test 2: Cache-friendly column access (existing optimization)
        self._benchmark_column_access(positions)
        
        # Test 3: Memory strides match existing Fortran layout
        self._validate_memory_strides(positions)
    
    def validate_openmp_scaling(self, particles: ParticleBatch):
        """Verify existing OpenMP performance is preserved"""
        # Test scaling with different thread counts
        for threads in [1, 2, 4, 8]:
            execution_time = self._benchmark_threads(particles, threads)
            # Validate scaling efficiency matches existing implementation
    
    def validate_memory_overhead(self, n_particles: int):
        """Ensure minimal memory overhead vs existing implementation"""
        baseline = self._measure_fortran_memory(n_particles)
        python_api = self._measure_python_memory(n_particles)
        
        overhead = (python_api - baseline) / baseline
        assert overhead < 0.05, f"Memory overhead too high: {overhead:.1%}"
    
    def golden_record_comparison(self):
        """Bit-identical results vs existing implementation"""
        # Run identical simulation with both APIs
        fortran_results = self._run_fortran_reference()
        python_results = self._run_python_api()
        
        # Results must be identical
        np.testing.assert_array_equal(
            fortran_results['times_lost'],
            python_results.loss_times,
            "Results must match existing implementation exactly"
        )

IMPLEMENTATION TIMELINE:

  • Week 1: Core ParticleBatch and BatchResults classes with f90wrap integration
  • Week 2: Memory-efficient streaming and performance validation framework
  • Week 3: GPU-ready architecture and comprehensive testing
  • Week 4: Documentation, optimization, and golden record validation

Phase 2: Python-Fortran Name Translation Layer

  • AUTOMATIC TRANSLATION: typename_tPEP8 convention mapping
  • TRANSPARENT: Users see only their preferred convention
  • SEAMLESS INTEGRATION: Clean interface between Python and Fortran

Phase 3: Fortran API Refactoring

  • PATTERN IMPLEMENTATION: Apply proven Python patterns to Fortran
  • NAMING CONVENTION: Fortran uses typename_t convention (config_t, simulation_t, orbit_t)
  • ATOMIC STEPS: Safe incremental refactoring preserving simple.x functionality
  • GOLDEN RECORD: Validation ensures no regressions

Phase 4: Python as Thin Wrapper

  • END STATE: Python becomes lightweight wrapper over clean Fortran API
  • PERFORMANCE: Remains in Fortran, Python provides convenience
  • API CONVERGENCE: Both interfaces expose same clean patterns

Current Architecture Analysis

Strengths

  • Performance: Highly optimized symplectic integrators with OpenMP parallelization
  • Scientific Accuracy: Proven algorithms for stellarator particle tracing
  • Robustness: Stable codebase with extensive real-world usage

API Modernization Needs

1. Global State Dependencies

! Current: Global module variables
module params
    real(dp) :: tmax, dtau
    integer :: nparticles, field_type
end module

! Target: Encapsulated configuration
type :: config_t
    real(dp) :: tmax, dtau
    integer :: nparticles, field_type
contains
    procedure :: from_namelist
    procedure :: validate
end type

2. Rigid Initialization Order

! Current: Rigid sequence with side effects
call read_config()
call init_field()  
call params_init()
call run()

! Target: Clean encapsulated interface
type(simulation_t) :: sim
sim = create_simulation(config)
results = sim%run()

3. Low-Level API Exposure

# Current: Complex multi-step setup
import pysimple
pysimple.read_config('simple.in')
pysimple.init_field()
pysimple.params_init()
pysimple.run()

# Target: Clean high-level interface  
import pysimple
config = {'vmec_file': 'wout.nc', 'nparticles': 10000}
sim = simple.Simulation(config)
results = sim.run()

Design Patterns and Conventions

Naming Translation Strategy

Python Convention (PEP8)

class Config:
    def from_dict(self, config_dict): pass
    def to_namelist_string(self): pass

class Simulation:
    def __init__(self, config): pass
    def run(self): pass

class Orbit:
    def trace(self, tmax): pass

Fortran Convention (typename_t)

type :: config_t
contains
    procedure :: from_dict => config_from_dict
    procedure :: to_namelist_string => config_to_namelist
end type

type :: simulation_t  
contains
    procedure :: init => simulation_init
    procedure :: run => simulation_run
end type

type :: orbit_t
contains
    procedure :: trace => orbit_trace
end type

Automatic Translation Layer

# Name translator handles convention mapping
class NameTranslator:
    def python_to_fortran(self, name):
        return f"{name.lower()}_t"
    
    def method_python_to_fortran(self, class_name, method_name):
        return f"{class_name.lower()}_{method_name}"

Core Module Architecture

Python Module Structure (simple/)

simple/
├── __init__.py          # Public API (Config, Simulation, Orbit)
├── config.py           # Configuration management prototype
├── simulation.py       # Batch simulation interface
├── orbit.py            # Single orbit analysis
├── translation.py      # Name convention translator
├── _backend.py         # f90wrap interface to Fortran
└── visualization.py    # Built-in plotting

Fortran Module Structure (Target)

src/api/
├── simple_config.f90      # config_t derived type
├── simple_simulation.f90  # simulation_t derived type  
├── simple_orbit.f90       # orbit_t derived type
├── simple_api.f90         # High-level API functions
└── simple_translation.f90 # Name translation utilities

Performance-First Design Principles

CRITICAL INSIGHT: Current Fortran implementation is ALREADY HPC/SoA optimized:

  • zstart(1:5, ntestpart) - Structure of Arrays pattern ✓
  • OpenMP particle parallelization ✓
  • Cache-friendly column access ✓
  • Million+ particle capability ✓

DESIGN GOAL: Create batch-oriented APIs around existing high-performance core.

1. Batch-Oriented API Design

# LEVERAGES EXISTING: SoA data structures, OpenMP parallelization
import pysimple

# Batch processing (primary use case)
particles = simple.ParticleBatch.from_vmec_surface(
    vmec_file='wout.nc',
    n_particles=100_000,  # Leverages existing SoA: zstart(5, n_particles)
    surface_s=0.9
)

# GPU-ready batch operations
results = simple.trace_orbits(
    particles,
    tmax=1000.0,
    integrator='symplectic_midpoint'  # Uses existing optimized integrators
)

# Direct access to underlying SoA arrays
positions = results.final_positions    # Shape: (5, n_particles) - existing zstart format
loss_times = results.loss_times        # Shape: (n_particles,) - existing times_lost format

2. GPU-Ready SoA Architecture Strategy

# LEVERAGE EXISTING: SoA layout perfect for GPU acceleration
class ParticleBatch:
    def to_gpu(self) -> 'GPUParticleBatch':
        # Existing SoA layout transfers efficiently to GPU
        gpu_positions = cupy.asarray(self.positions)  # Shape: (5, n_particles)
        return GPUParticleBatch(gpu_positions)
    
    def from_gpu(self) -> 'ParticleBatch':
        # Efficient GPU->CPU transfer maintaining SoA layout
        cpu_positions = cupy.asnumpy(self._gpu_positions)
        return ParticleBatch(cpu_positions)

# Future GPU kernel interface (leverages existing SoA)
def trace_orbits_gpu(particles: GPUParticleBatch, **kwargs):
    # GPU kernels operate on existing SoA memory layout
    # positions[5, n_particles] - coalesced memory access
    # One thread per particle - maps to existing OpenMP pattern
    pass

3. Batch Processing Optimization

# STREAM PROCESSING: Handle memory constraints efficiently
def process_large_simulation(n_total: int, batch_size: int = 100_000):
    """Process millions of particles in memory-efficient batches"""
    results = []
    
    for start_idx in range(0, n_total, batch_size):
        end_idx = min(start_idx + batch_size, n_total)
        batch_size_actual = end_idx - start_idx
        
        # Create batch using existing SoA allocation patterns
        batch = simple.ParticleBatch(n_particles=batch_size_actual)
        batch.initialize_surface(vmec_file='wout.nc', s=0.9)
        
        # Use existing OpenMP parallelization
        batch_results = simple.trace_orbits(batch, tmax=1000.0)
        results.append(batch_results)
        
        # Clean memory between batches
        del batch, batch_results
    
    return combine_batch_results(results)

4. Minimal Overhead Requirement

  • Python API must add <5% overhead to Fortran execution
  • Zero-copy array access to existing SoA structures
  • Configuration conversion cached for repeated use
  • Memory management aligned with existing Fortran patterns

3. Scientific Computing Integration

# Clean integration with scientific Python ecosystem
import pysimple
import numpy as np
import matplotlib.pyplot as plt

config = simple.Config.from_file('simple.in')
config.nparticles = 5000

sim = simple.Simulation(config)
results = sim.run()

# Direct NumPy array access to results
confined_fraction = results.confined_fraction  # NumPy array
loss_times = results.loss_times              # NumPy array

# Built-in visualization
results.plot_confinement()
results.plot_poincare_section()

Risk Assessment and Mitigation Strategy

TECHNICAL RISKS

Risk 1: f90wrap Integration Complexity

  • Impact: HIGH - Core dependency for zero-copy array access
  • Probability: MEDIUM - f90wrap can be brittle with complex Fortran types
  • Mitigation:
    • Early prototype with minimal f90wrap interface to validate feasibility
    • Fallback: Native ctypes interface if f90wrap proves problematic
    • Modular backend design allows swapping f90wrap for alternative approaches
    • Test with current pysimple module first to understand existing limitations

Risk 2: Memory Layout Compatibility

  • Impact: HIGH - Performance depends on zero-copy access to Fortran arrays
  • Probability: LOW - f90wrap generally preserves memory layout correctly
  • Mitigation:
    • Comprehensive memory layout validation tests in Performance Validation Framework
    • Memory stride verification to ensure SoA patterns are preserved
    • Fallback: Memory copying with performance warning if zero-copy fails
    • Early validation with small test arrays before scaling to full implementation

Risk 3: Performance Overhead

  • Impact: MEDIUM - <5% overhead requirement is strict
  • Probability: MEDIUM - Python function call overhead may accumulate
  • Mitigation:
    • Minimize Python-to-Fortran call frequency (batch operations, not per-particle)
    • Profile critical paths early and optimize hotspots
    • Use NumPy vectorized operations where possible
    • Consider Cython optimization for performance-critical wrapper code

Risk 4: OpenMP Thread Safety

  • Impact: MEDIUM - Existing OpenMP may not be thread-safe with Python
  • Probability: LOW - Existing pysimple likely already handles this correctly
  • Mitigation:
    • Validate OpenMP behavior through existing pysimple interface first
    • Implement thread-local storage for any Python state if needed
    • Document OpenMP thread count configuration clearly
    • Test with various thread counts to ensure deterministic results

SCHEDULE RISKS

Risk 5: f90wrap Learning Curve

  • Impact: MEDIUM - May delay initial implementation
  • Probability: MEDIUM - Complex f90wrap interfaces require expertise
  • Mitigation:
    • Start with existing pysimple module analysis to understand current patterns
    • Allocate extra time in Week 1 for f90wrap investigation
    • Consult f90wrap documentation and examples early
    • Consider bringing in f90wrap expertise if needed

Risk 6: Integration Testing Complexity

  • Impact: MEDIUM - Golden record validation may be time-consuming
  • Probability: MEDIUM - Ensuring bit-identical results is non-trivial
  • Mitigation:
    • Develop automated golden record test framework early
    • Start with simple test cases and build complexity gradually
    • Use existing examples/test cases as validation baseline
    • Plan extra time in Week 4 for comprehensive validation

QUALITY RISKS

Risk 7: Memory Leaks in Long-Running Simulations

  • Impact: MEDIUM - Could limit practical usability for large simulations
  • Probability: LOW - Proper resource management should prevent this
  • Mitigation:
    • Implement explicit resource cleanup in ParticleBatch.del()
    • Memory profiling tests for long-running batch processing
    • Clear documentation of memory management best practices
    • Stress testing with multiple batch creation/destruction cycles

Risk 8: API Usability Issues

  • Impact: MEDIUM - Poor API design could limit adoption
  • Probability: LOW - Design is based on proven batch processing patterns
  • Mitigation:
    • Early user feedback through prototype demonstrations
    • Comprehensive documentation with examples
    • Follow established scientific Python API patterns (NumPy, SciPy style)
    • Iterative refinement based on initial usage

Opportunity Analysis

PERFORMANCE OPPORTUNITIES

Opportunity 1: GPU Acceleration Foundation

  • Benefit: HIGH - SoA layout is optimal for future GPU implementation
  • Effort: LOW - Architecture already designed for GPU compatibility
  • Implementation:
    • Current SoA design provides perfect foundation for CUDA/OpenCL
    • Memory layout enables coalesced GPU memory access patterns
    • Thread-per-particle mapping natural for GPU kernels
    • Zero-copy GPU memory transfers with CuPy integration

Opportunity 2: Advanced Vectorization

  • Benefit: MEDIUM - NumPy vectorized operations on result arrays
  • Effort: LOW - Natural extension of batch processing design
  • Implementation:
    • Batch statistics calculations using vectorized NumPy operations
    • Efficient post-processing of large result arrays
    • Integration with SciPy for advanced statistical analysis
    • Parallel analysis across particle batches

Opportunity 3: Memory Optimization

  • Benefit: MEDIUM - More efficient memory usage for large simulations
  • Effort: MEDIUM - Requires careful design of streaming interfaces
  • Implementation:
    • Memory-mapped result files for out-of-core processing
    • Lazy loading of particle data from HDF5 files
    • Memory pool allocation for repeated batch processing
    • Automatic garbage collection tuning for large datasets

EFFICIENCY OPPORTUNITIES

Opportunity 4: Scientific Computing Ecosystem Integration

  • Benefit: HIGH - Seamless integration with broader scientific Python ecosystem
  • Effort: LOW - Natural result of clean API design
  • Implementation:
    • Direct NumPy array interfaces enable matplotlib, seaborn visualization
    • Pandas DataFrame export for data analysis workflows
    • Jupyter notebook integration with rich display methods
    • Integration with scientific workflow tools (Dask, Zarr)

Opportunity 5: Configuration Management

  • Benefit: MEDIUM - More flexible and powerful parameter management
  • Effort: LOW - Natural extension of batch-oriented design
  • Implementation:
    • YAML/JSON configuration files replace rigid namelist format
    • Parameter sweeps and optimization workflows
    • Configuration validation and documentation
    • Integration with hyperparameter optimization libraries

Opportunity 6: Advanced I/O Formats

  • Benefit: MEDIUM - Better data interchange and analysis capabilities
  • Effort: MEDIUM - Requires implementing multiple format handlers
  • Implementation:
    • HDF5 with rich metadata for large dataset handling
    • NetCDF integration for climate/fusion community standards
    • Parquet format for big data analytics workflows
    • Direct integration with data visualization tools

INNOVATION OPPORTUNITIES

Opportunity 7: Machine Learning Integration

  • Benefit: HIGH - Enable AI/ML workflows on particle data
  • Effort: MEDIUM - Requires ML-friendly data formats and interfaces
  • Implementation:
    • PyTorch/TensorFlow tensor interfaces for neural network training
    • Feature extraction pipelines for orbit classification
    • Transfer learning from simulation to experimental data
    • Reinforcement learning for control optimization

Opportunity 8: Real-Time Visualization

  • Benefit: MEDIUM - Interactive exploration of large simulation datasets
  • Effort: MEDIUM - Requires efficient streaming visualization
  • Implementation:
    • Bokeh/Plotly integration for interactive web visualizations
    • Real-time Poincaré section plotting during simulation
    • 3D trajectory visualization with ParaView integration
    • Progressive rendering for million-particle datasets

Opportunity 9: Cloud Computing Integration

  • Benefit: MEDIUM - Enable large-scale distributed simulations
  • Effort: HIGH - Requires distributed computing framework design
  • Implementation:
    • Dask integration for cluster computing
    • AWS/GCP batch processing workflows
    • Container-based deployment with optimal dependencies
    • Automatic scaling based on computational demand

Implementation Quality Standards

1. Atomic Refactoring Principle

Each implementation step must:

  • Preserve existing simple.x functionality exactly
  • Pass all golden record tests unchanged
  • Allow rollback if issues discovered
  • Be reviewable in isolation

2. Convention Consistency

  • Fortran: Strict typename_t convention throughout
  • Python: Strict PEP8 convention throughout
  • Translation: Automatic and transparent
  • Documentation: Consistent with target convention

3. Performance Validation: HPC Characteristics Preservation

SoA Layout Validation

# Verify existing SoA patterns are preserved and accessible
def validate_soa_performance():
    import pysimple
    import numpy as np
    import time
    
    # Test 1: SoA memory layout efficiency
    n_particles = 1_000_000
    batch = simple.ParticleBatch(n_particles)
    
    # Verify contiguous memory layout (existing pattern)
    positions = batch.positions  # Shape: (5, n_particles)
    assert positions.flags.c_contiguous, "SoA layout must be contiguous"
    
    # Test 2: Column access performance (existing cache pattern)
    start_time = time.time()
    for i in range(min(1000, n_particles)):
        particle_coords = positions[:, i]  # Cache-friendly column access
    column_access_time = time.time() - start_time
    
    # Test 3: Row access performance (should be slower)
    start_time = time.time()
    for j in range(5):
        coord_array = positions[j, :1000]  # Row access
    row_access_time = time.time() - start_time
    
    # Validate existing cache-friendly pattern is preserved
    assert column_access_time < row_access_time * 0.5, "Column access should be faster (SoA benefit)"

# Test 4: OpenMP scaling validation
def validate_openmp_scaling():
    """Verify existing OpenMP performance is preserved"""
    import pysimple
    import os
    
    results_1_thread = {}
    results_8_threads = {}
    
    # Test with 1 thread
    os.environ['OMP_NUM_THREADS'] = '1'
    start_time = time.time()
    result_1 = simple.trace_orbits(particles, tmax=100.0)
    results_1_thread['time'] = time.time() - start_time
    
    # Test with 8 threads 
    os.environ['OMP_NUM_THREADS'] = '8'
    start_time = time.time()
    result_8 = simple.trace_orbits(particles, tmax=100.0)
    results_8_threads['time'] = time.time() - start_time
    
    # Verify scaling efficiency (existing OpenMP patterns)
    scaling_efficiency = results_1_thread['time'] / (8 * results_8_threads['time'])
    assert scaling_efficiency > 0.6, f"OpenMP scaling too low: {scaling_efficiency}"
    
    # Verify results are identical (deterministic with existing algorithms)
    np.testing.assert_array_equal(result_1.loss_times, result_8.loss_times)

# Test 5: Memory efficiency validation
def validate_memory_efficiency():
    """Ensure no memory overhead vs existing implementation"""
    import psutil
    import pysimple
    
    # Baseline memory usage
    process = psutil.Process()
    baseline_memory = process.memory_info().rss
    
    # Create batch using new API
    n_particles = 100_000
    batch = simple.ParticleBatch(n_particles)
    api_memory = process.memory_info().rss
    
    # Expected memory for SoA arrays: 5 * n_particles * 8 bytes + additional arrays
    expected_particle_memory = 5 * n_particles * 8  # positions
    expected_particle_memory += n_particles * 8     # loss_times
    expected_particle_memory += n_particles * 8     # trap_par
    expected_particle_memory += n_particles * 8     # perp_inv
    
    actual_memory_increase = api_memory - baseline_memory
    overhead_ratio = actual_memory_increase / expected_particle_memory
    
    assert overhead_ratio < 1.1, f"Memory overhead too high: {overhead_ratio:.2f}"

Golden Record Performance Testing

def validate_golden_record_performance():
    """Compare against existing Fortran implementation performance"""
    
    # Performance benchmarks must match existing simple.x within 5%
    fortran_time = benchmark_fortran_simple()  # Existing implementation
    python_time = benchmark_python_api()       # New API
    
    performance_ratio = python_time / fortran_time
    assert performance_ratio < 1.05, f"Python API too slow: {performance_ratio:.3f}x"
    
    # Results must be bit-identical
    fortran_results = load_fortran_results()
    python_results = run_python_equivalent() 
    
    np.testing.assert_array_equal(
        fortran_results['times_lost'], 
        python_results.loss_times,
        "Results must be identical to existing implementation"
    )

def benchmark_memory_scaling():
    """Validate memory usage scales linearly with particle count"""
    particle_counts = [10_000, 100_000, 1_000_000]
    memory_usage = []
    
    for n in particle_counts:
        process = psutil.Process()
        before = process.memory_info().rss
        
        batch = simple.ParticleBatch(n)
        batch.initialize_surface('wout.nc', s=0.9)
        
        after = process.memory_info().rss
        memory_usage.append((after - before) / n)  # Per-particle memory
        
        del batch
    
    # Memory per particle should be constant (linear scaling)
    assert all(abs(m - memory_usage[0]) / memory_usage[0] < 0.1 for m in memory_usage), \
        "Memory scaling must be linear"

GPU Preparation Validation

def validate_gpu_readiness():
    """Ensure SoA layout is optimal for future GPU acceleration"""
    import pysimple
    
    batch = simple.ParticleBatch(100_000)
    positions = batch.positions  # Shape: (5, n_particles)
    
    # Test 1: Contiguous memory for efficient GPU transfer
    assert positions.flags.c_contiguous, "Must be C-contiguous for GPU transfer"
    
    # Test 2: Coalesced memory access pattern
    # GPU threads would access positions[:, thread_id] - coalesced
    for thread_block in range(0, 1000, 256):  # Simulate GPU thread blocks
        thread_ids = range(thread_block, min(thread_block + 256, 1000))
        particle_data = positions[:, thread_ids]  # Coalesced access pattern
        assert particle_data.shape[0] == 5, "Coordinate access must be contiguous"
    
    # Test 3: Memory layout matches GPU optimization requirements
    stride_coords = positions.strides[0]  # Stride between coordinates
    stride_particles = positions.strides[1]  # Stride between particles  
    
    # For coalesced access, particle stride should be smaller
    assert stride_particles == 8, "Particle stride optimal for GPU coalescing" 
    assert stride_coords == 8 * positions.shape[1], "Coordinate stride correct for SoA"

Success Metrics

Phase 1 (Python Prototype)

  • Performance: <5% overhead vs direct Fortran
  • Usability: Single orbit in <5 lines of code
  • Isolation: Multiple simulations without interference
  • Golden Record: Exact equivalence with existing implementation

Phase 2 (Name Translation)

  • Transparent: Users unaware of translation layer
  • Complete: All API elements correctly mapped
  • Consistent: Predictable naming patterns
  • Performance: No measurable translation overhead

Phase 3 (Fortran Refactoring)

  • API Parity: Fortran matches Python interface patterns
  • Convention: Consistent typename_t throughout
  • Compatibility: simple.x unchanged externally
  • Quality: Modern Fortran patterns applied

Phase 4 (Convergence)

  • Thin Wrapper: Python primarily delegates to Fortran
  • Performance: Near-native Fortran performance through Python
  • Maintenance: Single source of truth for API patterns
  • User Experience: Consistent interface across languages

Benefits of This Approach

1. Risk Mitigation

  • Python prototype validates patterns before Fortran changes
  • Atomic refactoring allows safe rollback
  • Golden record testing prevents regressions
  • User feedback guides API design

2. Development Efficiency

  • Python enables rapid API iteration
  • Patterns proven before costly Fortran implementation
  • Incremental value delivery throughout process
  • Clear blueprint for Fortran modernization

3. Long-term Maintainability

  • Modern API patterns in both languages
  • Consistent naming conventions
  • Clean separation of concerns
  • Foundation for future enhancements

This design creates a pathway from the current Fortran-centric codebase to a modern, dual-language API while preserving performance and maintaining backward compatibility throughout the transition.