
Sandboxing libstempo #81

Open
vhaasteren wants to merge 18 commits into vallis:master from vhaasteren:sandbox

Conversation

@vhaasteren
Collaborator

@vhaasteren vhaasteren commented Oct 10, 2025

Add Sandbox Mode for Crash-Protected libstempo Usage

Summary

This PR introduces a comprehensive sandbox mode for libstempo that provides crash isolation and automatic retry capabilities. The sandbox runs each tempopulsar instance in a separate subprocess, preventing tempo2 crashes from taking down the main Python kernel. Especially in scripts that process many pulsars for (I)PTA purposes, the random, non-deterministic crashes that tempo2 tends to produce can be a pain, and the sandbox provides a frictionless workaround.

Notes

I have not yet tested this PR on a large number of devices, but it is fully operational. The tests verify against a native libstempo instance, all GitHub runners finish, and the demo notebooks work. Still, I am looking for bug reports! Only add_gwb is not compatible, because that function calls tempo2 natively.

Suitability for libstempo

I originally wrote this sandbox as part of another, more IPTA-focused project of mine, which I will release soon. If the sandbox is deemed 'out of scope' for libstempo, I'd be happy to make it available through other means, but to me the libstempo repo seems the most sensible home.

Key Features

🛡️ Crash Isolation

  • Segfaults in tempo2 only kill the worker process, not your main kernel
  • Automatic worker recycling prevents memory leaks and resource accumulation
  • Process isolation ensures stability for long-running analyses
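
The isolation principle can be illustrated with a minimal sketch using plain multiprocessing (this is not libstempo's actual worker code, and `crashy`/`run_isolated` are hypothetical names): a segfault in the child only sets its exit code, while the parent process survives.

```python
import multiprocessing as mp
import os
import signal

def crashy():
    # simulate a tempo2 segfault inside the worker
    os.kill(os.getpid(), signal.SIGSEGV)

def run_isolated(target):
    ctx = mp.get_context("fork")  # POSIX-only; libstempo targets Linux/macOS
    p = ctx.Process(target=target)
    p.start()
    p.join()
    return p.exitcode  # negative signal number if the worker crashed

code = run_isolated(crashy)
# the main process is still alive here; only the worker died
```

The sandbox builds on the same idea, adding RPC and automatic worker restarts on top.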

🔄 Automatic Retry & Recovery

  • Built-in retry logic for transient failures
  • Configurable retry policies (constructor retries, call timeouts)
  • Automatic worker recycling based on age, call count, or memory usage

🌍 Environment Flexibility

  • Support for conda environments, virtual environments, and Rosetta (macOS)
  • Explicit Python path specification

🚀 Proactive TOA Handling

  • Automatically handles large TOA files to prevent "Too many TOAs" errors
  • Bulk loading capabilities for processing many pulsars

Usage

Basic Usage (Drop-in Replacement)

from libstempo.sandbox import tempopulsar

# Same API as regular tempopulsar
psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", dofit=False)
residuals = psr.residuals()
design_matrix = psr.designmatrix()

# Can just pass sandbox tempopulsar to native toasim as usual
import libstempo as lt
lt.make_ideal(psr)
lt.add_efac(psr, efac=1.0, seed=1234)

Advanced Configuration

from libstempo.sandbox import tempopulsar, Policy, configure_logging

# Configure logging and retry policies
configure_logging(level="DEBUG", log_file="tempo2.log")
policy = Policy(
    ctor_retry=5,           # Retry constructor 5 times on failure
    call_timeout_s=300.0,    # 5-minute timeout per RPC call
    max_calls_per_worker=1000,  # Recycle worker after 1000 calls
    max_age_s=3600,          # Recycle worker after 1 hour
    rss_soft_limit_mb=2048   # Recycle worker if memory exceeds 2GB
)

psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", policy=policy)

Bulk Processing

from libstempo.sandbox import load_many, Policy

pairs = [("J1713.par", "J1713.tim"), ("J1909.par", "J1909.tim"), ...]
policy = Policy(ctor_retry=3, call_timeout_s=120.0)

ok_by_name, retried_by_name, failed_list = load_many(pairs, policy=policy, parallel=8)
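
The fan-out/collect pattern behind bulk loading can be sketched generically. This is a hypothetical `bulk_load` with a dummy `load_one` standing in for tempopulsar construction; a thread pool is used for brevity, whereas the real sandbox dispatches to worker subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_one(pair):
    # stand-in for constructing a sandboxed tempopulsar
    par, tim = pair
    if "bad" in par:  # simulate a tempo2 failure
        raise RuntimeError(f"failed to load {par}")
    return f"psr:{par}"

def bulk_load(pairs, parallel=4):
    pairs = list(pairs)  # materialize generators up front
    ok, failed = {}, []
    with ThreadPoolExecutor(max_workers=parallel) as ex:
        futures = {ex.submit(load_one, p): p for p in pairs}
        for fut in as_completed(futures):
            pair = futures[fut]
            try:
                ok[pair[0]] = fut.result()
            except Exception:
                failed.append(pair)
    return ok, failed
```

Collecting failures instead of raising lets a long batch run to completion and report the broken pulsars at the end.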

Performance Characteristics

Based on comprehensive performance testing with J1909-3744_NANOGrav_dfg+12 data:

  • Initialization: ~9x overhead (amortized over long-running applications)
  • Computational operations: ~1.2x overhead for residuals(), ~1.0x for designmatrix()
  • Attribute access: Higher overhead due to RPC, but typically not a bottleneck

The overhead is primarily due to inter-process communication, which is the price of process isolation. For heavy computations, the overhead becomes negligible relative to the actual work.

Implementation Details

Architecture

  • JSON-RPC over stdio: Robust communication protocol between main process and workers
  • Worker process management: Automatic lifecycle management with recycling policies
  • Data serialization: Efficient NumPy array copying to prevent memory sharing issues
  • Error handling: Exception types (Tempo2Error, Tempo2Crashed, Tempo2Timeout)
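
The stdio framing idea can be sketched with newline-delimited JSON (illustrative helpers only, not the PR's actual wire format or field names):

```python
import json

def encode_frame(msg_id, method, params):
    # one JSON object per line keeps stdio framing trivial to parse
    payload = {"id": msg_id, "method": method, "params": params}
    return (json.dumps(payload) + "\n").encode()

def decode_frame(line: bytes) -> dict:
    return json.loads(line)

frame = encode_frame(1, "call", {"name": "residuals", "args": []})
request = decode_frame(frame)
```

One object per line means the reader never needs a length prefix: it just splits the worker's stdout on newlines.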

New Files Added

  • libstempo/sandbox.py (1,232 lines): Main sandbox implementation
  • libstempo/tim_file_analyzer.py (548 lines): TOA file analysis utilities
  • tests/test_sandbox.py (94 lines): Comprehensive test suite
  • Updated README.md with sandbox documentation

Testing

  • ✅ All existing tests pass
  • ✅ Comprehensive sandbox-specific tests added
  • ✅ Performance analysis completed
  • ✅ Cross-platform compatibility (somewhat) verified
  • ❌ GWB simulation with add_gwb is not supported, because it calls tempo2 natively (won't fix)
  • ✅ Fixed all bugs in half a year of usage

When to Use

Use Sandbox when:

  • Stability is critical (production environments)
  • Working with potentially unstable tempo2/libstempo versions
  • Need crash protection for long-running processes
  • Interactive environments (Jupyter notebooks)
  • Processing many pulsars in batch

Use Direct when:

  • Performance is critical
  • Development/testing environments
  • Stable, well-tested code

Backward Compatibility

This PR is made to be fully backward compatible. The sandbox is opt-in and doesn't affect existing code. All existing libstempo functionality remains unchanged.

Documentation

  • Updated README.md with comprehensive sandbox documentation
  • Inline docstrings with usage examples

Some irony

The GitHub CI occasionally fails on the regular (non-sandboxed) tests because of random segmentation faults.

- Break long lines in sandbox.py to fit 120-char limit
- Add noqa comments for imports in __init__.py
- Format tim_file_analyzer.py with black
- Protocol
  - Add hello proto_version=1.2 and capabilities: get_kind, dir, setitem, get_slice, path_access
  - Non-exceptional attribute discovery (get-kind) and optional dir RPC

- Array semantics
  - Introduce write-through ArrayProxy for numpy-backed attrs (stoas, toaerrs, freqs)
  - Reads expose plain numpy via __array__; __repr__/__str__/__getattr__ delegate to ndarray
  - Writes route via setitem RPC; add get_slice RPC to avoid fetching whole arrays for reads
  - Guard __len__ for 0-d; support fancy/masked indexing; optional safe dtype cast on set
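
The write-through idea can be sketched with a toy proxy (list-backed here to stay stdlib-only; the real _ArrayProxy wraps NumPy arrays and routes writes over the setitem RPC):

```python
class ArrayProxy:
    """Write-through proxy: reads fetch current data, writes go to a remote setter."""

    def __init__(self, fetch, setitem):
        self._fetch = fetch      # callable returning the current remote array
        self._setitem = setitem  # callable(index, value) applying the write remotely

    def __getitem__(self, idx):
        return self._fetch()[idx]

    def __setitem__(self, idx, value):
        self._setitem(idx, value)  # route the write to the worker

    def __len__(self):
        return len(self._fetch())

# a toy "remote" store standing in for state held by the worker process
store = [1.0, 2.0, 3.0]
proxy = ArrayProxy(lambda: list(store), lambda i, v: store.__setitem__(i, v))
proxy[1] = 9.0  # write-through: the remote store is updated
```

Because reads always go through `fetch`, the proxy never holds a stale local copy after a write.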

- Dotted paths
  - Gate first-hop mapping access to mapping-like (__getitem__) objects only
  - Support psr['PAR'].val/err/fit/set via dotted-path resolution

- Process lifecycle & IO
  - Popen: pass env, close_fds, start_new_session, Windows CREATE_NEW_PROCESS_GROUP when available
  - Group kill with POSIX killpg; Windows terminate/kill fallbacks
  - Thread-safe RPC framing with a per-worker send lock
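
The group-kill logic can be illustrated with a minimal POSIX sketch, where a sleeping child stands in for a tempo2 worker:

```python
import os
import signal
import subprocess
import sys

# start the worker as the leader of a new session, so the whole
# process group (including any grandchildren) can be signalled at once
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    close_fds=True,
    start_new_session=True,
)
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # kill the group, not just the child
proc.wait()
```

`start_new_session` is what makes `killpg` safe: without it, the worker would share the parent's process group and the signal would hit the main process too.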

- Errors & logging
  - Stderr ring with optional tail included in exceptions; cap tail by bytes (16KiB) and lines
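
A stderr ring capped by both line count and total bytes can be sketched with a deque (class and limits here are hypothetical, not the PR's actual implementation):

```python
from collections import deque

class StderrRing:
    """Keep the last N lines of worker stderr, with the tail capped by bytes."""

    def __init__(self, max_lines=200, max_bytes=16 * 1024):
        self._lines = deque(maxlen=max_lines)  # deque drops oldest lines itself
        self._max_bytes = max_bytes

    def feed(self, line: bytes):
        self._lines.append(line)

    def tail(self) -> bytes:
        out, total = [], 0
        for line in reversed(self._lines):  # walk newest-first up to the byte cap
            total += len(line)
            if total > self._max_bytes:
                break
            out.append(line)
        return b"".join(reversed(out))
```

Including this tail in raised exceptions means a crash report carries the worker's last words without unbounded memory growth.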

- Tests
  - Add unit tests comparing sandbox vs native for parameter mapping and TOA edits+fit
  - Full suite green: array writes now update worker; residuals match native after fit
@vhaasteren vhaasteren marked this pull request as ready for review October 12, 2025 19:33
@vhaasteren vhaasteren requested a review from mattpitkin October 13, 2025 12:32
@vhaasteren vhaasteren self-assigned this Oct 14, 2025
@vhaasteren
Collaborator Author

Hi @mattpitkin ,

It's been a while. I don't know how many people use this, but I do. I have fixed the occasional bug, and for me it's very usable, especially in unit tests that would trip the GitHub CI in other packages.

Do you think we can merge it?

Collaborator

@mattpitkin mattpitkin left a comment


Looks good to me. I'd be happy to merge this, especially as it is standalone from the normal usage and shouldn't affect anything else. However, I'd like to give @vallis a couple of days to comment before we hit merge.

@vhaasteren
Collaborator Author

Yes, that's reasonable. And indeed, it's fully separate, so nothing will break.

Thanks!

Bug fixes:
- Fix RSS calculation always returning 0 on Linux (integer division order)
- Fix load_many silently producing empty results when given a generator
- Fix _rpc retry not handling setitem (would misroute as call after crash)
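
The RSS bug is a classic integer-division-order pitfall; a hypothetical sketch (names and the 4 KiB page size are illustrative, not the PR's actual code):

```python
def rss_mb(pages: int, page_size: int = 4096) -> int:
    # correct: multiply before the integer division
    return pages * page_size // (1024 * 1024)

def rss_mb_buggy(pages: int, page_size: int = 4096) -> int:
    # dividing first truncates to zero for any realistic page count
    return pages // (1024 * 1024) * page_size
```

With // binding left-to-right, the buggy ordering reports 0 MB for every worker, so the rss_soft_limit_mb recycling policy never fires.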

Structural improvements:
- Decompose _worker_stdio_main (430→60 line loop) into handler functions
- Only recycle worker on Crashed/Timeout, not on all Tempo2Error subclasses
- Route all attribute access through _get_kind_safe for crash recovery
- Extract _dispatch_rpc so try/except retry paths are identical
- Factor duplicated venv search paths into _venv_search_paths()

Cleanup:
- Remove dead _ArrayProxy.__del__ (referenced non-existent self._wp)
- Fix _rpc_lock TOCTOU race by moving init to _WorkerProc.__init__
- Remove double-retry in load_many (tempopulsar already retries internally)
- Make policy an explicit __init__ parameter instead of kwargs.pop
- Move threading/collections to module-level imports
- Drop Windows compat (libstempo is Linux/macOS only)
- Replace blanket flake8 noqa: E501 with project .flake8 config

New tests (16):
- load_many: generator input, bad files, empty input
- Live crash recovery: SIGKILL worker + auto-recover, state preservation
- _ArrayProxy arithmetic: mul, add, sub, div, np.asarray, len/shape
- Explicit policy parameter