
Sandboxing libstempo #81

Open
vhaasteren wants to merge 18 commits into vallis:master from vhaasteren:sandbox

Conversation

@vhaasteren
Collaborator

@vhaasteren vhaasteren commented Oct 10, 2025

Add Sandbox Mode for Crash-Protected libstempo Usage

Summary

This PR introduces a comprehensive sandbox mode for libstempo that provides crash isolation and automatic retry capabilities. The sandbox runs each tempopulsar instance in a separate subprocess, preventing tempo2 crashes from taking down the main Python kernel. Especially in scripts that process many pulsars for (I)PTA purposes, the random, non-deterministic crashes that tempo2 tends to produce can be a pain, and the sandbox provides a frictionless workaround.

Notes

I have not yet tested this PR on a large number of devices, but it is fully operational. The tests verify against a native libstempo instance, all GitHub runners finish, and the demo notebooks work. Still, I am looking for bug reports! Only add_gwb is not compatible, because that function calls tempo2 natively.

Suitability for libstempo

I originally wrote this sandbox as part of another, more IPTA-focused project of mine, which I will release soon. If the sandbox is deemed 'out of scope' for libstempo, I'd be happy to make it available through other means, but to me the libstempo repo seems the most sensible home.

Key Features

🛡️ Crash Isolation

  • Segfaults in tempo2 only kill the worker process, not your main kernel
  • Automatic worker recycling prevents memory leaks and resource accumulation
  • Process isolation ensures stability for long-running analyses
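
The isolation principle can be illustrated with a minimal sketch using plain multiprocessing (this is not libstempo's actual worker code, and `crashy`/`run_isolated` are hypothetical names): a segfault in the child only sets its exit code, while the parent process survives.

```python
import multiprocessing as mp
import os
import signal

def crashy():
    # simulate a tempo2 segfault inside the worker
    os.kill(os.getpid(), signal.SIGSEGV)

def run_isolated(target):
    ctx = mp.get_context("fork")  # POSIX-only; libstempo targets Linux/macOS
    p = ctx.Process(target=target)
    p.start()
    p.join()
    return p.exitcode  # negative signal number if the worker crashed

code = run_isolated(crashy)
# the main process is still alive here; only the worker died
```

The sandbox builds on the same idea, adding RPC and automatic worker restarts on top.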

🔄 Automatic Retry & Recovery

  • Built-in retry logic for transient failures
  • Configurable retry policies (constructor retries, call timeouts)
  • Automatic worker recycling based on age, call count, or memory usage

🌍 Environment Flexibility

  • Support for conda environments, virtual environments, and Rosetta (macOS)
  • Explicit Python path specification

🚀 Proactive TOA Handling

  • Automatically handles large TOA files to prevent "Too many TOAs" errors
  • Bulk loading capabilities for processing many pulsars

Usage

Basic Usage (Drop-in Replacement)

from libstempo.sandbox import tempopulsar

# Same API as regular tempopulsar
psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", dofit=False)
residuals = psr.residuals()
design_matrix = psr.designmatrix()

# Can just pass sandbox tempopulsar to native toasim as usual
import libstempo as lt
lt.make_ideal(psr)
lt.add_efac(psr, efac=1.0, seed=1234)

Advanced Configuration

from libstempo.sandbox import tempopulsar, Policy, configure_logging

# Configure logging and retry policies
configure_logging(level="DEBUG", log_file="tempo2.log")
policy = Policy(
    ctor_retry=5,           # Retry constructor 5 times on failure
    call_timeout_s=300.0,    # 5-minute timeout per RPC call
    max_calls_per_worker=1000,  # Recycle worker after 1000 calls
    max_age_s=3600,          # Recycle worker after 1 hour
    rss_soft_limit_mb=2048   # Recycle worker if memory exceeds 2GB
)

psr = tempopulsar(parfile="J1713.par", timfile="J1713.tim", policy=policy)

Bulk Processing

from libstempo.sandbox import load_many, Policy

pairs = [("J1713.par", "J1713.tim"), ("J1909.par", "J1909.tim"), ...]
policy = Policy(ctor_retry=3, call_timeout_s=120.0)

ok_by_name, retried_by_name, failed_list = load_many(pairs, policy=policy, parallel=8)
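
The fan-out/collect pattern behind bulk loading can be sketched generically. This is a hypothetical `bulk_load` with a dummy `load_one` standing in for tempopulsar construction; a thread pool is used for brevity, whereas the real sandbox dispatches to worker subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_one(pair):
    # stand-in for constructing a sandboxed tempopulsar
    par, tim = pair
    if "bad" in par:  # simulate a tempo2 failure
        raise RuntimeError(f"failed to load {par}")
    return f"psr:{par}"

def bulk_load(pairs, parallel=4):
    pairs = list(pairs)  # materialize generators up front
    ok, failed = {}, []
    with ThreadPoolExecutor(max_workers=parallel) as ex:
        futures = {ex.submit(load_one, p): p for p in pairs}
        for fut in as_completed(futures):
            pair = futures[fut]
            try:
                ok[pair[0]] = fut.result()
            except Exception:
                failed.append(pair)
    return ok, failed
```

Collecting failures instead of raising lets a long batch run to completion and report the broken pulsars at the end.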

Performance Characteristics

Based on comprehensive performance testing with J1909-3744_NANOGrav_dfg+12 data:

  • Initialization: ~9x overhead (amortized over long-running applications)
  • Computational operations: ~1.2x overhead for residuals(), ~1.0x for designmatrix()
  • Attribute access: Higher overhead due to RPC, but typically not a bottleneck

The overhead is primarily due to inter-process communication, which is the price of process isolation. For heavy computations, the overhead becomes negligible relative to the actual work.

Implementation Details

Architecture

  • JSON-RPC over stdio: Robust communication protocol between main process and workers
  • Worker process management: Automatic lifecycle management with recycling policies
  • Data serialization: Efficient NumPy array copying to prevent memory sharing issues
  • Error handling: Exception types (Tempo2Error, Tempo2Crashed, Tempo2Timeout)
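
The stdio framing idea can be sketched with newline-delimited JSON (illustrative helpers only, not the PR's actual wire format or field names):

```python
import json

def encode_frame(msg_id, method, params):
    # one JSON object per line keeps stdio framing trivial to parse
    payload = {"id": msg_id, "method": method, "params": params}
    return (json.dumps(payload) + "\n").encode()

def decode_frame(line: bytes) -> dict:
    return json.loads(line)

frame = encode_frame(1, "call", {"name": "residuals", "args": []})
request = decode_frame(frame)
```

One object per line means the reader never needs a length prefix: it just splits the worker's stdout on newlines.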

New Files Added

  • libstempo/sandbox.py (1,232 lines): Main sandbox implementation
  • libstempo/tim_file_analyzer.py (548 lines): TOA file analysis utilities
  • tests/test_sandbox.py (94 lines): Comprehensive test suite
  • Updated README.md with sandbox documentation

Testing

  • ✅ All existing tests pass
  • ✅ Comprehensive sandbox-specific tests added
  • ✅ Performance analysis completed
  • ✅ Cross-platform compatibility (somewhat) verified
  • ❌ GWB simulation with add_gwb is not supported, because it calls tempo2 natively (won't fix)
  • ✅ Fixed all bugs in half a year of usage

When to Use

Use Sandbox when:

  • Stability is critical (production environments)
  • Working with potentially unstable tempo2/libstempo versions
  • Need crash protection for long-running processes
  • Interactive environments (Jupyter notebooks)
  • Processing many pulsars in batch

Use Direct when:

  • Performance is critical
  • Development/testing environments
  • Stable, well-tested code

Backward Compatibility

This PR is made to be fully backward compatible. The sandbox is opt-in and doesn't affect existing code. All existing libstempo functionality remains unchanged.

Documentation

  • Updated README.md with comprehensive sandbox documentation
  • Inline docstrings with usage examples

Some irony

The GitHub CI occasionally fails on the regular (non-sandboxed) tests because of random segmentation faults.

- Break long lines in sandbox.py to fit 120-char limit
- Add noqa comments for imports in __init__.py
- Format tim_file_analyzer.py with black
- Protocol
  - Add hello proto_version=1.2 and capabilities: get_kind, dir, setitem, get_slice, path_access
  - Non-exceptional attribute discovery (get-kind) and optional dir RPC

- Array semantics
  - Introduce write-through ArrayProxy for numpy-backed attrs (stoas, toaerrs, freqs)
  - Reads expose plain numpy via __array__; __repr__/__str__/__getattr__ delegate to ndarray
  - Writes route via setitem RPC; add get_slice RPC to avoid fetching whole arrays for reads
  - Guard __len__ for 0-d; support fancy/masked indexing; optional safe dtype cast on set
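
The write-through idea can be sketched with a toy proxy (list-backed here to stay stdlib-only; the real _ArrayProxy wraps NumPy arrays and routes writes over the setitem RPC):

```python
class ArrayProxy:
    """Write-through proxy: reads fetch current data, writes go to a remote setter."""

    def __init__(self, fetch, setitem):
        self._fetch = fetch      # callable returning the current remote array
        self._setitem = setitem  # callable(index, value) applying the write remotely

    def __getitem__(self, idx):
        return self._fetch()[idx]

    def __setitem__(self, idx, value):
        self._setitem(idx, value)  # route the write to the worker

    def __len__(self):
        return len(self._fetch())

# a toy "remote" store standing in for state held by the worker process
store = [1.0, 2.0, 3.0]
proxy = ArrayProxy(lambda: list(store), lambda i, v: store.__setitem__(i, v))
proxy[1] = 9.0  # write-through: the remote store is updated
```

Because reads always go through `fetch`, the proxy never holds a stale local copy after a write.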

- Dotted paths
  - Gate first-hop mapping access to mapping-like (__getitem__) objects only
  - Support psr['PAR'].val/err/fit/set via dotted-path resolution

- Process lifecycle & IO
  - Popen: pass env, close_fds, start_new_session, Windows CREATE_NEW_PROCESS_GROUP when available
  - Group kill with POSIX killpg; Windows terminate/kill fallbacks
  - Thread-safe RPC framing with a per-worker send lock
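
The group-kill logic can be illustrated with a minimal POSIX sketch, where a sleeping child stands in for a tempo2 worker:

```python
import os
import signal
import subprocess
import sys

# start the worker as the leader of a new session, so the whole
# process group (including any grandchildren) can be signalled at once
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    close_fds=True,
    start_new_session=True,
)
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # kill the group, not just the child
proc.wait()
```

`start_new_session` is what makes `killpg` safe: without it, the worker would share the parent's process group and the signal would hit the main process too.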

- Errors & logging
  - Stderr ring with optional tail included in exceptions; cap tail by bytes (16KiB) and lines
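
A stderr ring capped by both line count and total bytes can be sketched with a deque (class and limits here are hypothetical, not the PR's actual implementation):

```python
from collections import deque

class StderrRing:
    """Keep the last N lines of worker stderr, with the tail capped by bytes."""

    def __init__(self, max_lines=200, max_bytes=16 * 1024):
        self._lines = deque(maxlen=max_lines)  # deque drops oldest lines itself
        self._max_bytes = max_bytes

    def feed(self, line: bytes):
        self._lines.append(line)

    def tail(self) -> bytes:
        out, total = [], 0
        for line in reversed(self._lines):  # walk newest-first up to the byte cap
            total += len(line)
            if total > self._max_bytes:
                break
            out.append(line)
        return b"".join(reversed(out))
```

Including this tail in raised exceptions means a crash report carries the worker's last words without unbounded memory growth.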

- Tests
  - Add unit tests comparing sandbox vs native for parameter mapping and TOA edits+fit
  - Full suite green: array writes now update worker; residuals match native after fit
@vhaasteren vhaasteren marked this pull request as ready for review October 12, 2025 19:33
@vhaasteren vhaasteren requested a review from mattpitkin October 13, 2025 12:32
@vhaasteren vhaasteren self-assigned this Oct 14, 2025
@vhaasteren
Collaborator Author

Hi @mattpitkin ,

It's been a while. I don't know how many people use this, but I do. I have fixed the occasional bug, and for me it's very usable, especially in unit tests that would trip the GitHub CI in other packages.

Do you think we can merge it?

Collaborator

@mattpitkin mattpitkin left a comment


Looks good to me. I'd be happy to merge this, especially as it is standalone from the normal usage and shouldn't affect anything else. However, I'd like to give @vallis a couple of days to comment before we hit merge.

@vhaasteren
Collaborator Author

Yes, that's reasonable. And indeed, it's fully separate, so nothing will break.

Thanks!

Bug fixes:
- Fix RSS calculation always returning 0 on Linux (integer division order)
- Fix load_many silently producing empty results when given a generator
- Fix _rpc retry not handling setitem (would misroute as call after crash)
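
The RSS bug is a classic integer-division-order pitfall; a hypothetical sketch (names and the 4 KiB page size are illustrative, not the PR's actual code):

```python
def rss_mb(pages: int, page_size: int = 4096) -> int:
    # correct: multiply before the integer division
    return pages * page_size // (1024 * 1024)

def rss_mb_buggy(pages: int, page_size: int = 4096) -> int:
    # dividing first truncates to zero for any realistic page count
    return pages // (1024 * 1024) * page_size
```

With // binding left-to-right, the buggy ordering reports 0 MB for every worker, so the rss_soft_limit_mb recycling policy never fires.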

Structural improvements:
- Decompose _worker_stdio_main (430→60 line loop) into handler functions
- Only recycle worker on Crashed/Timeout, not on all Tempo2Error subclasses
- Route all attribute access through _get_kind_safe for crash recovery
- Extract _dispatch_rpc so try/except retry paths are identical
- Factor duplicated venv search paths into _venv_search_paths()

Cleanup:
- Remove dead _ArrayProxy.__del__ (referenced non-existent self._wp)
- Fix _rpc_lock TOCTOU race by moving init to _WorkerProc.__init__
- Remove double-retry in load_many (tempopulsar already retries internally)
- Make policy an explicit __init__ parameter instead of kwargs.pop
- Move threading/collections to module-level imports
- Drop Windows compat (libstempo is Linux/macOS only)
- Replace blanket flake8 noqa: E501 with project .flake8 config

New tests (16):
- load_many: generator input, bad files, empty input
- Live crash recovery: SIGKILL worker + auto-recover, state preservation
- _ArrayProxy arithmetic: mul, add, sub, div, np.asarray, len/shape
- Explicit policy parameter