pyshmem

pyshmem provides named shared-memory streams for NumPy arrays and optional CUDA-backed PyTorch pipelines.

It is designed for applications that need a small, predictable API for moving numeric payloads between processes without rebuilding the same locking, metadata, and lifecycle rules around raw shared memory.

Why pyshmem

one API for CPU NumPy buffers and CUDA-backed tensors
cross-process write locking with explicit lock ownership
safe snapshot reads for CPU streams
explicit GPU performance mode or CPU-mirrored compatibility mode
tested lifecycle and recovery behavior across supported platforms

Installation

Install from PyPI:

pip install pyshmem

Optional extras:

pip install "pyshmem[test]"
pip install "pyshmem[gpu]"
pip install "pyshmem[docs]"

For local development from a checkout:

pip install -e ".[test]"

Quick Start

CPU stream

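import numpy as np
import pyshmem

# Create a named 4x4 float32 stream, then attach a second handle by name
writer = pyshmem.create("demo_frame", shape=(4, 4), dtype=np.float32)
reader = pyshmem.open("demo_frame")

writer.write(np.ones((4, 4), dtype=np.float32))
frame = reader.read()  # consistent snapshot of the latest completed write
next_frame = reader.read_new(timeout=1.0)  # wait up to 1 s for a new write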

GPU stream

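import numpy as np
import pyshmem

# gpu_device selects the CUDA device; cpu_mirror defaults to False here
writer = pyshmem.create(
    "demo_cuda",
    shape=(4, 4),
    dtype=np.float32,
    gpu_device="cuda:0",
)
writer.write(np.ones((4, 4), dtype=np.float32))

# Pass gpu_device on open to get a CUDA torch.Tensor view of the payload
reader = pyshmem.open("demo_cuda", gpu_device="cuda:0")
frame = reader.read()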

Public API

pyshmem.SharedMemory
pyshmem.create(name, *, shape, dtype=np.float32, size=None, gpu_device=None, cpu_mirror=None)
pyshmem.open(name, *, gpu_device=None)
pyshmem.unlink(name)
pyshmem.gpu_available()

SharedMemory instances expose metadata, locking, lifecycle, and IO methods:

name, shape, dtype, size, gpu_device, cpu_mirror, owner
count, write_time, write_sequence
acquire(timeout=None, poll_interval=1e-3)
release()
locked(timeout=None, poll_interval=1e-3)
write(value)
read(safe=True, poll_interval=1e-6)
read_new(timeout=None, safe=True, poll_interval=1e-5)
clear()
close()
unlink()
delete()
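
As a minimal sketch of how the locking and IO methods above might be combined, the helper below holds the write lock while it reduces the live buffer and releases it on exit. The helper name `locked_sum` is hypothetical, and the sketch assumes that `read(safe=False)` returns an array-like object with a `sum()` method, as a NumPy array would.

```python
def locked_sum(shm, timeout=1.0):
    """Sum the live payload of a stream while holding its write lock.

    `shm` is expected to be a pyshmem.SharedMemory handle. Assumption: the
    object returned by read(safe=False) supports .sum(), as NumPy arrays do.
    """
    # locked() is used as a context manager, so the lock is released even
    # if the body raises.
    with shm.locked(timeout=timeout):
        live = shm.read(safe=False)  # no snapshot copy; use only under the lock
        return float(live.sum())
```

With the Quick Start `writer`, `locked_sum(writer)` should return the sum of the most recently written frame without copying the payload first.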

Behavior Notes

Writes are serialized with a cross-platform file lock backend.

read(safe=True) returns a consistent snapshot of the most recent completed write
read(safe=False) exposes the live backing storage and therefore must be called inside a with shm.locked(): block
close() releases only the local handle
unlink() destroys the underlying shared-memory stream
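
One common way to provide a consistent snapshot of this kind is a sequence counter that the writer bumps before and after each write. The sketch below illustrates that general technique with the standard library only. It is not pyshmem's implementation: the header layout and the function names are hypothetical, a single writer is assumed, and the memory-ordering concerns a production implementation must handle are ignored.

```python
import struct
from multiprocessing import shared_memory

SEQ = struct.Struct("<Q")  # hypothetical header: one 8-byte sequence counter
PAYLOAD_BYTES = 16

def write_payload(buf, payload):
    """Single-writer update: odd counter while writing, even when done."""
    (seq,) = SEQ.unpack_from(buf, 0)
    SEQ.pack_into(buf, 0, seq + 1)   # odd value: a write is in progress
    buf[SEQ.size:SEQ.size + len(payload)] = payload
    SEQ.pack_into(buf, 0, seq + 2)   # even value: the write is complete

def read_snapshot(buf, nbytes):
    """Copy the payload, retrying if a write overlapped the copy."""
    while True:
        (before,) = SEQ.unpack_from(buf, 0)
        if before % 2:               # writer is mid-update, try again
            continue
        data = bytes(buf[SEQ.size:SEQ.size + nbytes])  # private copy
        (after,) = SEQ.unpack_from(buf, 0)
        if after == before:          # counter unchanged: copy is consistent
            return data

seg = shared_memory.SharedMemory(create=True, size=SEQ.size + PAYLOAD_BYTES)
try:
    SEQ.pack_into(seg.buf, 0, 0)     # start with an even (idle) counter
    write_payload(seg.buf, b"A" * PAYLOAD_BYTES)
    snapshot = read_snapshot(seg.buf, PAYLOAD_BYTES)
finally:
    seg.close()
    seg.unlink()
```

The reader never blocks the writer in this scheme; it pays for consistency with a copy and an occasional retry, which is the trade-off that a safe read makes against a live, lock-protected view.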

Closed handles are guarded explicitly. After close(), methods such as read, write, acquire, clear, and metadata access raise a RuntimeError that instructs the caller to reopen the stream.

Missing segments raise FileNotFoundError with a pyshmem-specific message that points the caller toward pyshmem.create(...).
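
Because a missing segment surfaces as FileNotFoundError, an attach-or-create helper is straightforward to sketch. The name `open_or_create` is hypothetical, and the pyshmem module is passed in as `api` only so the helper can be exercised with a stand-in. The sketch is not race-free: two processes may both miss on open and then both call create, and how pyshmem.create behaves for an existing name is not covered here.

```python
def open_or_create(api, name, *, shape, dtype):
    """Attach to stream `name`, creating it if it does not exist yet.

    `api` is the imported pyshmem module (or any object exposing compatible
    open/create callables).
    """
    try:
        return api.open(name)
    except FileNotFoundError:
        # Documented behaviour: a missing segment raises FileNotFoundError.
        return api.create(name, shape=shape, dtype=dtype)
```

A call might look like `open_or_create(pyshmem, "demo_frame", shape=(4, 4), dtype=np.float32)`.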

GPU Modes

GPU-backed streams have two deliberately different operating modes.

Performance mode:

pyshmem.create(..., gpu_device="cuda:N") defaults to cpu_mirror=False
avoids CPU mirror maintenance on every write
optimized for GPU-heavy pipelines where throughput matters most

Compatibility mode:

pyshmem.create(..., gpu_device="cuda:N", cpu_mirror=True) keeps the CPU mirror updated
allows CPU-side payload reads and stronger safe-snapshot semantics under concurrent writes

Important attachment rule:

pass gpu_device="cuda:N" to pyshmem.open(...) whenever the caller needs a CUDA torch.Tensor view
opening a GPU stream without gpu_device still allows metadata inspection and lock management, but payload reads require either a GPU attachment or cpu_mirror=True
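
The attachment rule can be sketched as a small helper that requests a GPU attachment only when CUDA is usable in the current process. The helper name `open_payload_reader` and its `device` default are hypothetical, and the sketch assumes `gpu_available()` returns a truthy value when a CUDA attachment is possible. The pyshmem module is passed in as `api` so the logic can be tested with a stand-in.

```python
def open_payload_reader(api, name, device="cuda:0"):
    """Open `name`, attaching on the GPU when one is available.

    `api` is the imported pyshmem module. Without a GPU attachment, payload
    reads only work if the creator passed cpu_mirror=True; otherwise the
    handle is limited to metadata inspection and lock management.
    """
    if api.gpu_available():
        return api.open(name, gpu_device=device)
    return api.open(name)
```

A producer that wants CPU-only consumers to read payloads would therefore create the stream in compatibility mode, with `cpu_mirror=True`.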

Platform Notes

Windows limitation

Windows inherits a hard limitation from multiprocessing.shared_memory: the operating system deletes a shared-memory block as soon as the last handle to it is closed.

That means the following behaviors are unsupported on Windows:

a segment outliving its creator when no other process still has it open
close() followed by pyshmem.open(...) when that close() dropped the final live handle

Those behaviors remain supported on POSIX platforms.
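
The underlying difference can be observed with the standard library alone. This sketch uses multiprocessing.shared_memory directly rather than pyshmem: it closes the only handle to a block without unlinking it and then tries to attach again by name.

```python
from multiprocessing import shared_memory

seg = shared_memory.SharedMemory(create=True, size=64)
name = seg.name
seg.close()  # drop the only handle without unlinking

try:
    again = shared_memory.SharedMemory(name=name)
except FileNotFoundError:
    survived = False   # expected on Windows: the block died with its last handle
else:
    survived = True    # expected on POSIX: the block persists until unlink()
    again.close()
    again.unlink()     # clean up the block explicitly

print("segment survived close():", survived)
```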

Testing

Install test dependencies and run the CPU suite:

pip install -e ".[test]"
pytest -m cpu

Run the CUDA suite on a GPU machine:

pip install -e ".[test,gpu]"
pytest -m gpu

The repository also includes benchmark-marked tests:

pytest -m "cpu and benchmark" -q -s
pytest tests/test_benchmark.py -m "gpu and benchmark" -q -s

GitHub-hosted runners do not provide CUDA by default, so the CUDA workflow is manual and targets either a self-hosted GPU runner or a larger GitHub runner with CUDA support.

Performance

The benchmark suite measures both raw shared-memory IO and matrix-vector multiply pipelines that keep the matrix in shared memory.

Two GPU MVM pipeline variants are covered:

host-upload pipeline: the vector payload is created in NumPy and uploaded each iteration
device-resident pipeline: the vector payload is produced directly on GPU each iteration

The CPU benchmark target remains 50 kHz for a 128x128 round trip. Hard enforcement is opt-in because hosted CI is not a reliable performance lab.
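
A rate target with opt-in enforcement might be checked along the following lines. This is a sketch, not the repository's benchmark code: `measure_hz`, `check_target`, and the `enforce` flag are hypothetical names, and the warmup count is arbitrary. The 1.5 second default mirrors the minimum timed duration used for the measured results.

```python
import time

def measure_hz(fn, *, warmup=100, min_seconds=1.5):
    """Call fn repeatedly and return the achieved iterations per second."""
    for _ in range(warmup):          # untimed warmup iterations
        fn()
    iterations = 0
    start = time.perf_counter()
    while True:
        fn()
        iterations += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:   # run long enough to smooth out noise
            return iterations / elapsed

def check_target(hz, target_hz=50_000.0, enforce=False):
    """Report whether hz meets the target; raise only when enforce is set."""
    ok = hz >= target_hz
    if enforce and not ok:
        raise AssertionError(f"{hz:.0f} Hz is below the {target_hz:.0f} Hz target")
    return ok
```

Here `fn` would wrap one write followed by one read of a 128x128 frame. Leaving `enforce=False` on hosted CI records the rate without failing the run.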

Measured Results

The following numbers were measured on a single machine with this configuration:

OS: Linux 6.17.0-14-generic x86_64
Python: 3.12.0
NumPy: 2.2.6
PyTorch: 2.10.0+cu128
GPU: NVIDIA GeForce RTX 5090

Methodology:

float32 payloads throughout
each benchmark case used warmup iterations before timing
each timed case ran for at least 1.5 seconds to reduce one-off noise
IO throughput is computed from write plus read bytes per iteration
MVM throughput is reported both as pipeline rate and estimated GFLOP/s using $2n^2$ floating-point operations per matrix-vector multiply
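
The two throughput formulas can be written out directly. The function names below are hypothetical, and the use of decimal units (1 GB = 10^9 bytes) is inferred from the reported values rather than stated in the methodology. The rates in the example are taken from the 1000x1000 CPU rows of the result tables in this section.

```python
def payload_mib(n, itemsize=4):
    """Size of an n x n float32 payload in MiB."""
    return n * n * itemsize / 2**20

def io_gb_per_s(n, hz, itemsize=4):
    """Write-plus-read throughput in decimal GB/s for an n x n payload."""
    bytes_per_iteration = 2 * n * n * itemsize  # one write plus one read
    return bytes_per_iteration * hz / 1e9

def mvm_gflops(n, hz):
    """Estimated GFLOP/s, counting 2 * n^2 operations per multiply."""
    return 2 * n * n * hz / 1e9

print(round(payload_mib(1000), 3))           # 3.815
print(round(io_gb_per_s(1000, 9922.1), 2))   # 79.38
print(round(mvm_gflops(1000, 11124.9), 2))   # 22.25
```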

Important interpretation note:

GPU-backed segments now default to cpu_mirror=False
the fast GPU path avoids CPU mirror maintenance unless the creator explicitly asks for it with cpu_mirror=True
the stronger concurrent-read consistency contract is provided by the mirrored mode; the default no-mirror mode is optimized for throughput first
the GPU numbers below therefore reflect the optimized no-mirror path, which is the intended performance configuration

IO vs Image Size

Image size	Payload (MiB)	CPU roundtrip Hz	CPU IO (GB/s)	GPU roundtrip Hz	GPU IO (GB/s)
100x100	0.038	180311.2	14.42	36214.1	2.90
1000x1000	3.815	9922.1	79.38	5027.4	40.22
10000x10000	381.470	20.36	16.29	49.96	39.97

Shared-Memory MVM Pipeline

Host-upload GPU pipeline:

Matrix size	Matrix payload (MiB)	CPU pipeline Hz	CPU GFLOP/s	GPU pipeline Hz	GPU GFLOP/s
100x100	0.038	109844.4	2.20	26465.8	0.53
1000x1000	3.815	11124.9	22.25	22485.3	44.97
10000x10000	381.470	26.21	5.24	1299.3	259.86

Fully device-resident GPU pipeline:

Matrix size	Matrix payload (MiB)	GPU pipeline Hz	GPU GFLOP/s
100x100	0.038	30240.6	0.60
1000x1000	3.815	26733.6	53.47
10000x10000	381.470	1321.6	264.33

The updated results show the intended behavior for real GPU workloads:

tiny matrices like 100x100 are still dominated by launch and synchronization overhead, so the CPU remains about 4x faster there (109844.4 Hz vs 26465.8 Hz)
once the workload is large enough to matter, the no-mirror GPU path pulls ahead decisively
the host-upload GPU pipeline outperforms the CPU equivalent by about 2x at 1000x1000 and about 50x at 10000x10000 on this machine
keeping the vector generation on GPU improves the pipeline further, by about 14% at 100x100 and 19% at 1000x1000; the gain shrinks to under 2% at 10000x10000, where the multiply itself dominates the iteration time
