diff --git a/docs/concepts/context.rst b/docs/concepts/context.rst new file mode 100644 index 00000000..d1b1e1a2 --- /dev/null +++ b/docs/concepts/context.rst @@ -0,0 +1,172 @@ +``ProtocolUnit`` execution ``Context`` +====================================== + +:class:`.Context` instances carry the execution environment for individual :class:`.ProtocolUnit` executions. +They are created by the execution engine just before a unit is excuted and discarded once the unit returns. +The class acts as a thin wrapper around two :class:`.StorageManager` objects (shared and permanent) and a scratch directory. + + +Why Context exists +------------------ + +``ProtocolUnit`` code frequently needs a few shared facilities: + +``scratch`` + A temporary directory that the unit can freely write to while it runs. + Files written here are considered ephemeral; the engine may delete them as + soon as the unit finishes. + +``shared`` + A :class:`.StorageManager` backed by + short–lived :class:`.ExternalStorage`. Use + this to hand large files to downstream units without serializing the + payloads through Python return values. + +``permanent`` + Another :class:`.StorageManager` targeting long–term storage. Results saved here + survive beyond the life of the :class:`.ProtocolDAG` run (for example for + inspection or for reuse in future extensions). + +``stdout`` / ``stderr`` + Optional directories where the engine captures subprocess output triggered + by the ``ProtocolUnit``. The directories are removed automatically when the context + closes. + +Keeping these handles bundled together and managed by a context manager lets +``ProtocolUnit`` implementers focus on domain logic while the engine ensures +storage gets flushed and temporary directories are cleaned. + + +Lifecycle +--------- + +``Context`` implements a `Python context manager`_. +When the context is exited, ``shared`` and ``permanent`` storage managers flushed tracked files back to their underlying :class:``ExternalStorage``. +Any ``stdout`` or ``stderr`` capture directories are also removed. + +.. _Python context manager: https://docs.python.org/3/reference/datamodel.html#context-managers + +This means each ``ProtocolUnit``'s ``shared`` and ``permanent`` object are not paths, and should not be treated as such. +Both of these are registries that track if a file should be transferred from its location in ``scratch`` to its final location after completing a unit. + +If you want to use some from ``shared`` or ``permanent``, you can use ``ctx.shared.load`` or ``ctx.permanent.load``. +This will allow your unit to fetch those objects from their storage for use. + + +Using Context inside ProtocolUnits +---------------------------------- + +Every :meth:`.ProtocolUnit._execute` definition must accept ``ctx: Context`` as its +first argument. Typical usage looks like the example below. + +.. code-block:: python + + from gufe import ProtocolUnit, Context + + class SimulationUnit(ProtocolUnit): + + @staticmethod + def _execute(ctx: Context, *, setup_result, lambda_window, settings): + scratch_path = ctx.scratch / f"lambda_{lambda_window}" + scratch_path.mkdir(exist_ok=True) + + # Read upstream artifacts from ctx.shared + system_file = ctx.shared.load(setup_result.outputs["system_file"]) + topology_file = ctx.shared.load(setup_result.outputs["topology_file"]) + + + result_path = ctx.scratch / "some_output.pdb" + # When you register the filename doesn't matter, + # just as long as you do it before you return + result_path_final_location = ctx.permanent.register(result_path) + # This is an example of running something that you want to save + simulate(output=result_path) + + # Return only lightweight metadata + return { + "lambda_window": lambda_window, + # We use this because it is already namespaced and can be used between units. + "result_path": result_path_final_location, + } + +The example above showcases how + + +Choosing between shared and permanent storage +--------------------------------------------- + +Both ``ctx.shared`` and ``ctx.permanent`` expose the same :class:`.StorageManager` +API but they serve different audiences: + +``ctx.shared`` + Optimized for communication between units in the same DAG execution. The + execution backend is free to prune these assets once no downstream unit + references them. + +``ctx.permanent`` + Intended for outputs that should survive beyond the immediate DAG, such as + user-facing reports or artifacts that will seed future runs. + +As a rule of thumb, prefer ``ctx.shared`` unless you have a clear requirement +to keep the data after the ``Protocol`` run concludes. Small scalar values or +lightweight metadata should still be returned directly from ``_execute`` so +they become part of the ``ProtocolUnitResult`` record. + + +Interaction with Protocols +-------------------------- + +``Protocol`` instances do not instantiate ``Context`` directly; they declare ``ProtocolUnit`` objects via ``Protocol._create``. +When an execution backend walks the resulting +``ProtocolDAG`` it constructs a ``Context`` for each unit using the DAG label, unit label, scratch directory, and the configured ``ExternalStorage`` implementations. +The backend might provide different ``ExternalStorage`` implementations (e.g., local filesystem, object store, cluster scratch) depending on where the work runs, but the ``Context`` API seen by ``ProtocolUnit`` authors stays consistent. + +Because the execution backend is in charge of creating the contexts, protocol authors can rely on ``ctx`` always being populated with valid storage managers and paths that are safe to write to from distributed workers. + + +Migrating from legacy Context usage +----------------------------------- + +Before ``Context`` was rewritten, it was a simple data class with two ``pathlib.Path`` handles: ``scratch`` and ``shared``. +Existing protocols adopted a variety of implicit conventions around those attributes. +Follow this checklist when migrating old protocols: + +1. **Swap file paths to StorageManager APIs.** Calls like ``ctx.shared / + "filename"`` should be replaced with the helper methods offered by + :class:`StorageManager`. For example: + +.. code-block:: python + + path = ctx.shared.scratch_dir / "myfile.dat" + +becomes: + +.. code-block:: python + + ctx.shared.register("myfile.dat") + + +2. **Avoid storing heavy objects in Python outputs.** Older protocols often + returned raw ``Path`` objects pointing at scratch files. Instead, register + the file with the storage manager and return the storage key (a string) from + ``_execute`` as shown in the example above. Downstream units can then call + :meth:`.StorageManager.load`. + +3. **Handle ctx.permanent.** There was no equivalent in the legacy API. + Decide which results must persist between DAG executions and write them via + ``ctx.permanent``. For migration you can start by mirroring whatever used + to live in ``ctx.shared`` and refine later. + +4. **Expect automatic cleanup.** Old contexts typically left stdout/stderr + directories around. The new context removes these when the unit finishes. + If your code tried to re-read the capture directories after ``_execute`` + returned, move that logic earlier or rely on the logged data captured in the + ``ProtocolUnitResult``. + +5. **Stop constructing Context manually.** + Some bespoke execution scripts once instantiated ``Context(scratch=..., shared=...)`` by hand. + That pattern is obsolete because the constructor now requires ``ExternalStorage`` objects. + Instead, rely on the execution backend to build contexts. + For unit tests use the helpers in ``gufe.storage.externalresource`` (e.g., ``MemoryStorage``) to create the necessary storage instances. + +If you need more information on how to use these concepts, checkout out: :doc:`../how-tos/protocol`. diff --git a/docs/concepts/index.rst b/docs/concepts/index.rst index c52ebaca..2f02c11a 100644 --- a/docs/concepts/index.rst +++ b/docs/concepts/index.rst @@ -12,4 +12,6 @@ utilize **gufe** APIs. tokenizables included_models serialization + storage + context logging diff --git a/docs/concepts/storage.rst b/docs/concepts/storage.rst new file mode 100644 index 00000000..59e4ed0b --- /dev/null +++ b/docs/concepts/storage.rst @@ -0,0 +1,213 @@ +.. _concepts-storage: + +How storage is handled in **gufe** +================================== + +**gufe** abstracts storage into a reusable interface using the :class:`.ExternalStorage` abstract base class. +This abstraction enables the storage of any file or byte stream using various storage backends without changing application code. + +Overview +-------- + +The storage system is designed to handle files (or byte data) that need to be stored in some location. +Instead of embedding the data, objects can store a reference (a unique string indicating the object's location, such as a path) to where the data is stored externally. This approach provides several benefits: + +* **Efficiency**: Large objects don't need to be serialized multiple times +* **Flexibility**: Different storage backends (local filesystem, cloud storage, in-memory) can be used interchangeably +* **Deduplication**: The same data can be referenced by multiple objects +* **Lazy Loading**: Data is only loaded when needed + +The Storage Architecture +------------------------- + +The storage system consists of several key components: + +``ExternalStorage`` Base Class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The :class:`.ExternalStorage` abstract base class defines the interface that all storage implementations must provide. This class provides: + +* **Store operations**: ``store_bytes()`` and ``store_path()`` to store data +* **Load operations**: ``load_stream()`` to retrieve data as a stream +* **Management**: ``exists()``, ``delete()``, and ``iter_contents()`` for managing stored data + +All storage operations use a *location* string as an identifier for the stored data. + +Storage Implementations +----------------------- + +**gufe** provides several built-in implementations of :class:`.ExternalStorage`: + +``FileStorage`` +~~~~~~~~~~~~~~~ + +The :class:`.FileStorage` implementation stores data on the local filesystem. It requires a root directory path and organizes stored files using the location string as a relative path: + +.. code-block:: python + + from pathlib import Path + from gufe.storage.externalresource import FileStorage + + # Create a file storage backend + storage = FileStorage(root_dir=Path("/path/to/storage")) + + # Store some data + data = b"Hello, World!" + storage.store_bytes("datasets/sample1.txt", data) + + # Check if data exists + if storage.exists("datasets/sample1.txt"): + # Load the data + with storage.load_stream("datasets/sample1.txt") as stream: + loaded_data = stream.read() + assert loaded_data == data + + # Delete the data + storage.delete("datasets/sample1.txt") + +``FileStorage`` automatically creates any necessary parent directories when storing files. + +``MemoryStorage`` +~~~~~~~~~~~~~~~~~ + +The :class:`.MemoryStorage` implementation stores data in a Python dictionary. This is primarily useful for testing and prototyping: + +.. code-block:: python + + from gufe.storage.externalresource import MemoryStorage + + # Create an in-memory storage backend + storage = MemoryStorage() + + # Store some data + data = b"Hello, World!" + storage.store_bytes("datasets/sample1.txt", data) + + # Load the data back + with storage.load_stream("datasets/sample1.txt") as stream: + loaded_data = stream.read() + +.. warning:: + ``MemoryStorage`` is not intended for production use and all data is lost when the Python process exits. + +Implementing Custom Storage Backends +------------------------------------- + +To create a custom storage backend, subclass :class:`.ExternalStorage` and implement all the abstract methods: + +.. code-block:: python + + from gufe.storage.externalresource.base import ExternalStorage + from typing import ContextManager + + class MyCustomStorage(ExternalStorage): + """A custom storage implementation.""" + + def _store_bytes(self, location: str, byte_data: bytes): + """Store bytes at the given location.""" + # Implement storage logic + pass + + def _store_path(self, location: str, path): + """Store a file at the given path.""" + # Implement storage logic + pass + + def _load_stream(self, location: str) -> ContextManager: + """Return a context manager that yields a bytes-like object.""" + # Implement loading logic + pass + + def _exists(self, location: str) -> bool: + """Check if data exists at the location.""" + # Implement existence check + pass + + def _delete(self, location: str): + """Delete data at the location.""" + # Implement deletion logic + pass + + def _get_filename(self, location: str) -> str: + """Return a filename for the location.""" + # Implement filename generation + pass + + def _iter_contents(self, prefix: str = ""): + """Iterate over stored locations matching the prefix.""" + # Implement iteration logic + pass + + def _get_hexdigest(self, location: str) -> str: + """Return MD5 hexdigest of the data (optional override).""" + # Can override for performance improvements + pass + +.. note:: + All storage methods should be blocking operations, even if the underlying storage backend supports asynchronous operations. + +StorageManager +-------------- + +The :class:`.StorageManager` class provides a higher-level interface for managing storage operations within a computational workflow. +.. note:: + + ``StorageManager`` is largely used by the :class:`.Context` class and should not be instantiated in protocols. + In general, protocol developers will only use the ``register`` and ``load`` functions. + +It handles the transfer of files between a scratch directory and external storage (such as shared or permanent storage): + +.. code-block:: python + + from pathlib import Path + from gufe.storage import StorageManager + from gufe.storage.externalresource import FileStorage + + # Set up storage + storage = FileStorage(root_dir=Path("/path/to/storage")) + scratch_dir = Path("/path/to/scratch") + + # Create a storage manager for a specific DAG and unit + manager = StorageManager( + scratch_dir=scratch_dir, + storage=storage, + dag_label="my_experiment", + unit_label="transformation_1" + ) + + # Register files for later transfer + out = manager.register("trajectory.dcd") + out2 = manager.register("results.json") + # Note: out and out2 are pre-namespaced values that allow storage items to be passed around + + # Transfer all registered files to external storage + manager._transfer() + + # Load files from external storage + trajectory_data = manager.load(out) + results_json = manager.load(out2) + +The ``StorageManager`` uses a namespace combining the ``dag_label`` and ``unit_label`` to organize files in the external storage backend. +To see how these work in practice see our documentation on :doc:`context`. + + +Error Handling +-------------- + +The storage system defines several exceptions in :mod:`gufe.storage.errors`: + +* :class:`.ExternalResourceError`: Base class for storage-related errors +* :class:`.MissingExternalResourceError`: Raised when attempting to access non-existent data +* :class:`.ChangedExternalResourceError`: Raised when metadata verification fails + +These exceptions can be caught and handled appropriately in application code: + +.. code-block:: python + + from gufe.storage.errors import MissingExternalResourceError + + try: + with storage.load_stream("nonexistent_file.txt") as stream: + data = stream.read() + except MissingExternalResourceError: + print("File not found in storage")