172 changes: 172 additions & 0 deletions docs/concepts/context.rst
@@ -0,0 +1,172 @@
``ProtocolUnit`` execution ``Context``
======================================

:class:`.Context` instances carry the execution environment for individual :class:`.ProtocolUnit` executions.
They are created by the execution engine just before a unit is executed and discarded once the unit returns.
The class acts as a thin wrapper around two :class:`.StorageManager` objects (shared and permanent) and a scratch directory.


Why Context exists
------------------

``ProtocolUnit`` code frequently needs a few shared facilities:

``scratch``
    A temporary directory that the unit can freely write to while it runs.
    Files written here are considered ephemeral; the engine may delete them as
    soon as the unit finishes.

``shared``
    A :class:`.StorageManager` backed by short-lived :class:`.ExternalStorage`.
    Use this to hand large files to downstream units without serializing the
    payloads through Python return values.

``permanent``
    Another :class:`.StorageManager` targeting long-term storage. Results saved
    here survive beyond the life of the :class:`.ProtocolDAG` run (for example
    for inspection or for reuse in future extensions).

``stdout`` / ``stderr``
    Optional directories where the engine captures subprocess output triggered
    by the ``ProtocolUnit``. The directories are removed automatically when the
    context closes.

Keeping these handles bundled together and managed by a context manager lets
``ProtocolUnit`` implementers focus on domain logic while the engine ensures
storage gets flushed and temporary directories are cleaned.


Lifecycle
---------

``Context`` implements a `Python context manager`_.
When the context is exited, the ``shared`` and ``permanent`` storage managers flush tracked files back to their underlying :class:`.ExternalStorage`.
Any ``stdout`` or ``stderr`` capture directories are also removed.

.. _Python context manager: https://docs.python.org/3/reference/datamodel.html#context-managers

This means each ``ProtocolUnit``'s ``shared`` and ``permanent`` objects are not paths, and should not be treated as such.
Both are registries that track whether a file should be transferred from its location in ``scratch`` to its final destination once the unit completes.
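To make the registry-not-path distinction concrete, the sketch below uses a toy stand-in class (hypothetical, not gufe's actual ``StorageManager``) to show that registering a file yields a namespaced string key rather than a filesystem path:

```python
from pathlib import Path

class ToyRegistry:
    """Toy stand-in for ctx.shared / ctx.permanent (not gufe's real class)."""

    def __init__(self, namespace: str):
        self.namespace = namespace
        self.registered: list[str] = []  # files queued for transfer on exit

    def register(self, filename: str) -> str:
        # Track the file for later transfer and hand back a namespaced
        # string key, deliberately NOT a Path.
        self.registered.append(filename)
        return f"{self.namespace}/{filename}"

permanent = ToyRegistry("my_dag/my_unit")
key = permanent.register("results.json")

assert key == "my_dag/my_unit/results.json"
assert not isinstance(key, Path)  # a key to pass between units, not a path
```

The key can be returned from one unit and handed to another, which resolves it through its own storage manager.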

To access data from ``shared`` or ``permanent``, you can use ``ctx.shared.load`` or ``ctx.permanent.load``.
This will allow your unit to fetch those objects from their storage for use.


Using Context inside ProtocolUnits
----------------------------------

Every :meth:`.ProtocolUnit._execute` definition must accept ``ctx: Context`` as its
first argument. Typical usage looks like the example below.

.. code-block:: python

    from gufe import ProtocolUnit, Context

    class SimulationUnit(ProtocolUnit):

        @staticmethod
        def _execute(ctx: Context, *, setup_result, lambda_window, settings):
            scratch_path = ctx.scratch / f"lambda_{lambda_window}"
            scratch_path.mkdir(exist_ok=True)

            # Read upstream artifacts from ctx.shared
            system_file = ctx.shared.load(setup_result.outputs["system_file"])
            topology_file = ctx.shared.load(setup_result.outputs["topology_file"])

            result_path = ctx.scratch / "some_output.pdb"
            # The point at which you register doesn't matter, as long as
            # it happens before the unit returns
            result_path_final_location = ctx.permanent.register(result_path)
            # Run the simulation that writes the file registered above
            simulate(output=result_path)

            # Return only lightweight metadata
            return {
                "lambda_window": lambda_window,
                # The registered key is already namespaced and can be
                # passed between units.
                "result_path": result_path_final_location,
            }

The example above showcases how ``scratch``, ``shared``, and ``permanent`` work together inside a single unit: intermediate files are written to ``scratch``, artifacts meant to outlive the unit are registered for transfer, and only lightweight metadata is returned.

Choosing between shared and permanent storage
---------------------------------------------

Both ``ctx.shared`` and ``ctx.permanent`` expose the same :class:`.StorageManager`
API but they serve different audiences:

``ctx.shared``
    Optimized for communication between units in the same DAG execution. The
    execution backend is free to prune these assets once no downstream unit
    references them.

``ctx.permanent``
    Intended for outputs that should survive beyond the immediate DAG, such as
    user-facing reports or artifacts that will seed future runs.

As a rule of thumb, prefer ``ctx.shared`` unless you have a clear requirement
to keep the data after the ``Protocol`` run concludes. Small scalar values or
lightweight metadata should still be returned directly from ``_execute`` so
they become part of the ``ProtocolUnitResult`` record.
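As an illustration of that rule of thumb, a unit's return value might look like the toy dictionary below. The keys and values here are hypothetical; the shape is the point — scalars returned inline, bulky artifacts referenced only by their registered storage keys (plain strings):

```python
# Hypothetical _execute return value: lightweight metadata inline,
# heavy artifacts referenced by registered storage keys (strings).
unit_output = {
    "lambda_window": 3,                       # small scalar: return directly
    "free_energy_estimate": -4.2,             # small scalar: return directly
    "trajectory": "my_dag/my_unit/traj.dcd",  # hypothetical ctx.shared key
    "report": "my_dag/my_unit/report.html",   # hypothetical ctx.permanent key
}

# Everything in the record is cheap to serialize; no bulk payloads inside.
assert all(isinstance(v, (int, float, str)) for v in unit_output.values())
```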


Interaction with Protocols
--------------------------

``Protocol`` instances do not instantiate ``Context`` directly; they declare ``ProtocolUnit`` objects via ``Protocol._create``.
When an execution backend walks the resulting
``ProtocolDAG`` it constructs a ``Context`` for each unit using the DAG label, unit label, scratch directory, and the configured ``ExternalStorage`` implementations.
The backend might provide different ``ExternalStorage`` implementations (e.g., local filesystem, object store, cluster scratch) depending on where the work runs, but the ``Context`` API seen by ``ProtocolUnit`` authors stays consistent.

Because the execution backend is in charge of creating the contexts, protocol authors can rely on ``ctx`` always being populated with valid storage managers and paths that are safe to write to from distributed workers.


Migrating from legacy Context usage
-----------------------------------

Before ``Context`` was rewritten, it was a simple data class with two ``pathlib.Path`` handles: ``scratch`` and ``shared``.
Existing protocols adopted a variety of implicit conventions around those attributes.
Follow this checklist when migrating old protocols:

1. **Swap file paths to StorageManager APIs.** Calls like ``ctx.shared /
   "filename"`` should be replaced with the helper methods offered by
   :class:`.StorageManager`. For example:

   .. code-block:: python

       path = ctx.shared / "myfile.dat"

   becomes:

   .. code-block:: python

       ctx.shared.register("myfile.dat")

2. **Avoid storing heavy objects in Python outputs.** Older protocols often
   returned raw ``Path`` objects pointing at scratch files. Instead, register
   the file with the storage manager and return the storage key (a string) from
   ``_execute`` as shown in the example above. Downstream units can then call
   :meth:`.StorageManager.load`.

3. **Handle ctx.permanent.** There was no equivalent in the legacy API.
   Decide which results must persist between DAG executions and write them via
   ``ctx.permanent``. For migration you can start by mirroring whatever used
   to live in ``ctx.shared`` and refine later.

4. **Expect automatic cleanup.** Old contexts typically left stdout/stderr
   directories around. The new context removes these when the unit finishes.
   If your code tried to re-read the capture directories after ``_execute``
   returned, move that logic earlier or rely on the logged data captured in
   the ``ProtocolUnitResult``.

5. **Stop constructing Context manually.** Some bespoke execution scripts once
   instantiated ``Context(scratch=..., shared=...)`` by hand. That pattern is
   obsolete because the constructor now requires ``ExternalStorage`` objects.
   Instead, rely on the execution backend to build contexts. For unit tests
   use the helpers in ``gufe.storage.externalresource`` (e.g.,
   ``MemoryStorage``) to create the necessary storage instances.

If you need more information on how to use these concepts, check out :doc:`../how-tos/protocol`.
2 changes: 2 additions & 0 deletions docs/concepts/index.rst
@@ -12,4 +12,6 @@ utilize **gufe** APIs.
tokenizables
included_models
serialization
storage
context
logging
213 changes: 213 additions & 0 deletions docs/concepts/storage.rst
@@ -0,0 +1,213 @@
.. _concepts-storage:

How storage is handled in **gufe**
==================================

**gufe** abstracts storage into a reusable interface using the :class:`.ExternalStorage` abstract base class.
This abstraction enables the storage of any file or byte stream using various storage backends without changing application code.

Overview
--------

The storage system is designed to handle files (or byte data) that need to be stored in some location.
Instead of embedding the data, objects can store a reference (a unique string indicating the object's location, such as a path) to where the data is stored externally. This approach provides several benefits:

* **Efficiency**: Large objects don't need to be serialized multiple times
* **Flexibility**: Different storage backends (local filesystem, cloud storage, in-memory) can be used interchangeably
* **Deduplication**: The same data can be referenced by multiple objects
* **Lazy Loading**: Data is only loaded when needed
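The reference-based design behind these benefits can be illustrated with a toy key-value store (plain Python, not gufe's API): the payload is stored once, and any number of records point at it via its location string.

```python
# Toy illustration of storing by reference rather than embedding.
store: dict[str, bytes] = {}
store["data/traj-001"] = b"...large trajectory bytes..."

# Two records reference the same payload by its location string.
record_a = {"name": "complex_leg", "trajectory": "data/traj-001"}
record_b = {"name": "solvent_leg", "trajectory": "data/traj-001"}

# The bytes are stored once (deduplication) and only fetched on demand
# (lazy loading); the records themselves stay small.
assert store[record_a["trajectory"]] is store[record_b["trajectory"]]
```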

The Storage Architecture
-------------------------

The storage system consists of several key components:

``ExternalStorage`` Base Class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`.ExternalStorage` abstract base class defines the interface that all storage implementations must provide. This class provides:

* **Store operations**: ``store_bytes()`` and ``store_path()`` to store data
* **Load operations**: ``load_stream()`` to retrieve data as a stream
* **Management**: ``exists()``, ``delete()``, and ``iter_contents()`` for managing stored data

All storage operations use a *location* string as an identifier for the stored data.

Storage Implementations
-----------------------

**gufe** provides several built-in implementations of :class:`.ExternalStorage`:

``FileStorage``
~~~~~~~~~~~~~~~

The :class:`.FileStorage` implementation stores data on the local filesystem. It requires a root directory path and organizes stored files using the location string as a relative path:

.. code-block:: python

    from pathlib import Path
    from gufe.storage.externalresource import FileStorage

    # Create a file storage backend
    storage = FileStorage(root_dir=Path("/path/to/storage"))

    # Store some data
    data = b"Hello, World!"
    storage.store_bytes("datasets/sample1.txt", data)

    # Check if data exists
    if storage.exists("datasets/sample1.txt"):
        # Load the data
        with storage.load_stream("datasets/sample1.txt") as stream:
            loaded_data = stream.read()
        assert loaded_data == data

    # Delete the data
    storage.delete("datasets/sample1.txt")

``FileStorage`` automatically creates any necessary parent directories when storing files.

``MemoryStorage``
~~~~~~~~~~~~~~~~~

The :class:`.MemoryStorage` implementation stores data in a Python dictionary. This is primarily useful for testing and prototyping:

.. code-block:: python

    from gufe.storage.externalresource import MemoryStorage

    # Create an in-memory storage backend
    storage = MemoryStorage()

    # Store some data
    data = b"Hello, World!"
    storage.store_bytes("datasets/sample1.txt", data)

    # Load the data back
    with storage.load_stream("datasets/sample1.txt") as stream:
        loaded_data = stream.read()

.. warning::

    ``MemoryStorage`` is not intended for production use and all data is lost when the Python process exits.

Implementing Custom Storage Backends
-------------------------------------

To create a custom storage backend, subclass :class:`.ExternalStorage` and implement all the abstract methods:

.. code-block:: python

    from gufe.storage.externalresource.base import ExternalStorage
    from typing import ContextManager

    class MyCustomStorage(ExternalStorage):
        """A custom storage implementation."""

        def _store_bytes(self, location: str, byte_data: bytes):
            """Store bytes at the given location."""
            # Implement storage logic
            pass

        def _store_path(self, location: str, path):
            """Store a file at the given path."""
            # Implement storage logic
            pass

        def _load_stream(self, location: str) -> ContextManager:
            """Return a context manager that yields a bytes-like object."""
            # Implement loading logic
            pass

        def _exists(self, location: str) -> bool:
            """Check if data exists at the location."""
            # Implement existence check
            pass

        def _delete(self, location: str):
            """Delete data at the location."""
            # Implement deletion logic
            pass

        def _get_filename(self, location: str) -> str:
            """Return a filename for the location."""
            # Implement filename generation
            pass

        def _iter_contents(self, prefix: str = ""):
            """Iterate over stored locations matching the prefix."""
            # Implement iteration logic
            pass

        def _get_hexdigest(self, location: str) -> str:
            """Return MD5 hexdigest of the data (optional override)."""
            # Can override for performance improvements
            pass

.. note::

    All storage methods should be blocking operations, even if the underlying storage backend supports asynchronous operations.
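To see the skeleton above filled in, here is a minimal dict-backed sketch of the same interface. It is written standalone (it does not subclass gufe's ``ExternalStorage``, and the method names drop the leading underscore) so the behaviour of each operation is easy to follow:

```python
import hashlib
import io
from contextlib import contextmanager

class DictBackedStorage:
    """Standalone sketch of the ExternalStorage interface (not gufe's class)."""

    def __init__(self):
        self._data: dict[str, bytes] = {}

    def store_bytes(self, location: str, byte_data: bytes):
        self._data[location] = byte_data

    def store_path(self, location: str, path):
        with open(path, "rb") as f:
            self._data[location] = f.read()

    @contextmanager
    def load_stream(self, location: str):
        # Yield a bytes-like stream, mirroring load_stream's contract
        yield io.BytesIO(self._data[location])

    def exists(self, location: str) -> bool:
        return location in self._data

    def delete(self, location: str):
        del self._data[location]

    def get_filename(self, location: str) -> str:
        return location.split("/")[-1]

    def iter_contents(self, prefix: str = ""):
        yield from (loc for loc in self._data if loc.startswith(prefix))

    def get_hexdigest(self, location: str) -> str:
        return hashlib.md5(self._data[location]).hexdigest()

storage = DictBackedStorage()
storage.store_bytes("dag/unit/out.txt", b"payload")
assert storage.exists("dag/unit/out.txt")
with storage.load_stream("dag/unit/out.txt") as stream:
    assert stream.read() == b"payload"
assert list(storage.iter_contents("dag/")) == ["dag/unit/out.txt"]
```

The same blocking, location-string-driven shape carries over to real backends, whatever the storage medium.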

StorageManager
--------------

The :class:`.StorageManager` class provides a higher-level interface for managing storage operations within a computational workflow.

.. note::

    ``StorageManager`` is largely used by the :class:`.Context` class and should not be instantiated in protocols.
    In general, protocol developers will only use the ``register`` and ``load`` functions.

It handles the transfer of files between a scratch directory and external storage (such as shared or permanent storage):

.. code-block:: python

    from pathlib import Path
    from gufe.storage import StorageManager
    from gufe.storage.externalresource import FileStorage

    # Set up storage
    storage = FileStorage(root_dir=Path("/path/to/storage"))
    scratch_dir = Path("/path/to/scratch")

    # Create a storage manager for a specific DAG and unit
    manager = StorageManager(
        scratch_dir=scratch_dir,
        storage=storage,
        dag_label="my_experiment",
        unit_label="transformation_1",
    )

    # Register files for later transfer
    out = manager.register("trajectory.dcd")
    out2 = manager.register("results.json")
    # Note: out and out2 are pre-namespaced values that allow storage
    # items to be passed around

    # Transfer all registered files to external storage
    manager._transfer()

    # Load files from external storage
    trajectory_data = manager.load(out)
    results_json = manager.load(out2)

The ``StorageManager`` uses a namespace combining the ``dag_label`` and ``unit_label`` to organize files in the external storage backend.
To see how these work in practice see our documentation on :doc:`context`.


Error Handling
--------------

The storage system defines several exceptions in :mod:`gufe.storage.errors`:

* :class:`.ExternalResourceError`: Base class for storage-related errors
* :class:`.MissingExternalResourceError`: Raised when attempting to access non-existent data
* :class:`.ChangedExternalResourceError`: Raised when metadata verification fails

These exceptions can be caught and handled appropriately in application code:

.. code-block:: python

    from gufe.storage.errors import MissingExternalResourceError

    try:
        with storage.load_stream("nonexistent_file.txt") as stream:
            data = stream.read()
    except MissingExternalResourceError:
        print("File not found in storage")