-
Notifications
You must be signed in to change notification settings - Fork 12
docs: Context and Storage #712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feat/warehouse
Are you sure you want to change the base?
Changes from all commits
0800b03
5c428eb
edd1d8b
66a40fb
adbad03
94469fd
c4a910d
a2a4b80
355a4ce
497eb8c
e42fadf
200973a
ba76579
3fbe18f
8f1ce78
83ee929
be26dfc
bdc284a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,172 @@ | ||
| ``ProtocolUnit`` execution ``Context`` | ||
| ====================================== | ||
|
|
||
| :class:`.Context` instances carry the execution environment for individual :class:`.ProtocolUnit` executions. | ||
| They are created by the execution engine just before a unit is excuted and discarded once the unit returns. | ||
| The class acts as a thin wrapper around two :class:`.StorageManager` objects (shared and permanent) and a scratch directory. | ||
|
|
||
|
|
||
| Why Context exists | ||
| ------------------ | ||
|
|
||
| ``ProtocolUnit`` code frequently needs a few shared facilities: | ||
|
|
||
| ``scratch`` | ||
| A temporary directory that the unit can freely write to while it runs. | ||
| Files written here are considered ephemeral; the engine may delete them as | ||
| soon as the unit finishes. | ||
|
|
||
| ``shared`` | ||
| A :class:`.StorageManager` backed by | ||
| short–lived :class:`.ExternalStorage`. Use | ||
| this to hand large files to downstream units without serializing the | ||
| payloads through Python return values. | ||
|
|
||
| ``permanent`` | ||
| Another :class:`.StorageManager` targeting long–term storage. Results saved here | ||
| survive beyond the life of the :class:`.ProtocolDAG` run (for example for | ||
| inspection or for reuse in future extensions). | ||
|
|
||
| ``stdout`` / ``stderr`` | ||
| Optional directories where the engine captures subprocess output triggered | ||
| by the ``ProtocolUnit``. The directories are removed automatically when the context | ||
| closes. | ||
|
|
||
| Keeping these handles bundled together and managed by a context manager lets | ||
| ``ProtocolUnit`` implementers focus on domain logic while the engine ensures | ||
| storage gets flushed and temporary directories are cleaned. | ||
|
|
||
|
|
||
| Lifecycle | ||
| --------- | ||
|
|
||
| ``Context`` implements a `Python context manager`_. | ||
| When the context is exited, ``shared`` and ``permanent`` storage managers flushed tracked files back to their underlying :class:``ExternalStorage``. | ||
| Any ``stdout`` or ``stderr`` capture directories are also removed. | ||
|
|
||
| .. _Python context manager: https://docs.python.org/3/reference/datamodel.html#context-managers | ||
|
|
||
| This means each ``ProtocolUnit``'s ``shared`` and ``permanent`` object are not paths, and should not be treated as such. | ||
| Both of these are registries that track if a file should be transferred from its location in ``scratch`` to its final location after completing a unit. | ||
|
|
||
| If you want to use some from ``shared`` or ``permanent``, you can use ``ctx.shared.load`` or ``ctx.permanent.load``. | ||
| This will allow your unit to fetch those objects from their storage for use. | ||
|
|
||
|
|
||
| Using Context inside ProtocolUnits | ||
| ---------------------------------- | ||
|
|
||
| Every :meth:`.ProtocolUnit._execute` definition must accept ``ctx: Context`` as its | ||
| first argument. Typical usage looks like the example below. | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from gufe import ProtocolUnit, Context | ||
|
|
||
| class SimulationUnit(ProtocolUnit): | ||
|
|
||
| @staticmethod | ||
| def _execute(ctx: Context, *, setup_result, lambda_window, settings): | ||
| scratch_path = ctx.scratch / f"lambda_{lambda_window}" | ||
| scratch_path.mkdir(exist_ok=True) | ||
|
|
||
| # Read upstream artifacts from ctx.shared | ||
| system_file = ctx.shared.load(setup_result.outputs["system_file"]) | ||
| topology_file = ctx.shared.load(setup_result.outputs["topology_file"]) | ||
|
|
||
|
|
||
| result_path = ctx.scratch / "some_output.pdb" | ||
| # When you register the filename doesn't matter, | ||
atravitz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| # just as long as you do it before you return | ||
| result_path_final_location = ctx.permanent.register(result_path) | ||
| # This is an example of running something that you want to save | ||
| simulate(output=result_path) | ||
|
|
||
| # Return only lightweight metadata | ||
| return { | ||
| "lambda_window": lambda_window, | ||
| # We use this because it is already namespaced and can be used between units. | ||
| "result_path": result_path_final_location, | ||
| } | ||
|
|
||
| The example above showcases how | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Incomplete sentence?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bumping this. |
||
|
|
||
|
|
||
| Choosing between shared and permanent storage | ||
| --------------------------------------------- | ||
|
|
||
| Both ``ctx.shared`` and ``ctx.permanent`` expose the same :class:`.StorageManager` | ||
| API but they serve different audiences: | ||
|
|
||
| ``ctx.shared`` | ||
| Optimized for communication between units in the same DAG execution. The | ||
atravitz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| execution backend is free to prune these assets once no downstream unit | ||
| references them. | ||
|
|
||
| ``ctx.permanent`` | ||
| Intended for outputs that should survive beyond the immediate DAG, such as | ||
| user-facing reports or artifacts that will seed future runs. | ||
|
|
||
| As a rule of thumb, prefer ``ctx.shared`` unless you have a clear requirement | ||
| to keep the data after the ``Protocol`` run concludes. Small scalar values or | ||
| lightweight metadata should still be returned directly from ``_execute`` so | ||
| they become part of the ``ProtocolUnitResult`` record. | ||
|
|
||
|
|
||
| Interaction with Protocols | ||
ethanholz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| -------------------------- | ||
|
|
||
| ``Protocol`` instances do not instantiate ``Context`` directly; they declare ``ProtocolUnit`` objects via ``Protocol._create``. | ||
| When an execution backend walks the resulting | ||
| ``ProtocolDAG`` it constructs a ``Context`` for each unit using the DAG label, unit label, scratch directory, and the configured ``ExternalStorage`` implementations. | ||
| The backend might provide different ``ExternalStorage`` implementations (e.g., local filesystem, object store, cluster scratch) depending on where the work runs, but the ``Context`` API seen by ``ProtocolUnit`` authors stays consistent. | ||
|
|
||
| Because the execution backend is in charge of creating the contexts, protocol authors can rely on ``ctx`` always being populated with valid storage managers and paths that are safe to write to from distributed workers. | ||
|
|
||
|
|
||
| Migrating from legacy Context usage | ||
| ----------------------------------- | ||
|
|
||
| Before ``Context`` was rewritten, it was a simple data class with two ``pathlib.Path`` handles: ``scratch`` and ``shared``. | ||
| Existing protocols adopted a variety of implicit conventions around those attributes. | ||
| Follow this checklist when migrating old protocols: | ||
|
|
||
| 1. **Swap file paths to StorageManager APIs.** Calls like ``ctx.shared / | ||
| "filename"`` should be replaced with the helper methods offered by | ||
| :class:`StorageManager`. For example: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| path = ctx.shared.scratch_dir / "myfile.dat" | ||
|
|
||
| becomes: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| ctx.shared.register("myfile.dat") | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if your underlying code cannot handle Path like objects? For example, the multistate sampler stores the checkpoint file using a filename that is relative to the main storage path, so when you pass the file it's just
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do I get the full path?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So you can use registration as well, however we assume that everything is stored in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Potentially add just dumping loaded objects as Pathlib objects. |
||
|
|
||
|
|
||
| 2. **Avoid storing heavy objects in Python outputs.** Older protocols often | ||
| returned raw ``Path`` objects pointing at scratch files. Instead, register | ||
| the file with the storage manager and return the storage key (a string) from | ||
| ``_execute`` as shown in the example above. Downstream units can then call | ||
| :meth:`.StorageManager.load`. | ||
|
|
||
| 3. **Handle ctx.permanent.** There was no equivalent in the legacy API. | ||
| Decide which results must persist between DAG executions and write them via | ||
| ``ctx.permanent``. For migration you can start by mirroring whatever used | ||
| to live in ``ctx.shared`` and refine later. | ||
|
|
||
| 4. **Expect automatic cleanup.** Old contexts typically left stdout/stderr | ||
| directories around. The new context removes these when the unit finishes. | ||
| If your code tried to re-read the capture directories after ``_execute`` | ||
| returned, move that logic earlier or rely on the logged data captured in the | ||
| ``ProtocolUnitResult``. | ||
|
|
||
| 5. **Stop constructing Context manually.** | ||
ethanholz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| Some bespoke execution scripts once instantiated ``Context(scratch=..., shared=...)`` by hand. | ||
IAlibay marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| That pattern is obsolete because the constructor now requires ``ExternalStorage`` objects. | ||
| Instead, rely on the execution backend to build contexts. | ||
| For unit tests use the helpers in ``gufe.storage.externalresource`` (e.g., ``MemoryStorage``) to create the necessary storage instances. | ||
|
|
||
| If you need more information on how to use these concepts, checkout out: :doc:`../how-tos/protocol`. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -12,4 +12,6 @@ utilize **gufe** APIs. | |
| tokenizables | ||
| included_models | ||
| serialization | ||
| storage | ||
| context | ||
| logging | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,213 @@ | ||
| .. _concepts-storage: | ||
|
|
||
| How storage is handled in **gufe** | ||
| ================================== | ||
|
|
||
| **gufe** abstracts storage into a reusable interface using the :class:`.ExternalStorage` abstract base class. | ||
| This abstraction enables the storage of any file or byte stream using various storage backends without changing application code. | ||
|
|
||
| Overview | ||
| -------- | ||
|
|
||
| The storage system is designed to handle files (or byte data) that need to be stored in some location. | ||
| Instead of embedding the data, objects can store a reference (a unique string indicating the object's location, such as a path) to where the data is stored externally. This approach provides several benefits: | ||
|
|
||
| * **Efficiency**: Large objects don't need to be serialized multiple times | ||
| * **Flexibility**: Different storage backends (local filesystem, cloud storage, in-memory) can be used interchangeably | ||
| * **Deduplication**: The same data can be referenced by multiple objects | ||
| * **Lazy Loading**: Data is only loaded when needed | ||
|
|
||
| The Storage Architecture | ||
| ------------------------- | ||
|
|
||
| The storage system consists of several key components: | ||
|
|
||
| ``ExternalStorage`` Base Class | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The :class:`.ExternalStorage` abstract base class defines the interface that all storage implementations must provide. This class provides: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can see a
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
|
||
| * **Store operations**: ``store_bytes()`` and ``store_path()`` to store data | ||
| * **Load operations**: ``load_stream()`` to retrieve data as a stream | ||
| * **Management**: ``exists()``, ``delete()``, and ``iter_contents()`` for managing stored data | ||
IAlibay marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| All storage operations use a *location* string as an identifier for the stored data. | ||
|
|
||
| Storage Implementations | ||
| ----------------------- | ||
|
|
||
| **gufe** provides several built-in implementations of :class:`.ExternalStorage`: | ||
|
|
||
| ``FileStorage`` | ||
| ~~~~~~~~~~~~~~~ | ||
|
|
||
| The :class:`.FileStorage` implementation stores data on the local filesystem. It requires a root directory path and organizes stored files using the location string as a relative path: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from pathlib import Path | ||
| from gufe.storage.externalresource import FileStorage | ||
|
|
||
| # Create a file storage backend | ||
| storage = FileStorage(root_dir=Path("/path/to/storage")) | ||
|
|
||
| # Store some data | ||
| data = b"Hello, World!" | ||
| storage.store_bytes("datasets/sample1.txt", data) | ||
|
|
||
| # Check if data exists | ||
| if storage.exists("datasets/sample1.txt"): | ||
| # Load the data | ||
| with storage.load_stream("datasets/sample1.txt") as stream: | ||
| loaded_data = stream.read() | ||
| assert loaded_data == data | ||
|
|
||
| # Delete the data | ||
| storage.delete("datasets/sample1.txt") | ||
|
|
||
| ``FileStorage`` automatically creates any necessary parent directories when storing files. | ||
|
|
||
| ``MemoryStorage`` | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The :class:`.MemoryStorage` implementation stores data in a Python dictionary. This is primarily useful for testing and prototyping: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from gufe.storage.externalresource import MemoryStorage | ||
|
|
||
| # Create an in-memory storage backend | ||
| storage = MemoryStorage() | ||
|
|
||
| # Store some data | ||
| data = b"Hello, World!" | ||
| storage.store_bytes("datasets/sample1.txt", data) | ||
|
|
||
| # Load the data back | ||
| with storage.load_stream("datasets/sample1.txt") as stream: | ||
| loaded_data = stream.read() | ||
|
|
||
| .. warning:: | ||
| ``MemoryStorage`` is not intended for production use and all data is lost when the Python process exits. | ||
|
|
||
| Implementing Custom Storage Backends | ||
| ------------------------------------- | ||
|
|
||
| To create a custom storage backend, subclass :class:`.ExternalStorage` and implement all the abstract methods: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from gufe.storage.externalresource.base import ExternalStorage | ||
| from typing import ContextManager | ||
|
|
||
| class MyCustomStorage(ExternalStorage): | ||
| """A custom storage implementation.""" | ||
|
|
||
| def _store_bytes(self, location: str, byte_data: bytes): | ||
| """Store bytes at the given location.""" | ||
| # Implement storage logic | ||
| pass | ||
|
|
||
| def _store_path(self, location: str, path): | ||
atravitz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| """Store a file at the given path.""" | ||
| # Implement storage logic | ||
| pass | ||
|
|
||
| def _load_stream(self, location: str) -> ContextManager: | ||
| """Return a context manager that yields a bytes-like object.""" | ||
| # Implement loading logic | ||
| pass | ||
|
|
||
| def _exists(self, location: str) -> bool: | ||
| """Check if data exists at the location.""" | ||
| # Implement existence check | ||
| pass | ||
|
|
||
| def _delete(self, location: str): | ||
| """Delete data at the location.""" | ||
| # Implement deletion logic | ||
| pass | ||
|
|
||
| def _get_filename(self, location: str) -> str: | ||
| """Return a filename for the location.""" | ||
| # Implement filename generation | ||
| pass | ||
|
|
||
| def _iter_contents(self, prefix: str = ""): | ||
| """Iterate over stored locations matching the prefix.""" | ||
| # Implement iteration logic | ||
| pass | ||
|
|
||
| def _get_hexdigest(self, location: str) -> str: | ||
| """Return MD5 hexdigest of the data (optional override).""" | ||
| # Can override for performance improvements | ||
| pass | ||
|
|
||
| .. note:: | ||
| All storage methods should be blocking operations, even if the underlying storage backend supports asynchronous operations. | ||
|
|
||
| StorageManager | ||
| -------------- | ||
|
|
||
| The :class:`.StorageManager` class provides a higher-level interface for managing storage operations within a computational workflow. | ||
| .. note:: | ||
|
|
||
| ``StorageManager`` is largely used by the :class:`.Context` class and should not be instantiated in protocols. | ||
| In general, protocol developers will only use the ``register`` and ``load`` functions. | ||
|
|
||
| It handles the transfer of files between a scratch directory and external storage (such as shared or permanent storage): | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from pathlib import Path | ||
| from gufe.storage import StorageManager | ||
| from gufe.storage.externalresource import FileStorage | ||
|
|
||
| # Set up storage | ||
| storage = FileStorage(root_dir=Path("/path/to/storage")) | ||
| scratch_dir = Path("/path/to/scratch") | ||
|
|
||
| # Create a storage manager for a specific DAG and unit | ||
| manager = StorageManager( | ||
| scratch_dir=scratch_dir, | ||
| storage=storage, | ||
| dag_label="my_experiment", | ||
| unit_label="transformation_1" | ||
| ) | ||
|
|
||
| # Register files for later transfer | ||
| out = manager.register("trajectory.dcd") | ||
| out2 = manager.register("results.json") | ||
| # Note: out and out2 are pre-namespaced values that allow storage items to be passed around | ||
|
|
||
| # Transfer all registered files to external storage | ||
| manager._transfer() | ||
|
|
||
| # Load files from external storage | ||
| trajectory_data = manager.load(out) | ||
| results_json = manager.load(out2) | ||
|
|
||
| The ``StorageManager`` uses a namespace combining the ``dag_label`` and ``unit_label`` to organize files in the external storage backend. | ||
| To see how these work in practice see our documentation on :doc:`context`. | ||
|
|
||
|
|
||
| Error Handling | ||
| -------------- | ||
|
|
||
| The storage system defines several exceptions in :mod:`gufe.storage.errors`: | ||
|
|
||
| * :class:`.ExternalResourceError`: Base class for storage-related errors | ||
| * :class:`.MissingExternalResourceError`: Raised when attempting to access non-existent data | ||
| * :class:`.ChangedExternalResourceError`: Raised when metadata verification fails | ||
|
|
||
| These exceptions can be caught and handled appropriately in application code: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from gufe.storage.errors import MissingExternalResourceError | ||
|
|
||
| try: | ||
| with storage.load_stream("nonexistent_file.txt") as stream: | ||
| data = stream.read() | ||
| except MissingExternalResourceError: | ||
| print("File not found in storage") | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you were missing a word I think