Description
Before submitting
- I've searched open issues and found no similar request
- I'm willing to start a discussion or contribute code
Problem / motivation
Currently, most of the Stuff data is stored in memory. This is the case for all objects except a few classes, such as ImageContent, which can hold either bytes or a URL.
This is a problem for several reasons, but the main one is that Pipelex workflows should be usable in a lean, orchestrated way: the logic and tracking of the workflow (concepts passed from pipe to pipe) should be as lightweight as possible. Pipelex must be able to work with potentially very large data, such as videos, and passing such payloads in and out of pipes is not appropriate.
Proposed solution
We have already added a StorageProvider to Pipelex's main singleton. It can be customized using dependency injection by providing a class that implements the StorageProviderAbstract interface, which has only two methods:
```python
from abc import ABC, abstractmethod

class StorageProviderAbstract(ABC):
    @abstractmethod
    def load(self, uri: str) -> bytes:
        pass

    @abstractmethod
    def store(self, data: bytes) -> str:
        pass
```
The planned feature consists of using the active storage provider, available from pipelex.hub via get_storage_provider(), to systematically load/store the inputs/outputs of pipes. This means every StuffContent's data should be substitutable with a URI.
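To make the intent concrete, here is a minimal sketch of what that wrapping could look like. Only `get_storage_provider()` (from `pipelex.hub`) comes from the description above; the `data`/`uri` attributes and the helper names are hypothetical, for illustration only:

```python
from pipelex.hub import get_storage_provider

def offload(content) -> None:
    """Hypothetical helper: swap a StuffContent's in-memory bytes
    for a URI right after a pipe produces it."""
    storage = get_storage_provider()
    if content.data is not None:  # hypothetical `data` attribute
        content.uri = storage.store(content.data)
        content.data = None  # the heavy payload leaves memory

def resolve(content) -> bytes:
    """Hypothetical helper: trade the URI back for bytes right
    before a pipe consumes the content as input."""
    storage = get_storage_provider()
    if content.data is None:
        content.data = storage.load(content.uri)  # hypothetical `uri` attribute
    return content.data
```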
The concrete implementation of StorageProviderAbstract will be responsible for defining the URI and trading it for data. For instance, it could use local storage with a file path as the URI, or online storage such as Amazon S3 or Google Cloud Storage, using the bucket/blob_id as the URI.
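As a sketch of one possible local implementation (using file paths as URIs; the content-addressed naming is my assumption, not a spec):

```python
import hashlib
from pathlib import Path

class LocalStorageProvider(StorageProviderAbstract):  # interface defined above
    """Stores blobs as files under a base directory and uses the
    file path as the URI."""

    def __init__(self, base_dir: str) -> None:
        self._base_dir = Path(base_dir)
        self._base_dir.mkdir(parents=True, exist_ok=True)

    def store(self, data: bytes) -> str:
        # Content-address the blob so identical data maps to the same URI
        blob_path = self._base_dir / hashlib.sha256(data).hexdigest()
        blob_path.write_bytes(data)
        return str(blob_path)

    def load(self, uri: str) -> bytes:
        return Path(uri).read_bytes()
```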
Note: for structured objects (BaseModels), serialization/deserialization will be the responsibility of our other open-source library, Kajson, which is already a dependency of Pipelex.
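As an illustration (not the final API), a structured content could round-trip through the storage provider via Kajson's json-style dumps/loads, which preserve class information. The TextContent model and the helper names are made up for this example:

```python
import kajson
from pydantic import BaseModel
from pipelex.hub import get_storage_provider

class TextContent(BaseModel):  # illustrative structured content
    text: str

def store_model(model: BaseModel) -> str:
    # Kajson serializes the BaseModel (including class info) to JSON text
    return get_storage_provider().store(kajson.dumps(model).encode("utf-8"))

def load_model(uri: str) -> BaseModel:
    # ...and deserializes it back to the original class on the way in
    return kajson.loads(get_storage_provider().load(uri).decode("utf-8"))
```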
Obviously, this will add load/store overhead when entering and leaving every call to run_pipe(). But in many cases this overhead will be small compared to inference, and in any case it is the path to durable, resilient workflows. That said, the first version of this feature should be based on an in-memory StorageProviderAbstract implementation, to avoid the overhead and enable testing the load/store logic with minimum dependencies. The second step should be local storage, which will make it easy and natural to use or dispose of any generated stuff at the end of the pipeline.
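For that first version, the in-memory provider could be as simple as the following sketch (the memory:// URI scheme is an assumption):

```python
import uuid

class InMemoryStorageProvider(StorageProviderAbstract):  # interface defined above
    """Keeps blobs in a dict; useful for testing the load/store
    logic with minimum dependencies."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def store(self, data: bytes) -> str:
        uri = f"memory://{uuid.uuid4().hex}"  # assumed URI scheme
        self._blobs[uri] = data
        return uri

    def load(self, uri: str) -> bytes:
        return self._blobs[uri]
```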
Alternatives considered
No response
Would you like to help implement this feature?
None