WorkingMemory storage management #160

@lchoquel

Description

Before submitting

  • I've searched open issues and found no similar request
  • I'm willing to start a discussion or contribute code

Problem / motivation

Currently, most Stuff data is stored in memory. This is the case for all objects except a few classes such as ImageContent, which can hold either bytes or a URL.
This is a problem for several reasons, but the main one is that Pipelex workflows should be usable in a lean, orchestrated way. The logic and tracking of the workflow (concepts passed from pipe to pipe) should be as lightweight as possible. Pipelex must be able to work with potentially very large data, such as videos, and passing such data in and out of pipes is not appropriate.

Proposed solution

We have already added a StorageProvider to Pipelex's main singleton. It can be customized through dependency injection by providing a class that implements the StorageProviderAbstract interface, which has only two methods:

from abc import ABC, abstractmethod

class StorageProviderAbstract(ABC):
    @abstractmethod
    def load(self, uri: str) -> bytes:
        """Return the bytes previously stored under this URI."""
        ...

    @abstractmethod
    def store(self, data: bytes) -> str:
        """Persist the data and return a URI that can later be traded back for it."""
        ...

The planned feature consists in using the active storage provider, available from pipelex.hub via get_storage_provider(), to systematically load/store the inputs/outputs of pipes. This means every StuffContent's data should be substitutable by a URI.
The concrete implementation of StorageProviderAbstract is responsible for defining the URI scheme and trading a URI for data. For instance, it could use local storage with a file path, or online blob storage such as Amazon S3 or Google Cloud Storage, using the bucket/blob_id as the URI.
Note: for structured objects (BaseModels), serialization/deserialization will be the responsibility of our other open-source library, Kajson, which is already a dependency of Pipelex.
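To illustrate the intended flow for structured objects, here is a sketch of the serialize-store / load-deserialize round-trip. The stdlib json module stands in for Kajson, and SimpleStore is an invented stand-in for the active provider obtained from pipelex.hub; neither reflects the real API.

```python
import json
import uuid


class SimpleStore:
    """Minimal stand-in for a storage provider: bytes in, URI out."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def store(self, data: bytes) -> str:
        uri = f"mem://{uuid.uuid4().hex}"
        self._blobs[uri] = data
        return uri

    def load(self, uri: str) -> bytes:
        return self._blobs[uri]


store = SimpleStore()

# Leaving a pipe: serialize the structured output, keep only the URI.
stuff = {"concept": "Invoice", "total": 42.0}
uri = store.store(json.dumps(stuff).encode("utf-8"))

# Entering the next pipe: trade the URI back for the object.
restored = json.loads(store.load(uri).decode("utf-8"))
```

In the real feature, Kajson would replace json here so that full Pydantic BaseModel instances round-trip, not just plain dicts.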

Obviously, this adds load/store overhead on entering/leaving every call to run_pipe(). But in many cases this will be faster than inference, and in any case it is the path to durable, resilient workflows. That said, the first version of this feature should be based on an in-memory implementation of StorageProviderAbstract, to avoid the overhead and enable testing the load/store logic with minimal dependencies. The second implementation should use local storage, which makes it easy and natural to use or dispose of any generated stuff at the end of the pipeline.
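A minimal sketch of what that first, in-memory implementation might look like (InMemoryStorageProvider, the mem:// scheme, and the clear() helper are illustrative assumptions, not the shipped design):

```python
import uuid
from abc import ABC, abstractmethod


class StorageProviderAbstract(ABC):
    @abstractmethod
    def load(self, uri: str) -> bytes: ...

    @abstractmethod
    def store(self, data: bytes) -> str: ...


class InMemoryStorageProvider(StorageProviderAbstract):
    """Keeps blobs in a dict, so the load/store logic can be exercised
    in tests without touching disk or network."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def store(self, data: bytes) -> str:
        uri = f"mem://{uuid.uuid4().hex}"
        self._blobs[uri] = data
        return uri

    def load(self, uri: str) -> bytes:
        return self._blobs[uri]

    def clear(self) -> None:
        # Dispose of everything generated during a pipeline run.
        self._blobs.clear()
```

Since it holds everything in process memory, this variant gives back the very overhead the feature is meant to remove for large data, which is why it is only a stepping stone toward the local-storage and blob-storage providers.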

Alternatives considered

No response

Would you like to help implement this feature?

None
