Add persistence to iowarp #212

@lukemartinlogan

@AGENTS.md

Implement crash-restart and persistence in context-runtime and in the context transfer engine (CTE), with the following changes:

Admin::Compose

In addition to the compose key, we should have a new key called restart.
We should store a log of all the services to start automatically in a directory configured in the chimaera conf. This setting should be called conf_dir. By default, it will be $HOME/.chimaera. The cmake build step (not install) should create this directory for users automatically.

During the Compose function, when iterating over a compose file, if restart: true is set (the default is false), a copy of the compose file will be placed in the conf_dir/restart directory. The directory is created if it does not exist.

At the end of Chimaera::ServerInit, we will launch a new task called Admin::RestartContainers. This task will iterate over the restartable containers, that is, over every file in conf_dir/restart. For each one, it will create a pool and automatically restore the PoolId. The ContainerId will be recalculated along with the domain tables. Then the container->Restart function will be called to fully restore the state.
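The copy made during Compose and the scan made at ServerInit could look roughly like the sketch below. It only covers the file handling. The names SaveForRestart and LoadRestartFiles are hypothetical and not part of the existing Admin API, and pool creation and PoolId restoration are left out.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: when `restart` is true, copy `compose_file` into
// <conf_dir>/restart and return the destination path. When `restart` is
// false, nothing is copied and an empty path is returned.
inline fs::path SaveForRestart(const fs::path &conf_dir,
                               const fs::path &compose_file, bool restart) {
  if (!restart) return {};
  assert(fs::is_regular_file(compose_file));
  fs::path restart_dir = conf_dir / "restart";
  fs::create_directories(restart_dir);  // no-op if it already exists
  fs::path dest = restart_dir / compose_file.filename();
  fs::copy_file(compose_file, dest, fs::copy_options::overwrite_existing);
  return dest;
}

// Hypothetical helper: read every saved compose file back, keyed by file
// name. std::map keeps the keys sorted, so the restart order is stable.
inline std::map<std::string, std::string> LoadRestartFiles(
    const fs::path &conf_dir) {
  std::map<std::string, std::string> files;
  fs::path restart_dir = conf_dir / "restart";
  if (!fs::is_directory(restart_dir)) return files;  // nothing to restore
  for (const auto &entry : fs::directory_iterator(restart_dir)) {
    if (!entry.is_regular_file()) continue;
    std::ifstream in(entry.path());
    std::ostringstream text;
    text << in.rdbuf();
    files[entry.path().filename().string()] = text.str();
  }
  return files;
}
```

Keying the saved copy by file name means that composing the same file twice overwrites the earlier copy instead of registering the service twice. Whether that is the desired behavior is an open design choice.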

chimaera_compose --unregister [compose-file]

This will unregister all services in the compose file.

Container::Restart

This should be a new virtual method on the Container class.
It implements the ability to restart a system after a runtime crash.
In the CTE, it will read all records from the persistent metadata log, iterating until the metadata table is reconstructed.
The path to the metadata log should be specified as a new parameter in the CTE configuration.
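As a rough illustration of the replay, the sketch below adds a virtual Restart to a stand-in Container class and rebuilds a table from a log stream. The class shapes, the Restart signature, and the one-record-per-line text format ("put name size" or "del name 0") are all assumptions made for this example, since the log format is not specified here.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Stand-in for the runtime's Container base class (hypothetical shape).
class Container {
 public:
  virtual ~Container() = default;
  // Default: a stateless container has nothing to restore.
  virtual void Restart(std::istream &metadata_log) { (void)metadata_log; }
};

// Stand-in for a CTE-like container that owns a blob metadata table.
class BlobContainer : public Container {
 public:
  // Replay the log from the start; later records override earlier ones.
  void Restart(std::istream &metadata_log) override {
    blobs_.clear();
    std::string op, name;
    std::uint64_t size = 0;
    while (metadata_log >> op >> name >> size) {
      if (op == "put") {
        blobs_[name] = size;
      } else if (op == "del") {
        blobs_.erase(name);
      }
    }
  }

  const std::map<std::string, std::uint64_t> &blobs() const { return blobs_; }

 private:
  std::map<std::string, std::uint64_t> blobs_;
};
```

A caller could pass a std::ifstream opened on the configured log path, or a std::istringstream in a unit test.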

FlushMetadata

This will look at the tag and blob tables and update a persistent log of metadata changes. It should store metadata records only for entries that have changed. Every time a BlobInfo or TagInfo is modified, a counter on it should be incremented. In addition, there should be another counter that records the last time a FlushMetadata occurred on the data structure.
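One way to read the two counters is sketched below: a modification counter, plus the value that counter had at the last flush, so an entry is dirty exactly when the two differ. MetaEntry and MetaTable are hypothetical stand-ins for BlobInfo/TagInfo and their tables, and this reading of the "last flush" counter is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for BlobInfo / TagInfo.
struct MetaEntry {
  std::uint64_t size = 0;
  std::uint64_t mod_count = 0;      // incremented on every modification
  std::uint64_t flushed_count = 0;  // value of mod_count at the last flush
  bool Dirty() const { return mod_count != flushed_count; }
};

class MetaTable {
 public:
  void Update(const std::string &name, std::uint64_t size) {
    MetaEntry &entry = table_[name];
    entry.size = size;
    ++entry.mod_count;
  }

  // Returns the names of the entries that would be written by this flush.
  // A real implementation would append one log record per dirty entry.
  std::vector<std::string> FlushMetadata() {
    std::vector<std::string> written;
    for (auto &[name, entry] : table_) {
      assert(entry.flushed_count <= entry.mod_count);
      if (!entry.Dirty()) continue;
      written.push_back(name);
      entry.flushed_count = entry.mod_count;
    }
    return written;
  }

 private:
  std::unordered_map<std::string, MetaEntry> table_;
};
```

With this scheme, a flush that finds no dirty entries writes nothing, so a periodic FlushMetadata costs only a table scan when the system is idle.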

If the metadata log reaches a configurable maximum size, a snapshot of the current metadata table will be taken and placed in a new file. The old one will be destroyed.
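The snapshot step can be pictured with an in-memory model, sketched below. It assumes the maximum size is counted in records and that a record is a "name size" string. Both are simplifications for illustration; a real log would live in a file and would likely be capped in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model of the metadata log plus the live table.
class MetadataLog {
 public:
  explicit MetadataLog(std::size_t max_records) : max_records_(max_records) {
    assert(max_records_ > 0);
  }

  // Record a change; compact once the log reaches the configured cap.
  void Append(const std::string &name, std::uint64_t size) {
    table_[name] = size;
    records_.push_back(name + " " + std::to_string(size));
    if (records_.size() >= max_records_) Compact();
  }

  const std::vector<std::string> &records() const { return records_; }

 private:
  // Replace the log with one record per live entry (the snapshot); the
  // old records are discarded when `snapshot` goes out of scope.
  void Compact() {
    std::vector<std::string> snapshot;
    for (const auto &[name, size] : table_) {
      snapshot.push_back(name + " " + std::to_string(size));
    }
    records_.swap(snapshot);
  }

  std::size_t max_records_;
  std::vector<std::string> records_;
  std::map<std::string, std::uint64_t> table_;
};
```

On disk, the same idea would typically write the snapshot to a new file first and only then remove the old one, so a crash in between never leaves the system without a valid log.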

FlushData

This will flush data from volatile storage to persistent targets. There are 3 categories of storage targets:

  1. Volatile
  2. Temporary-Nonvolatile
  3. Long-Term

The task should take as input the following:

  1. The level of flushing: (1) flushes only volatile data; (2) flushes both volatile and temporary data.

The CTE configuration should be updated to have the following:

  1. flush_data_period: how frequently to flush volatile data to (2) or (3).

One async FlushData task with level (1) should be spawned during Create if metadata logging is enabled in the CTE configuration (that is, the metadata log path is a non-empty string).
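The flush levels map naturally onto an ordered enum. The sketch below selects which targets a FlushData call at a given level would drain. The Persistence enum values, the Target struct, and TargetsToFlush are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The three storage categories, ordered from least to most persistent.
enum class Persistence : int { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Hypothetical, minimal view of a storage target.
struct Target {
  std::string name;
  Persistence persistence;
};

// Level 1 drains volatile targets only; level 2 drains volatile and
// temporary targets. Long-term targets are never a flush source.
inline std::vector<std::string> TargetsToFlush(
    const std::vector<Target> &targets, int level) {
  assert(level == 1 || level == 2);
  std::vector<std::string> names;
  for (const Target &target : targets) {
    if (static_cast<int>(target.persistence) <= level) {
      names.push_back(target.name);
    }
  }
  return names;
}
```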

PutBlob

In addition to score, the Context should take as input the minimum persistence target. We filter out targets that do not meet the threshold.
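A minimal sketch of the filter, assuming the persistence categories are ordered volatile < temporary < long-term. The Context and Target structs here are simplified, hypothetical stand-ins rather than the real CTE types, and min_persistence is an assumed field name.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

enum class Persistence : int { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Hypothetical, minimal view of a registered target.
struct Target {
  std::string name;
  Persistence persistence;
};

// Hypothetical slice of the PutBlob context: the existing score plus the
// new minimum persistence threshold.
struct Context {
  float score = 1.0f;
  Persistence min_persistence = Persistence::kVolatile;
};

// Keep only the targets at least as persistent as the context requires.
inline std::vector<Target> FilterTargets(const std::vector<Target> &targets,
                                         const Context &ctx) {
  std::vector<Target> eligible;
  std::copy_if(targets.begin(), targets.end(), std::back_inserter(eligible),
               [&ctx](const Target &target) {
                 return target.persistence >= ctx.min_persistence;
               });
  return eligible;
}
```

Defaulting min_persistence to volatile keeps existing PutBlob callers unaffected, since every target passes the filter.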

BDEV

We should update the bdev configuration to support specifying the persistence (volatile, temporary, long-term). By default, DRAM is volatile and everything else is long-term unless otherwise configured. This should be a property of the bdev, not the target. The RegisterTarget function should still be able to query this information, though.
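The default rule can be expressed as a small resolver, sketched below. The BdevType enum, the function names, and the exact configuration strings ("volatile", "temporary", "long-term") are assumptions for illustration and may not match the real bdev configuration schema.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class Persistence : int { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Hypothetical bdev kinds; the real bdev module may distinguish more.
enum class BdevType { kRam, kFile };

// Parse the configured persistence string; empty optional if unrecognized.
inline std::optional<Persistence> ParsePersistence(const std::string &text) {
  if (text == "volatile") return Persistence::kVolatile;
  if (text == "temporary") return Persistence::kTemporary;
  if (text == "long-term") return Persistence::kLongTerm;
  return std::nullopt;
}

// An explicit setting wins; otherwise DRAM is volatile and every other
// device type is treated as long-term.
inline Persistence ResolvePersistence(BdevType type,
                                      std::optional<Persistence> configured) {
  if (configured.has_value()) return *configured;
  return type == BdevType::kRam ? Persistence::kVolatile
                                : Persistence::kLongTerm;
}
```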

Unit Testing

We should add the following integration test named restart:

  1. Make a chimaera_compose file with restart: true that launches bdev (ram) + cte
  2. Run WRP_RUNTIME_CONF=chimaera_compose.yaml chimaera_start_runtime
  3. Put 10 blobs into the CTE. Call FlushMetadata and FlushData.
  4. Shut down the runtime
  5. Restart the runtime with WRP_RUNTIME_CONF unset.
  6. Check that the 10 blobs exist
