Description
@AGENTS.md
Implement crash-restart and persistence in context-runtime and in the context transfer engine (CTE), via the following changes:
Admin::Compose
In addition to the compose key, we should have a new key called restart.
We should store a record of all the services to start automatically in a directory configured in the chimaera conf. The configuration key should be called conf_dir and default to $HOME/.chimaera. When we do a cmake build (not install), we should automatically create this directory for users.
During the Compose function, when iterating over a compose file, if restart: true is set (the default is false), a copy of the compose file will be placed in the conf_dir/restart directory. The directory is created if it does not exist.
At the end of Chimaera::ServerInit, we will launch a new task called Admin::RestartContainers. This task will iterate over every compose file in conf_dir/restart. For each restartable container, it will create a pool and automatically restore the PoolId. The ContainerId will be recalculated along with the domain tables. Then the container->Restart function will be called to fully restore the state.
chimaera_compose --unregister [compose-file]
This will unregister all services in the compose file.
Container::Restart
This should be a new virtual method on the Container class.
This implements the ability to restart a system after a runtime crash.
This will read all metadata from the persistent metadata log and iterate until the metadata table is reconstructed.
The path to the metadata log should be specified as a new parameter in the CTE configuration.
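To make the replay idea concrete, here is a small C++17 sketch. Everything in it is an assumption for illustration: the `ContainerSketch` base class only mimics the shape of a virtual Restart method, and the text log format (`PUT <name> <size>` and `DEL <name>`, one record per line) is invented here and is not the actual CTE log format.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for the real Container class; only the virtual hook is modeled.
struct ContainerSketch {
  virtual ~ContainerSketch() = default;
  virtual bool Restart() { return true; }  // default: nothing to restore
};

// Toy CTE container that rebuilds its blob table from a metadata log.
class CteSketch : public ContainerSketch {
 public:
  explicit CteSketch(std::string log_path) : log_path_(std::move(log_path)) {}

  // Replay every record in order until the table is reconstructed.
  bool Restart() override {
    std::ifstream in(log_path_);
    if (!in) return false;  // no log available, cannot restore
    blobs_.clear();
    std::string line;
    while (std::getline(in, line)) {
      std::istringstream rec(line);
      std::string op, name;
      std::size_t size = 0;
      if (!(rec >> op >> name)) continue;  // skip blank or truncated lines
      if (op == "PUT" && (rec >> size)) {
        blobs_[name] = size;  // later records override earlier ones
      } else if (op == "DEL") {
        blobs_.erase(name);
      }
    }
    return true;
  }

  std::unordered_map<std::string, std::size_t> blobs_;

 private:
  std::string log_path_;
};
```

Replaying in order means the last record for a given name wins, which is what lets an append-only log stand in for the table.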
FlushMetadata
This will look at the tag and blob tables and update a persistent log of metadata changes. It should only store metadata records for entries that have changed. Every time a BlobInfo or TagInfo gets modified, a counter on it should be incremented. In addition, there should be another counter that stores the last time a FlushMetadata occurred on the data structure.
If the metadata log reaches a configurable maximum size, a snapshot of the current metadata table will be taken and placed in a new file. The old one will be destroyed.
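A minimal C++17 sketch of the snapshot step follows, under several assumptions: the table is modeled as a plain `std::map` from name to size, the record format is an invented `PUT <name> <size>` line, and `CompactIfNeeded` is a hypothetical helper. The snapshot is written to a new file first and then renamed over the old log, so the old log is only destroyed once the snapshot exists.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: if the log has reached max_bytes, replace it with a
// snapshot of the current table. Returns true if a snapshot was taken.
inline bool CompactIfNeeded(
    const fs::path &log_path, std::uintmax_t max_bytes,
    const std::map<std::string, std::uint64_t> &table) {
  std::error_code ec;
  const std::uintmax_t size = fs::file_size(log_path, ec);
  if (ec || size < max_bytes) return false;  // missing or still small enough

  fs::path snapshot = log_path;
  snapshot += ".snapshot";  // the new file that receives the snapshot
  {
    std::ofstream out(snapshot, std::ios::trunc);
    for (const auto &[name, blob_size] : table) {
      out << "PUT " << name << ' ' << blob_size << '\n';
    }
  }  // stream is closed here, before the rename
  fs::rename(snapshot, log_path);  // replaces (destroys) the old log
  return true;
}
```

A production version would likely also sync the snapshot to disk before the rename. That is omitted here to keep the sketch portable.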
FlushData
This will flush data from volatile storage to persistent targets. There are 3 categories of storage targets:
- (1) Volatile
- (2) Temporary-Nonvolatile
- (3) Long-Term
The task should take as input the following:
- The level of flushing: (1) means flush only volatile data; (2) means flush both volatile and temporary data.
The CTE configuration should be updated to have the following:
- flush_data_period: how frequently to flush volatile data to temporary-nonvolatile (2) or long-term (3) targets.
There should be one async task spawned during Create for FlushData with level (1) if metadata logging is enabled in the CTE configuration (that is, the metadata log path is a non-empty string).
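The level semantics can be sketched as a small predicate in C++. The enum `Persistence`, the struct `TargetSketch`, and both function names are hypothetical; the numeric values simply mirror the order of the three categories listed above. Long-term targets are treated as flush destinations and are never flushed themselves, which is an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical enum mirroring the three storage categories, in order.
enum class Persistence { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Hypothetical, simplified stand-in for a storage target.
struct TargetSketch {
  std::string name;
  Persistence persistence;
};

// Level 1 flushes only volatile targets; level 2 flushes volatile and
// temporary targets. Long-term targets are never a flush source.
inline bool ShouldFlush(Persistence p, int level) {
  return p != Persistence::kLongTerm && static_cast<int>(p) <= level;
}

// Collect the names of the targets a FlushData task at `level` would drain.
inline std::vector<std::string> TargetsToFlush(
    const std::vector<TargetSketch> &targets, int level) {
  std::vector<std::string> names;
  for (const auto &target : targets) {
    if (ShouldFlush(target.persistence, level)) names.push_back(target.name);
  }
  return names;
}
```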
PutBlob
In addition to the score, the Context should take the minimum persistence level as input. Targets that do not meet this threshold are filtered out.
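The filtering step might look like the C++ sketch below. The names `Persistence`, `TargetSketch`, and `FilterByPersistence` are hypothetical, and the sketch assumes the persistence levels are ordered volatile < temporary < long-term so that "meets the threshold" means "at least as persistent as requested".

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical ordered persistence levels: higher means more durable.
enum class Persistence { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Hypothetical, simplified stand-in for a registered target.
struct TargetSketch {
  std::string name;
  Persistence persistence;
  float score;
};

// Keep only the targets at least as persistent as min_persistence.
// Placement by score would then run on the returned candidates.
inline std::vector<TargetSketch> FilterByPersistence(
    const std::vector<TargetSketch> &targets, Persistence min_persistence) {
  std::vector<TargetSketch> kept;
  std::copy_if(targets.begin(), targets.end(), std::back_inserter(kept),
               [min_persistence](const TargetSketch &t) {
                 return t.persistence >= min_persistence;
               });
  return kept;
}
```

A caller also has to decide what happens when the filter leaves no candidates. The description does not say, so this sketch simply returns an empty list.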
BDEV
We should update the bdev configuration to support specifying the persistence (volatile, temporary, long-term). By default, DRAM is volatile and everything else is long-term unless otherwise configured. This should be a feature of the bdev, not the target. The RegisterTarget function should still be able to query this information, though.
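The defaulting rule can be sketched in C++17 as follows. The configuration strings ("volatile", "temporary", "long-term") and the function names are assumptions made for this example; the actual bdev configuration keys and values are not defined by this issue.

```cpp
#include <optional>
#include <string>

// Hypothetical enum for the three persistence categories.
enum class Persistence { kVolatile = 1, kTemporary = 2, kLongTerm = 3 };

// Parse a configured value; the accepted spellings are assumed here.
inline std::optional<Persistence> ParsePersistence(const std::string &value) {
  if (value == "volatile") return Persistence::kVolatile;
  if (value == "temporary") return Persistence::kTemporary;
  if (value == "long-term") return Persistence::kLongTerm;
  return std::nullopt;  // empty or unrecognized
}

// Use the configured value when present; otherwise DRAM defaults to
// volatile and every other device defaults to long-term.
inline Persistence ResolvePersistence(bool is_dram,
                                      const std::string &configured) {
  if (auto parsed = ParsePersistence(configured)) return *parsed;
  return is_dram ? Persistence::kVolatile : Persistence::kLongTerm;
}
```

In this sketch an unrecognized string silently falls back to the default. Rejecting it with a configuration error is an equally reasonable choice.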
Unit Testing
We should add the following integration test named restart:
- Make a chimaera_compose file with restart: true that launches bdev (ram) + cte
- Run WRP_RUNTIME_CONF=chimaera_compose.yaml chimaera_start_runtime
- Put 10 blobs into the CTE. Call FlushMetadata and FlushData.
- Shut down the runtime
- Restart the runtime with WRP_RUNTIME_CONF unset.
- Check that the 10 blobs exist