Add automated Archivematica deployment #200

@liam-lloyd

Goal

We now have Archivematica generating thumbnails sufficiently well that we'd be willing to turn off the legacy thumbnail generation process and rely on Archivematica for this functionality. However, before we make Archivematica a load-bearing part of Permanent, we should have an automated, reproducible way to deploy Archivematica, rather than deploying it and updating its configuration manually as we have been so far.

Challenges

An automated, reproducible deployment system should be able to delete the existing instance of Archivematica and build it anew. This means we can't rely on state stored on the Archivematica server. This poses numerous challenges, because currently quite a lot of state is stored there.

  • State stored in configuration files can live in template files in this repo instead. This would include the vars_singlenode_<VERSION>.yml used by Ansible to set up Archivematica, some of the config files where environment variables are stored, and the systemd service files necessary to set up four MCP clients rather than just one. It also includes the XML file that defines the default processing configuration.
  • State is also stored in Archivematica's MySQL database. Much of this state is irrelevant to us, because it records information about records that have already been successfully processed. However, some of it concerns records that are in the middle of processing, and some of it records the configuration of our S3 and Backblaze locations.
  • When records are in the midst of processing, state about them and where they are in the pipeline is stored in the local filesystem.
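
As a rough illustration of the first bullet, the four MCP client service files could be rendered from a single template kept in this repo. This is a minimal sketch, not the service file Archivematica actually ships: the unit contents, the `archivematica-mcp-client-<N>.service` naming, and the paths passed in are all assumptions made for the example.

```python
from string import Template

# Hypothetical unit template. The real service file shipped with
# Archivematica may differ in user, paths, and options.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=Archivematica MCP client $index
After=network.target

[Service]
User=archivematica
EnvironmentFile=$env_file
ExecStart=$exec_start
Restart=always

[Install]
WantedBy=multi-user.target
""")


def render_mcp_client_units(count, env_file, exec_start):
    """Return {unit_filename: unit_text} for `count` MCP client services."""
    units = {}
    for index in range(1, count + 1):
        # Assumed naming scheme: one numbered unit per client.
        name = f"archivematica-mcp-client-{index}.service"
        units[name] = UNIT_TEMPLATE.substitute(
            index=index, env_file=env_file, exec_start=exec_start
        )
    return units


if __name__ == "__main__":
    units = render_mcp_client_units(
        count=4,
        env_file="/etc/default/archivematica-mcp-client",  # assumed path
        exec_start="/usr/local/bin/start-mcp-client",  # placeholder command
    )
    for name in sorted(units):
        print(name)
```

Keeping the client count as a parameter means the same template would cover one client or four, so the number of clients becomes a deploy-time variable rather than something edited by hand on the server.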

Another, somewhat related challenge is graceful shutdown/rolling deploys. Our existing PHP deployments have neither of these properties.

Possible Approaches

Keep Archivematica on a Single EC2 Instance

This is the closest to how existing deployments in this repo work. We can handle configuration files on the Archivematica instance via templates, as we do on existing deployments. We would need to write scripts that set up our S3 and Backblaze locations, either through the Archivematica API or by writing to the MySQL database directly. We would need to accept that Archivematica would forget about previously processed records every time it's deployed (the way we use Archivematica makes this perfectly valid, but it could make certain new features harder to build in the future). We would also need to find a way to get Terraform to do a rolling deploy to avoid downtime during deploys (this is probably easy), and a way to keep the old instance from shutting down before it has finished all its work (this might be hard).
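
One possible shape for the "don't shut down until the work is done" step is a drain loop that the deploy runs against the old instance before terminating it. The sketch below is only that: `count_active` is a hypothetical callable standing in for whatever check we end up with (an API call, a database query, or a look at the processing directories), and the timeout and polling interval are placeholder values.

```python
import time


def wait_until_drained(count_active, timeout_s, poll_interval_s=30.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until count_active() reports 0 or the timeout passes.

    count_active is a hypothetical callable returning the number of
    records still in the pipeline on the old instance.
    Returns True if the instance drained, False if we gave up.
    """
    deadline = clock() + timeout_s
    while True:
        if count_active() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated instance that finishes its work after two polls.
    remaining = iter([3, 1, 0])
    drained = wait_until_drained(
        lambda: next(remaining), timeout_s=60, poll_interval_s=0
    )
    print("safe to terminate" if drained else "timed out, still busy")
```

This only helps if the old instance has already stopped accepting new work by the time the loop starts. Otherwise the count may never reach zero, which is part of why this step might be hard.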

Run Archivematica on an EC2 Instance and its Database on RDS

This would work much like the previous option, but state in the database would persist across deploys. This would avoid the complexity of having to reconfigure our S3 and Backblaze locations on every deploy, and avoid cutting off future opportunities by deleting Archivematica's memory of previous uploads each time. It might make graceful shutdown easier but probably won't, because of the state stored in the filesystem during processing.

Run Archivematica on Kubernetes (and its Database on RDS, probably)

This is where we want to end up eventually, as it will allow most of the configuration to be expressed as code in a more straightforward way, and it will allow us to be much more flexible about scaling (in particular, it would open up easy ways to autoscale the number of MCP clients). However, there are significantly more unknowns on this path. Artefactual provides an Ansible script for deploying Archivematica on a single server, but it doesn't provide an analogous thing for deploying on Kubernetes (although there is a repo from CERN that could be used for guidance).
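
To make the autoscaling idea concrete, here is a small sketch of the replica calculation that Kubernetes' Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), clamped to a minimum and maximum. The metric itself (for example, pending jobs per MCP client) and the bounds are assumptions for illustration. Nothing here is an existing Archivematica metric, and the real autoscaler also applies a tolerance and stabilization window that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=8):
    """Replica count from the documented HPA ratio, clamped to bounds.

    current_metric and target_metric are per-replica values of a
    hypothetical metric such as pending jobs per MCP client.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four clients, each with twice the target backlog: scale up.
    print(desired_replicas(4, current_metric=10, target_metric=5))
    # Four clients with half the target backlog: scale down.
    print(desired_replicas(4, current_metric=2.5, target_metric=5))
```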
