
Request for comment: Scalable architecture  #309

@ergonlogic


Problem/Motivation

TUF metadata scalability is becoming an issue for us, both on the client side and for high-volume repositories.

Background

I'm the lead maintainer of the Rugged TUF Server. My work has primarily been sponsored by the Drupal Association, in an effort to implement TUF on their Composer repository of ~15,000 packages (~150,000 releases, 165K targets). TUF metadata for these packages currently weighs in at ~29 MB.

We are using hashed bins, with 2048 bins at the moment, though we're experimenting with performance at different bin counts. We have not (yet) implemented succinct hashed bins, as that only reduces the size of bins.json, which never changes in normal operation, and so only represents a relatively small metadata overhead.

We're aware of TAP 16 (https://github.com/theupdateframework/taps/blob/master/tap16.md) proposing Snapshot Merkle trees. This looks like it should address the problem of snapshot.json growing in line with the number of hashed bins. However, we do not believe that this will address the issues we're encountering (detailed below).

The maintainers of Packagist.org are interested in providing TUF coverage for their ~400,000 packages and over 4.5M releases (~6M targets). At their scale, they see peaks of upwards of 20 new releases per second. The delays imposed by consistent snapshots are also a non-starter for them.

Dependency resolution overhead

For our use case with Composer (client-side), each target is named like so: drupal/core-datetime/8.9.19.0, to keep releases distinct. We also sign the Composer metadata (composer.json) that accompanies each package (p2/drupal/core-datetime.json?).

When Composer is resolving dependencies, it must download many of these composer.json files. However, due to how hashed bins distribute targets, these files end up spread across many bins. As a result, a project that uses even a relatively small number of packages will likely need to maintain a significant number of bin_n targets metadata files, most of whose contents are irrelevant to the project.
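To illustrate, here is a rough sketch of how a project's direct dependencies scatter across bins. It assumes a TAP 15 style bit-prefix mapping and an illustrative bin_n naming scheme; Rugged's exact bin-assignment scheme may differ.

```python
import hashlib

NUM_BINS = 2048                          # the bin count mentioned above
BIT_LENGTH = NUM_BINS.bit_length() - 1   # 11 bits for 2048 bins

def bin_for_target(target_path: str) -> int:
    """Map a target path to a bin index using the leading BIT_LENGTH bits
    of the SHA-256 digest of the path (TAP 15 style; illustrative only)."""
    prefix = int.from_bytes(hashlib.sha256(target_path.encode("utf-8")).digest()[:4], "big")
    return prefix >> (32 - BIT_LENGTH)

# Even a handful of related packages land in unrelated bins, so the client
# ends up fetching and caching many distinct bin_n targets metadata files:
for path in [
    "drupal/core-datetime/8.9.19.0",
    "drupal/core-render/8.9.19.0",
    "symfony/console/5.4.0.0",
]:
    print(f"{path} -> bin_{bin_for_target(path):04d}")
```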

Even if we were to share locally-cached TUF metadata across multiple projects, it would still result in an almost complete copy of the entire TUF repository metadata.

Proposed solution

Instead of scaling up a single TUF repository, we propose scaling out to many smaller repositories (possibly one per package), sharing the same root metadata and signing keys.

Each TUF repository (of which there would be over 400k) would be very simple. Hashed bins would not be required, since each would only contain an average of 10-15 targets. There should never be enough targets to warrant hashed bins, since each repo only contains the releases of a single package. Even if it were required, we could implement hashed bins selectively, on a per-package-repo basis.
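As a rough illustration of how simple each per-package targets role could be, here is a sketch using python-tuf's Metadata API. It is purely illustrative (Rugged is not written in Python), and the target naming, expiry, and file paths are assumptions.

```python
from datetime import datetime, timedelta, timezone

from tuf.api.metadata import Metadata, TargetFile, Targets

def build_package_targets(package: str, releases: dict[str, str]) -> Metadata[Targets]:
    """Build a flat targets role for a single package: one TargetFile per
    release, with no hashed-bin delegations needed at 10-15 targets."""
    targets = Targets(
        version=1,
        spec_version="1.0.31",
        expires=datetime.now(timezone.utc) + timedelta(days=7),
        targets={},
    )
    for version, local_path in releases.items():
        target_name = f"{package}/{version}"
        targets.targets[target_name] = TargetFile.from_file(target_name, local_path)
    return Metadata(targets)

# md = build_package_targets(
#     "drupal/core-datetime", {"8.9.19.0": "dist/core-datetime-8.9.19.0.zip"})
# md.sign(online_targets_signer)  # signed by the repo's online targets key
# md.to_file("metadata/targets.json")
```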

From the client side, there would be the overhead of downloading timestamp.json and snapshot.json for each package in use, but both files would be very small. targets.json would scale with the number of releases. However, the client would never have to interact with TUF metadata for packages not used within its project.
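A minimal client-side sketch of that flow, assuming python-tuf's ngclient and a hypothetical per-package repository layout at https://repo.example.org/tuf/<vendor>/<package>/; all names and URLs here are illustrative:

```python
from pathlib import Path

from tuf.ngclient import Updater

def updater_for_package(package: str, cache_root: Path) -> Updater:
    """Build an Updater scoped to a single package's TUF repository. The
    trusted initial root.json for the package is assumed to already be in
    the metadata cache (see "Providing trusted root metadata" below)."""
    metadata_dir = cache_root / "metadata" / package
    targets_dir = cache_root / "targets" / package
    targets_dir.mkdir(parents=True, exist_ok=True)
    return Updater(
        metadata_dir=str(metadata_dir),
        metadata_base_url=f"https://repo.example.org/tuf/{package}/metadata/",
        target_base_url=f"https://repo.example.org/tuf/{package}/targets/",
        target_dir=str(targets_dir),
    )

# Only metadata for packages actually used in the project is ever touched:
updater = updater_for_package("drupal/core-datetime", Path("cache"))
updater.refresh()  # fetches this package's small timestamp/snapshot/targets
info = updater.get_targetinfo("drupal/core-datetime/8.9.19.0")
if info is not None:
    updater.download_target(info)
```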

This seems somewhat similar to the architecture of Notary, where each "collection" appears to be something like a stand-alone TUF repository.

This also appears to make parallelizing server-side operations much simpler, since it removes the issue of parallel processes trying to write and sign the same metadata. However, this may be specific to Rugged's architecture.

Root Metadata

We initially thought that we might be able to keep a single n.root.json file for all these repos, but that'd present problems when rotating online keys.

When rotating online keys, any metadata signed by the role whose key was rotated will need to be re-signed, which would take a non-trivial amount of time. As a result, we would want to be able to progressively roll out new root metadata (along with the re-signed metadata).

So we expect to need to keep each repo's root metadata separately, even if they'll all be the same (most of the time).

Mirrors.json

We've looked at mirrors.json as a potential way of implementing something similar to the above, insofar as it would allow effectively splitting a single repository into namespaces. However, snapshot.json would still be shared, so this does not appear to be a fruitful path.

Providing trusted root metadata

From §2.1.1. ("Root role"): (https://theupdateframework.github.io/specification/latest/#root):

The client-side of the framework MUST ship with trusted root keys for each configured repository.

Likewise, from §5.2 ("Load trusted root metadata"):

We assume that a good, trusted copy of this file was shipped with the package manager or software updater using an out-of-band process.

We cannot reasonably ship hundreds of thousands of root metadata files with the client-side implementation. With a per-package layout, that set would also need to be updated frequently, as new packages are added.

To provide trusted root metadata for all of these TUF repos, we envision a "meta-repository" that provides TUF signatures over these root metadata files. The client would thus ship with a single top-level root metadata file, while being able to download and verify the initial root metadata for each "sub-repository".

For a software repository the size of Packagist, this meta-repository could itself implement hashed bins, for performance reasons, as it would contain hundreds of thousands of targets (the initial root metadata files).
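A minimal sketch of the bootstrap flow we have in mind, again assuming python-tuf's ngclient; the meta-repository URL and the "<vendor>/<package>/1.root.json" target naming are assumptions, not settled design:

```python
from pathlib import Path

from tuf.ngclient import Updater

META_REPO = "https://repo.example.org/tuf-meta"  # illustrative URL

def fetch_initial_root(package: str, cache_root: Path) -> Path:
    """Download and verify a sub-repository's initial root metadata via the
    top-level meta-repository, then seed the per-package metadata cache."""
    meta_targets_dir = cache_root / "meta-targets"
    meta_targets_dir.mkdir(parents=True, exist_ok=True)
    meta_updater = Updater(
        metadata_dir=str(cache_root / "meta-metadata"),  # holds the single shipped root.json
        metadata_base_url=f"{META_REPO}/metadata/",
        target_base_url=f"{META_REPO}/targets/",
        target_dir=str(meta_targets_dir),
    )
    meta_updater.refresh()

    info = meta_updater.get_targetinfo(f"{package}/1.root.json")
    if info is None:
        raise RuntimeError(f"no initial root metadata published for {package}")
    verified_root = Path(meta_updater.download_target(info))

    # Seed the sub-repository's metadata cache with the verified root, so a
    # per-package Updater (as sketched above) can take over from here.
    pkg_metadata_dir = cache_root / "metadata" / package
    pkg_metadata_dir.mkdir(parents=True, exist_ok=True)
    root_path = pkg_metadata_dir / "root.json"
    root_path.write_bytes(verified_root.read_bytes())
    return root_path
```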

No global snapshot

Each sub-repo is a full TUF repo, providing timestamp and snapshot metadata. However, in this scenario, we no longer have a single view of all packages. The stated purpose of the top-level snapshot.json is to mitigate mix-and-match (and similar) attacks. However, the very existence of this file is at the crux of the scalability challenges we're observing (and anticipating).

We believe this layout remains reasonably resistant to such attacks. The top-level repo contains snapshot metadata covering each initial root metadata file, while each sub-repo contains its own snapshot metadata. If this is deemed insufficient, we could maintain all versioned root metadata (rather than just the initial root metadata) as targets of the top-level repo.

Our questions

  • Does the proposed architecture make sense? (e.g. is it still secure?)
  • Are there known alternatives for scaling to repositories of this size?
  • Would this require changes to the specification? (e.g. "Providing trusted root metadata", above)
    • Or is this just an implementation-specific client/repo architecture, and so not relevant to the broader community?
