Conversation

allisonkarlitskaya (Collaborator) commented Apr 21, 2025

For images with a large number of layers (as is the case for most bootc images) this is a massive win. I used this as a test:

skopeo copy docker://quay.io/fedora/fedora-silverblue:42 oci:$HOME/oci/silverblue-42

and then

cfsctl oci pull oci:$HOME/oci/silverblue-42

and noticed an improvement from ~90s to ~9s.

Unfortunately the situation isn't much improved for images with a small number of layers (as is the case in our own examples/ directory).

allisonkarlitskaya (Author) commented:

Run cargo publish --dry-run
    Updating crates.io index
warning: crate composefs@0.2.0 already exists on crates.io index
error: all dependencies must have a version specified when publishing.
dependency `containers-image-proxy` does not specify a version
Note: The published dependency will use the version from crates.io,
the `git` specification will be removed from the dependency declaration.
Error: Process completed with exit code 101.

Hm, tricky. Looks like we might need to get bootc-dev/containers-image-proxy-rs#78 released, or find another way?
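
For reference, the shape cargo's error message points at is to keep the `git` source but add a `version` alongside it; that only works once a matching release exists on crates.io, hence wanting #78 released first. A sketch with a placeholder version:

    [dependencies]
    # `cargo publish` strips the `git` part and keeps only `version`
    # (the version shown here is a placeholder, not a real release)
    containers-image-proxy = { git = "https://github.com/bootc-dev/containers-image-proxy-rs", version = "0.99" }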

allisonkarlitskaya (Author) commented Apr 22, 2025

An earlier version of this PR had an `objects: OnceCell<Arc<OwnedFd>>` field and a separate internal `ensure_objects_in_dir()` method that took an `Arc<OwnedFd>` for the objects directory and did its work inside of that... I wonder if this approach is nicer in general (although it doesn't help solve the weird lifetime issues associated with the stream writer or the oci imageop).

This is somewhere where my lack of experience is definitely showing...
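
A rough sketch of what that earlier shape might have looked like (everything here except the `ensure_objects_in_dir()` name is a guess, not the actual code):

    use std::os::fd::OwnedFd;
    use std::sync::Arc;

    use once_cell::sync::OnceCell;

    struct Repository {
        repository: OwnedFd,
        // the objects directory, opened lazily and shared by refcount
        objects: OnceCell<Arc<OwnedFd>>,
    }

    impl Repository {
        // internal helper: all work happens relative to the passed-in fd,
        // so a clone of the Arc can move into a worker without borrowing self
        fn ensure_objects_in_dir(objects: &Arc<OwnedFd>, data: &[u8]) -> anyhow::Result<()> {
            todo!()
        }
    }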

When we already have a layer, we print a message saying as much.  In
fb1de9f ("src/fsverity: up our FsVerityHashValue trait game") we
accidentally changed this to use the `Debug` trait.  This would have
been correct if we were printing a layer ID, but this is an array, so
doing that prints the raw bytes.  Revert to the previous behaviour.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Rename our `Repository::object_dir()` accessor to `.objects_dir()` (the
directory is called `"objects"`) and change its implementation to use a
OnceCell.  ostree uses this "initialize on first use" pattern for some
of its directories as well, and I like it.

We use the external `once_cell` crate (which we already had as a
dev-dependency) because the fallible (`try_`) version of the API is not
yet stable in the standard library.

Clean up our ensure_object() implementation to use `.objects_dir()` and
generally make things a bit more robust:
 - we avoid unnecessary mkdir() calls for directories which almost
   certainly already exist
 - instead of checking merely for the existence of an object file with
   the correct name, we actually measure it now
 - we play around less with joining pathnames (and we can drop a
   now-unused trait helper method on FsVerityHashValue)

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
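
A minimal sketch of the initialize-on-first-use accessor, assuming `once_cell::sync::OnceCell` and its fallible `get_or_try_init()` (the actual open/create logic is elided):

    use std::os::fd::OwnedFd;

    use anyhow::Result;
    use once_cell::sync::OnceCell;

    struct Repository {
        repository: OwnedFd,
        objects: OnceCell<OwnedFd>,
    }

    impl Repository {
        fn objects_dir(&self) -> Result<&OwnedFd> {
            self.objects.get_or_try_init(|| {
                // the first caller opens (or creates) "objects" under the
                // repository fd; later callers get the cached fd back
                todo!()
            })
        }
    }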
Let's just have an `async fn main()` instead of doing this ourselves.
This also gets us access to the multithreaded executor, which we'll
start using soon.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
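
Presumably this is tokio's attribute macro, whose default flavor is the multi-threaded runtime; a sketch:

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        // the macro builds the (multi-threaded) runtime and calls
        // block_on() for us; no hand-rolled Runtime setup needed
        todo!()
    }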
Our implementations trivially satisfy all of these constraints.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Our `SplitStreamWriter` and `oci::ImageOp` structs contain simple
references to `Repository`, which results in some awkward lifetime rules
on those structs.  We can simplify things substantially if we lean into
ref-counting a bit more.

I'm not yet ready to declare that Repository is always refcounted, but
for operations involving splitstreams (including oci downloads) it is
now required.

The ergonomics of this change surprised me.  The Deref trait on `Arc<>`
and the ability to define `self: &Arc<Self>` methods makes this all
quite nice to use.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
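
A sketch of the pattern (assumed shape, not the exact code): `Arc`'s `Deref` lets an `Arc<Repository>` call ordinary `&self` methods, while methods that need to keep the repository alive can take `self: &Arc<Self>`:

    use std::sync::Arc;

    struct Repository;

    struct SplitStreamWriter {
        // owns a refcount: no lifetime parameter tying the writer to a borrow
        repo: Arc<Repository>,
    }

    impl Repository {
        // callable only through an Arc; bumps the refcount for the writer
        fn create_stream(self: &Arc<Self>) -> SplitStreamWriter {
            SplitStreamWriter { repo: Arc::clone(self) }
        }
    }

A caller holding `repo: Arc<Repository>` can still write `repo.create_stream()` directly, which is what makes the ergonomics pleasant.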
layers.sort_by_key(|(mld, ..)| Reverse(mld.size()));

// Bound the number of tasks to the available parallelism.
let threads = available_parallelism()?;
A collaborator commented:

This case is tricky because we're intermixing CPU and I/O work. One problem I've seen in the past is running on servers with large CPU counts (e.g. 64+) but comparatively limited I/O bandwidth; in those cases we end up with 64 threads competing pointlessly for the limited I/O.

Unfortunately there's no convenient way that I know of to get an estimate of the available I/O parallelism, but in some equivalent places I've capped it at an arbitrary number like 4.

allisonkarlitskaya replied:

It's a bit tricky because we do a fair amount of computation in these workers as well: we compute the full Merkle tree twice (once in userspace, once in the kernel). And some of those threads will be sleeping some of the time, because they're doing fdatasync() or blocked on the download or whatever. I considered doing something like 2 * available_parallelism() in fact.

Different hardware combinations could end up being either CPU- or I/O-bound, but if we use at least available_parallelism() threads then we stand a decent chance of keeping the CPUs busy. If I/O throttles, it'll end up slowing down the CPU work anyway...

let mut entries = vec![];
for (mld, diff_id) in layers {
let self_ = Arc::clone(self);
let permit = Arc::clone(&sem).acquire_owned().await?;
A collaborator commented:

Ah yeah, using a semaphore for this makes sense.

allisonkarlitskaya replied:

Yeah, I think I'll stick with it here...
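
Pieced together from the snippets above, the bounding pattern looks roughly like this (a sketch with the details elided; `Layer` and `process()` are stand-ins for the real types and per-layer work):

    use std::sync::Arc;
    use std::thread::available_parallelism;

    use tokio::sync::Semaphore;

    struct Layer;
    async fn process(_layer: Layer) -> anyhow::Result<()> { todo!() }

    async fn pull_layers(layers: Vec<Layer>) -> anyhow::Result<()> {
        // one permit per hardware thread bounds the number of in-flight tasks
        let sem = Arc::new(Semaphore::new(available_parallelism()?.get()));

        let mut handles = vec![];
        for layer in layers {
            // acquire_owned() moves the permit into the spawned task; it's
            // released when dropped, letting the next task start
            let permit = Arc::clone(&sem).acquire_owned().await?;
            handles.push(tokio::spawn(async move {
                let result = process(layer).await;
                drop(permit);
                result
            }));
        }
        for handle in handles {
            handle.await??; // JoinError first, then the task's own error
        }
        Ok(())
    }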

let permit = Arc::clone(&sem).acquire_owned().await?;
let layer_sha256 = sha256_from_digest(diff_id)?;
let descriptor = mld.clone();
let future = tokio::spawn(async move {
A collaborator commented:

See https://github.com/bootc-dev/bootc/blob/2de8e0d23fb89bed76722ff1466614afacec64b3/lib/src/fsck.rs#L195, which uses a JoinSet; it's designed for this, and in particular it enforces structured concurrency.

allisonkarlitskaya replied:

Thanks for the pointer. I used this for the HTTP downloader.
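
For comparison, a minimal JoinSet sketch (assumed shape; `fetch()` is a placeholder):

    use tokio::task::JoinSet;

    async fn fetch(url: String) -> anyhow::Result<Vec<u8>> { todo!() }

    async fn download_all(urls: Vec<String>) -> anyhow::Result<()> {
        let mut set = JoinSet::new();
        for url in urls {
            set.spawn(fetch(url));
        }
        // join_next() yields results as tasks finish; dropping the set
        // aborts anything still running, which is the structured part
        while let Some(result) = set.join_next().await {
            let _bytes = result??;
        }
        Ok(())
    }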

env_logger = "0.11.0"
hex = "0.4.0"
indicatif = { version = "0.17.0", features = ["tokio"] }
log = "0.4.8"
oci-spec = "0.7.0"
once_cell = { version = "1.21.3", default-features = false }
A collaborator commented:

This change looks spurious? Also there's no reason to use the external crate since the functionality got merged into std.

allisonkarlitskaya replied:

I mentioned that in the commit message: f286208

We use the external `once_cell` crate (which we already had as a dev-dependency) because the fallible (`try_`) version of the API is not yet stable in the standard library.

allisonkarlitskaya (Author) commented:

The only thing blocking this, in my opinion, is that it would be nice to have a new containers-image-proxy-rs release. Also: I might drop the `await` on the driver, as we discussed in bootc-dev/containers-image-proxy-rs#80 (comment).

Add an async version of `Repository::ensure_object()` and wire it
through `SplitStreamWriter::write_external()`.  Call that when we're
splitting OCI layer tarballs to offload the writing of external objects
(and the `fdatasync()` that goes with it) to a separate thread.  This is
something like prep work for what we've been trying to accomplish for a
while in containers#62, but it doesn't come close to the complete
picture (since it still writes the objects sequentially).

Modify the (already) async code in oci::ImageOp to download layers in
parallel.  This is a big deal for images with many layers (as is often
the case for bootc images, due to the splitting heuristics).  This takes
a pull of the Fedora Silverblue 42 container image (from a local
`oci-dir`) from ~90s to ~8.5s on my laptop.

Unfortunately, container images made from large single layers are not
substantially improved.

In order to make this change we need to depend on a new version of
containers-image-proxy-rs, which makes `ImageProxy: Send + Sync`, so bump
our required version to the one released today.

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
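
A sketch of what the offloading might look like, assuming tokio's `spawn_blocking()` and a synchronous `ensure_object()` (names follow the commit message; the signatures are guesses):

    use std::sync::Arc;

    struct ObjectId;
    struct Repository;

    impl Repository {
        // the existing synchronous path: write the object, fdatasync() it
        fn ensure_object(&self, _data: &[u8]) -> anyhow::Result<ObjectId> {
            todo!()
        }

        // async wrapper: push the blocking write onto tokio's blocking
        // thread pool so the splitstream task isn't stalled by fdatasync()
        async fn ensure_object_async(self: &Arc<Self>, data: Vec<u8>) -> anyhow::Result<ObjectId> {
            let repo = Arc::clone(self);
            tokio::task::spawn_blocking(move || repo.ensure_object(&data)).await?
        }
    }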
This can cause the proxy's control channel to experience lockups in
heavily-threaded situations, and we don't gain any benefit from doing it.
Just drop it.

See bootc-dev/containers-image-proxy-rs#80

Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
allisonkarlitskaya merged commit e4e8c28 into containers:main on Apr 29, 2025 (11 checks passed).