Using a new Docker version (27) to run an old-ish docker image saved in the dataset causes DataLad error (image ID mismatch)

With Docker 27, trying to run a Docker container which was saved using an older version of Docker results in an error:

```
>python -m datalad_container.adapters.docker run container/image sh -c "echo 123"
(...)
RuntimeError: docker image sha256:f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3 was not successfully loaded
```

Docker loads an image, but its ID does not match what DataLad expects based on the image that was stored:

```
>docker image ls
REPOSITORY   TAG       IMAGE ID       CREATED         SIZE
remodnav     latest    81aaa31870f5   16 months ago   3.8GB
```

This was observed when trying to reproduce [paper-remodnav](https://github.com/psychoinformatics-de/paper-remodnav/) ([versioned link](https://github.com/psychoinformatics-de/paper-remodnav/tree/57d2565ab70b9d7b41c28aaa253ccc7096900f8d)), and snippets in this issue are based on that dataset.

## Which software versions are affected?

Unclear. The problem was observed and later confirmed on Windows with Docker version 27.5.1. For me, the problem does not replicate on Debian 12 (bookworm) with Docker version 20.10.4 (`docker.io` package). @mih reports that it still works on his laptop, with v26.1.5.

As far as saving the image goes, I don't know which Docker version was used; however, I suppose < 25 for reasons explained below.

## Where in the code does the problem happen?

The error message comes from the `load` function in `datalad_container.adapters.docker`:

https://github.com/datalad/datalad-container/blob/55309f8203ffd124668370deedd187b7a6420bdb/datalad_container/adapters/docker.py#L110-L150

The function performs a relatively simple operation: it creates a tar file object from the contents of the requested directory, and pipes it directly into `docker load` (all done with streams, without saving intermediate files). It then compares the image ID reported by docker to the one inferred from the image stored in the dataset - this is where the error is raised.

The expected ID is returned by `get_image`:

https://github.com/datalad/datalad-container/blob/55309f8203ffd124668370deedd187b7a6420bdb/datalad_container/adapters/docker.py#L88-L107

Again, the operation is relatively simple. The function opens the image manifest stored in the dataset, opens the config file it points to, and hashes its content.

## Investigating the docker save layout and speculation about IDs

With that dataset, I am able to mimic DataLad's approach in creating the tar file, and save it to a file for further inspection and for loading with `docker load -i`:

``` pycon
>>> import tarfile
>>> with tarfile.open("img.tar", mode="w|", dereference=True) as tar:
...     tar.add("container\\image", arcname="")
```

Note: I tried writing the tar file on both GNU/Linux and Windows. The files had different checksums (new line characters? tar header?) but both produced the same image ID when loaded on Windows.

With that, I also tried a `docker load` - `docker save` round-trip. Docker 27 has no problem loading an image generated from the dataset content in the manner above. When saving, it produces a different layout - one that is OCI compatible in fact. See [OCI image format specification](https://github.com/opencontainers/image-spec) and, in particular, the part about [Image layout](https://github.com/opencontainers/image-spec/blob/main/image-layout.md).

The change in save layout was most likely introduced in Docker 25 - the [release notes for Docker Engine 25.0.0](https://docs.docker.com/engine/release-notes/25.0/#2500) include "The docker image save tarball output is now OCI compliant".

This is the layout of a tar file created from the dataset:

```
img_dataset
├── 360338cd2a802f4812f06fbc50237a42bc0303390efa7fa321c381e6ec36d1ae
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 705094a41713537ec5205e79423114633a7225bae388e7ba823d92126c6b36c0
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3.json
├── manifest.json
└── repositories
```

And this is the one created after running `docker load` and `docker save`:

```
img_load_save
├── blobs
│   └── sha256
│       ├── 81aaa31870f52a6265bef39d0be0df7f82bab3839344ec8da54cc6c18e3fd7a0
│       ├── d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e
│       ├── e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb
│       └── f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3
├── index.json
├── manifest.json
└── oci-layout
```

Note that the blobs include both `81aaa` (which matches the image ID reported by Docker 27) and `f881b` (which matches the ID that DataLad expected to see, and more than likely also the ID that Docker 20 would report).

Let's explore the new layout then (note: all JSON contents below are presented with `jq` for readability). First, there is `manifest.json`:

``` json
[
  {
    "Config": "blobs/sha256/f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3",
    "RepoTags": [
      "remodnav:latest"
    ],
    "Layers": [
      "blobs/sha256/d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e",
      "blobs/sha256/e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb"
    ]
  }
]
```

The manifest references the config with `f881b` checksum - this is the "old" config, and the one DataLad would look at when determining the expected image ID! However, according to the OCI Image Layout Specification, this manifest is a "file associated with a backwards compatible docker save format", and is not part of the spec.

The mandatory file, according to the OCI spec, is `index.json`, and here are its contents:

``` json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:81aaa31870f52a6265bef39d0be0df7f82bab3839344ec8da54cc6c18e3fd7a0",
      "size": 586,
      "annotations": {
        "io.containerd.image.name": "docker.io/library/remodnav:latest",
        "org.opencontainers.image.ref.name": "latest"
      }
    }
  ]
}
```

This index file points to a manifest, with a digest (`81aaa`) matching the ID of the image as reported by Docker 27.

Here is the content of that manifest, i.e. `blobs/sha256/81aaa...`:

``` json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3",
    "size": 3157
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "digest": "sha256:d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e",
      "size": 77814784
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "digest": "sha256:e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb",
      "size": 1750877184
    }
  ]
}
```

This manifest points to a config file with `f881b` digest, i.e. exactly the one from the dataset!

It would seem that it is this manifest, rather than the config file, that Docker 27 uses as the basis for the image ID. Given that it is checksums (of the config and the layers) all the way down, the two seem equivalent, with Docker now hashing a "higher-level" metadata file. That said, I wasn't able to find any mention of the ID change in Docker's release notes or documentation, so this is speculation based on comparing the save layouts and reading the OCI spec.
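The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, assuming an unpacked `docker save` tarball in the new layout shown earlier; the function name `candidate_ids` is hypothetical and not part of DataLad or Docker.

```python
import hashlib
import json
from pathlib import Path


def candidate_ids(layout_dir):
    """Return (config_digest, manifest_digest) for an unpacked OCI-layout save.

    config_digest is what DataLad's get_image would compute (hash of the
    config file named in manifest.json); manifest_digest is the digest that
    index.json records for the image manifest.
    """
    root = Path(layout_dir)

    # Backwards-compatible manifest.json: "Config" is a path to the config blob.
    legacy = json.loads((root / "manifest.json").read_text())
    config_bytes = (root / legacy[0]["Config"]).read_bytes()
    config_digest = "sha256:" + hashlib.sha256(config_bytes).hexdigest()

    # OCI index.json: the first entry carries the digest of the image manifest.
    index = json.loads((root / "index.json").read_text())
    manifest_digest = index["manifests"][0]["digest"]

    # Sanity check: the manifest blob should hash to the digest it is filed under.
    algorithm, hex_digest = manifest_digest.split(":", 1)
    blob = (root / "blobs" / algorithm / hex_digest).read_bytes()
    if hashlib.new(algorithm, blob).hexdigest() != hex_digest:
        raise ValueError(f"manifest blob does not match {manifest_digest}")

    return config_digest, manifest_digest
```

For the image discussed here, the first value would correspond to `f881b...` and the second to `81aaa...`.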

## How can we fix this?

This is unclear at the moment.

If I am right about Docker 27's ID being based on a metadata representation which is equivalent to, but different from, the file saved in the dataset, this means that with the old layout we can't know the ID upfront (unless we try to create the manifest ourselves, which seems doable but finicky).
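To give a sense of what creating the manifest ourselves might involve, here is a rough sketch. It rests on several assumptions that are not confirmed by Docker's documentation: that the old-layout `manifest.json` lists each layer under `"Layers"` as a path relative to the image directory, that the media types are the ones seen in the manifest shown earlier, and that Docker serializes the manifest as compact JSON with keys in this order. The function names are hypothetical.

```python
import hashlib
import json
import os

MANIFEST_TYPE = "application/vnd.docker.distribution.manifest.v2+json"
CONFIG_TYPE = "application/vnd.docker.container.image.v1+json"
LAYER_TYPE = "application/vnd.docker.image.rootfs.diff.tar"


def _descriptor(path, media_type, chunk_size=1 << 20):
    """Describe a file by media type, sha256 digest, and size in bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Layers can be gigabytes, so hash them in chunks.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "mediaType": media_type,
        "digest": "sha256:" + digest.hexdigest(),
        "size": os.path.getsize(path),
    }


def build_manifest(image_dir):
    """Build a manifest dict from the first entry of an old-layout manifest.json."""
    with open(os.path.join(image_dir, "manifest.json")) as stream:
        entry = json.load(stream)[0]
    return {
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": _descriptor(os.path.join(image_dir, entry["Config"]), CONFIG_TYPE),
        "layers": [
            _descriptor(os.path.join(image_dir, layer), LAYER_TYPE)
            for layer in entry["Layers"]
        ],
    }


def serialize_manifest(manifest):
    """ASSUMPTION: compact JSON, no trailing newline, keys in insertion order."""
    return json.dumps(manifest, separators=(",", ":")).encode()


def guess_manifest_digest(manifest):
    """Hash the assumed serialization; may not match what Docker reports."""
    return "sha256:" + hashlib.sha256(serialize_manifest(manifest)).hexdigest()
```

Serialized this way, the manifest shown earlier comes to 586 bytes, the same as the `size` recorded in `index.json`. That is consistent with the serialization assumption, but it does not prove that the bytes (and therefore the digest) are identical.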

One possible workaround would be to simply drop the ID check which produced an error. We would still rely on an exit code from `docker load` giving us some assurance that loading succeeded, so it does not sound entirely wrong.

However, the expected ID is checked twice against the list of images present in Docker. The first check decides whether the image needs to be loaded at all, so leaving that part unchanged would mean loading the image every time the function is called, which sounds bad.
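As an illustration of relying on what `docker load` itself reports rather than on a precomputed ID, the sketch below parses its standard output. It assumes the `Loaded image: <name:tag>` and `Loaded image ID: <id>` lines that `docker load` prints on success; the helper name `parse_docker_load_output` is hypothetical, and this is not what DataLad currently does.

```python
import re

# Matches "Loaded image: repo:tag" as well as "Loaded image ID: sha256:...".
_LOADED_RE = re.compile(r"Loaded image(?P<is_id> ID)?: (?P<ref>\S+)")


def parse_docker_load_output(output):
    """Return a list of (kind, reference) pairs, with kind being "tag" or "id"."""
    loaded = []
    for line in output.splitlines():
        match = _LOADED_RE.fullmatch(line.strip())
        if match:
            kind = "id" if match.group("is_id") else "tag"
            loaded.append((kind, match.group("ref")))
    return loaded
```

The returned reference could then be handed to `docker image inspect --format '{{.Id}}' <ref>` to obtain the ID as the running Docker version reports it. On its own, this only covers the post-load check; it does not address the first check that decides whether loading is needed.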

For reference, these are the `load` and `get_image` functions at the commit linked above:

	def load(path, repo_tag, config):
	    """Load the Docker image from `path`.

	    Parameters
	    ----------
	    path : str
	        A directory with an extracted tar archive.
	    repo_tag : str or None
	        `image:tag` of image to load
	    config : str or None
	        "Config" value or prefix of image to load

	    Returns
	    -------
	    The image ID (str)
	    """
	    # FIXME: If we load a dataset, it may overwrite the current tag. Say that
	    # (1) a dataset has a saved neurodebian:latest from a month ago, (2) a
	    # newer neurodebian:latest has been pulled, and (3) the old image have been
	    # deleted (e.g., with 'docker image prune --all'). Given all three of these
	    # things, loading the image from the dataset will tag the old neurodebian
	    # image as the latest.
	    image_id = "sha256:" + get_image(path, repo_tag, config)
	    if image_id not in _list_images():
	        lgr.debug("Loading %s", image_id)
	        cmd = ["docker", "load"]
	        p = sp.Popen(cmd, stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
	        with tarfile.open(fileobj=p.stdin, mode="w|", dereference=True) as tar:
	            tar.add(path, arcname="")
	        out, err = p.communicate()
	        return_code = p.poll()
	        if return_code:
	            lgr.warning("Running %r failed: %s", cmd, err.decode())
	            raise sp.CalledProcessError(return_code, cmd, output=out)
	    else:
	        lgr.debug("Image %s is already present", image_id)

	    if image_id not in _list_images():
	        raise RuntimeError(
	            "docker image {} was not successfully loaded".format(image_id))
	    return image_id

	def get_image(path, repo_tag=None, config=None):
	    """Return the image ID of the image extracted at `path`.
	    """
	    manifest_path = op.join(path, "manifest.json")
	    with open(manifest_path) as fp:
	        manifest = json.load(fp)
	    if repo_tag is not None:
	        manifest = [img for img in manifest if repo_tag in (img.get("RepoTags") or [])]
	    if config is not None:
	        manifest = [img for img in manifest if img["Config"].startswith(config)]
	    if len(manifest) == 0:
	        raise ValueError(f"No matching images found in {manifest_path}")
	    elif len(manifest) > 1:
	        raise ValueError(
	            f"Multiple images found in {manifest_path}; disambiguate with"
	            " --repo-tag or --config"
	        )

	    with open(op.join(path, manifest[0]["Config"]), "rb") as stream:
	        return hashlib.sha256(stream.read()).hexdigest()

Using a new Docker version (27) to run an old-ish docker image saved in the dataset causes DataLad error (image ID mismatch) #269
