Add purls (Package URLs) to PackageRecord
#63
Awesome CEP! :)
|
|
||
| ## Abstract | ||
|
|
||
| This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package Urls) of repackaged packages to identify packages across multiple ecosystems. |
Can you add a link to the definition of a PackageRecord? I struggle to find an authoritative source for it.
Unfortunately, I believe that atm there is no actual "authoritative" source.
There is this relatively old definition of a RepoDataRecord: https://github.com/conda/schemas/blob/main/repodata-record-1.schema.json
There is this new effort to document the schemas better (conda/schemas#26) where it's also called RepoDataRecord: https://github.com/conda/schemas/blob/b143c82a71833570fbe9be2313368b33c0e84726/conda_models/package_record.py#L23
And we have the definition in rattler: https://docs.rs/rattler_conda_types/latest/rattler_conda_types/struct.PackageRecord.html
In rattler (and I believe in conda as well), there is this distinction:
- PackageRecord: contains all the fields for a single entry in the repodata.json
- RepoDataRecord: inherits all fields from PackageRecord and adds fields to identify the origin of the data (channel, url, etc.)
- PrefixRecord: inherits all fields from RepoDataRecord and additionally stores information about how the package was installed.
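The three-level distinction above can be sketched with Python dataclasses. This is only an illustration of the layering, assuming plain inheritance as the comment describes it; the field selection is heavily abbreviated, the `files` field is a stand-in for "how the package was installed", and neither conda nor rattler necessarily models it this way.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PackageRecord:
    """One entry in repodata.json (abbreviated, illustrative field set)."""
    name: str
    version: str
    build: str
    build_number: int = 0
    depends: list[str] = field(default_factory=list)
    # The optional field proposed by this CEP.
    purls: Optional[list[str]] = None


@dataclass
class RepoDataRecord(PackageRecord):
    """A PackageRecord plus where it came from."""
    channel: str = ""
    url: str = ""


@dataclass
class PrefixRecord(RepoDataRecord):
    """A RepoDataRecord plus installation details (hypothetical field)."""
    files: list[str] = field(default_factory=list)


record = PrefixRecord(
    name="pinject",
    version="0.14.1",
    build="pyh29332c3_1",
    purls=["pkg:pypi/pinject@0.14.1"],
    channel="conda-forge",
)
```

Because `purls` lives on the innermost type, it would be visible at every level, which is why the CEP targets `PackageRecord`.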
Yea, I think the most "official" source for this is https://github.com/conda/conda/blob/e783377439ed1c413c6bffb9b785ae1d79c2392a/conda/models/records.py#L247. That module also offers some sort of definition in the top-level docstring.
Implementation of conda/ceps#63
This PR adds support for checking the satisfiability of the lock-file which includes pypi-dependencies. Purls have been added to the lock-file (conda/rattler#414) (See also: conda/ceps#63). This enables checking which conda packages will install which pypi packages without needing to check the internet. This ensures we can still check if a lock-file is up to date quickly. I did not profile this code but I think there are a lot of places we can improve the performance. That's for a later PR. I also didn't add tests. I think we should but we can also do that in another PR. Closes #467 --------- Co-authored-by: Ruben Arts <ruben.arts@hotmail.com>
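As a rough sketch of the offline check described above: given locked conda records that carry `purls`, one can work out which PyPI projects they already provide. This is not the code from the referenced PR; the record shape (plain dicts with `name` and `purls`) and the helper names are assumptions for illustration. The name normalization follows the PyPI convention of collapsing runs of `-`, `_` and `.`.

```python
import re

PYPI_PREFIX = "pkg:pypi/"


def normalize_pypi_name(name: str) -> str:
    """Normalize a PyPI project name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pypi_packages_provided(conda_records: list[dict]) -> dict[str, str]:
    """Map normalized PyPI names to the conda package that provides them."""
    provided: dict[str, str] = {}
    for record in conda_records:
        # Records without purls simply contribute nothing.
        for purl in record.get("purls") or []:
            if not purl.startswith(PYPI_PREFIX):
                continue
            # Drop the version, qualifiers and subpath, keep only the name.
            rest = purl[len(PYPI_PREFIX):]
            raw_name = re.split(r"[@?#]", rest, maxsplit=1)[0]
            provided[normalize_pypi_name(raw_name)] = record["name"]
    return provided


locked = [
    {"name": "pinject", "purls": ["pkg:pypi/pinject@0.14.1"]},
    {"name": "typing_extensions", "purls": ["pkg:pypi/Typing_Extensions@4.8.0"]},
    {"name": "openssl"},
]
provided = pypi_packages_provided(locked)
```

No network access is needed: the mapping comes entirely from data already stored in the lock-file.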
cep-purls.md
Outdated
| * We can keep this information close to the conda package description. | ||
| * We can incrementally add `purls` through repodata patches. | ||
|
|
||
| The downside is that the (already large) repodata.json file will grow. |
What if we add a separate-yet-adjacent purls.json like we did with run_exports.json in CEP-12?
I think we should just include them with sharded repodata, but for backwards compatibility it seems logical to go the same route as run exports as well (also allowing patching them).
jaimergp
left a comment
I like the idea and I will be supportive. Having this metadata readily available would allow us to be listed in repology.org, for example! It would also play nicely with the (draft) PEP-725 for external metadata in PyPI.
However, I think this CEP right now is talking about serving metadata before we have discussed how to source it, define it and store it.
Whatever ends up in the repodata.json comes, in part, from the info/index.json metadata inside the conda artifact. Then this is augmented with things like sha256 and final size by conda-index (because they cannot be known when the package is being archived).
So before we speak about repodata, we should discuss where in the inner artifact metadata we will store the PURL info. To answer that, we must answer where in the conda-build recipe we will include that information :D
IOW, I'd like to know your thoughts about:
- Where in the current meta.yaml we should define the PURLs. about seems to be the most obvious one, which means this will probably end up in info/about.json.
- Whether to serve the PURLs separately in a purls.json or not. I honestly don't think putting it in repodata.json is a good idea. I get that it makes sense if you want to have a canonical link between PyPI and conda-forge so Pixi can solve things nicely. It might also be served in channeldata.json (since most of the time PURLs are tied to the source, not the platform-dependent target artifact).
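The flow described in this comment (metadata from info/index.json, augmented with values that can only be known once the archive exists) can be sketched as follows. This is a minimal illustration, not conda-index's actual code; the function name and the dict-based record are assumptions.

```python
import hashlib


def repodata_entry(index_json: dict, artifact: bytes) -> dict:
    """Build a repodata-style entry from index.json plus the final archive."""
    # Start from a copy so the artifact's own metadata is left untouched.
    entry = dict(index_json)
    # These two values cannot be stored inside the archive they describe,
    # so the indexer has to compute them afterwards.
    entry["sha256"] = hashlib.sha256(artifact).hexdigest()
    entry["size"] = len(artifact)
    return entry


index_json = {
    "name": "pinject",
    "version": "0.14.1",
    "build": "pyh29332c3_1",
    "purls": ["pkg:pypi/pinject@0.14.1"],
}
entry = repodata_entry(index_json, b"fake archive bytes")
```

Under this model, anything placed in index.json (including a `purls` list) flows into the served record unless the indexer filters it out, which is why the storage location inside the artifact matters.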
|
Would this also help us address Repology's needs for supporting Conda packages (repology/repology-updater#518)? Edit: Nvm, missed that Jaime has the same idea
I agree that While this would facilitate simplicity, avoid redundancy, and avoid errors in the recipe, I see the following downsides with that solution:
I do not have a strong opinion here since I am not too involved with the tools that would need to process that data. |
|
I think a broader question is whether
To put this in the context of the above, a given It might make sense to advocate for some changes to the
While i don't think much can be done about "where you got the source tarball" (because GitHub sources, etc), I don't think a recipe author should have to calculate all these things... but certainly could given the available data today:
# meta.yaml
{% set version = "1.10.1" %}
package:
  name: django
  version: {{ version }}
# ...
about:
  # ...
  purls:
    - pkg:pypi/django@{{ version }}
    # this should be fully automated, either at build time (weird?) or trivially-derivable
    - pkg:conda/{{ channel_targets.split(" ")[0] }}/django@1.10.1?subdir={{ target_platform }}&label={{ channel_targets.split(" ")[1] }}&build=py{{ py }}_{{ build_number }}
So the above full purls:
- pkg:pypi/django@1.10.1
- pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0
|
Thinking about this more in the context of "accidental cross-ecosystem namesquatting" on zulip: as dependencies:
- pkg:pypi/django >=1.10.1,<1.11
treating everything after the whitespace as "this part is about conda" would still allow for all our variant business, but presumably could eventually be expanded to allow per-ecosystem fields... luckily, pypi only has semi-irrelevant stuff like
I don't follow entirely. What would your example refer to? The PyPI package or the corresponding conda-forge package?
Right, the user wants the corresponding
# e.g. in pixi.toml
[dependencies]
# | a new package identifier
# V
"pkg:pypi/django" = ">=1.10.1,<1.11"
# ^
# | the conda constraints, in the MatchSpec grammar
"pkg:golang/github.com/rhysd/actionlint" = ">=1.7.7"
Where this would be most excellent, for the PyPI case, is if the spec There is no consensus in An extreme case might be
# e.g. in rattler-build recipe.yaml
recipe:
  version: ${{ version }}
outputs:
  # with fully-specified purls
  - package:
      name: fastapi
      purl: pkg:pypi/fastapi@${{ version }}
    dependencies:
      run:
        - pkg:pypi/starlette >=0.40.0,<0.42.0
        - pkg:pypi/pydantic >=1.7.4,!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0
        - pkg:pypi/typing-extensions >=4.8.0
  # or maybe it makes sense to CURIE them, using a `pip:`-like syntax
  - package:
      name: fastapi-standard
      purl:
        pkg:pypi:
          - fastapi[standard]@${{ version }}
    dependencies:
      run:
        - ${{ pin_subpackage("fastapi", exact=True) }}
        - pkg:pypi:
            - fastapi-cli[standard] >=0.0.5
            - httpx >=0.23.0
            - jinja2 >=2.11.2
            - python-multipart >=0.0.7
            - itsdangerous >=1.1.0
            - pyyaml >=5.3.1
            - ujson >=4.0.1,!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0
            - orjson >=3.2.1
            - email-validator >=2.0.0
            - uvicorn[standard] >=0.12.0
            - pydantic-settings >=2.0.0
            - pydantic-extra-types >=2.0.0
The latter form would all but remove any package-naming impedance, making tools
|
Opened #114 to keep track of the alternatives suggested in the latest comments!
|
|
||
| ## Specification | ||
|
|
||
| We propose to add the optional `purls: [string]` field to `PackageRecord`. |
As @jaimergp mentioned, we should also mention in this CEP where in the built conda package this should live and where it is specifiable in the recipe.
IMO, index.json is a good fit for it, since it should also go into repodata (at least into sharded).
Adding it to index.json will lead to conda-index automatically adding it to repodata.json. This is IMO a bad default in conda-index; it should instead just keep a whitelist of things to put into repodata.json instead of putting everything from index.json in there.
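The allowlist idea in this comment could look roughly like the sketch below. It is only an illustration: the key set and function name are hypothetical and do not reflect what conda-index does today.

```python
# Hypothetical allowlist; the real set of repodata keys would have to be
# agreed on separately. "purls" is included to match this CEP's proposal.
REPODATA_ALLOWED_KEYS = frozenset(
    {
        "name",
        "version",
        "build",
        "build_number",
        "depends",
        "constrains",
        "license",
        "subdir",
        "timestamp",
        "purls",
    }
)


def filter_for_repodata(index_json: dict, allowed=REPODATA_ALLOWED_KEYS) -> dict:
    """Copy only explicitly allowed keys from index.json into repodata."""
    return {key: value for key, value in index_json.items() if key in allowed}


filtered = filter_for_repodata(
    {
        "name": "pinject",
        "version": "0.14.1",
        "purls": ["pkg:pypi/pinject@0.14.1"],
        "some_future_field": {"large": "blob"},
    }
)
```

With an explicit allowlist, adding a new field to index.json no longer grows repodata.json by accident; serving it becomes a deliberate decision.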
|
i included some comments, especially about where PURLs are stored in a recipe, where they are stored in the repodata, how to patch it as well as what a PURL of a conda package itself looks like. PTAL again
| Tools that generate packages like rattler-build and conda-build should also be able to inject a PURL at build time via CLI flags. | ||
| For this, v1-recipes can use the jinja syntax from the recipes. | ||
|
|
||
| ```bash | ||
| rattler-build build -r recipe/ \ | ||
| --append-purl-pattern \ | ||
| 'pkg:conda/conda-forge/${{ PACKAGE_NAME }}@${{ PACKAGE_VERSION }}?build=${{ BUILD_NUMBER }}' | ||
| ``` | ||
|
|
||
| Variables that are available in the build process (like `PACKAGE_NAME`, `PACKAGE_VERSION` and `BUILD_NUMBER`) can be specified here. |
Not sure about this one yet. We need some way of adding the PURL to conda-forge packages.
An alternative would be to enforce with conda-smithy that something like this is always in the recipe:
about:
  purls:
    - pkg:pypi/pinject@0.14.1
    - pkg:conda/conda-forge/${{ name }}@${{ version }}?build=${{ build }}
this would be a bit more explicit. But we would not be easily able to add things like &platform=osx-arm64
Passing it through the infrastructure via CLI flags makes sure it's correct and would immediately apply to each new build though.
> Passing it through the infrastructure via CLI flags makes sure it's correct and would immediately apply to each new build though.
Yeah, I think there is no reason why conda-forge feedstock maintainers should be able to change the purl of their package
| Conda packages itself should have a PURL as well. | ||
| This makes it possible for CVE Numbering Autorities (CNA) to publish vulnerabilities for conda packages. | ||
| A package `pinject-0.14.1-pyh29332c3_1.conda` published on `conda-forge` should have the PURL `pkg:conda/conda-forge/pinject@0.14.1?build=pyh29332c3_1&platform=noarch`. | ||
| Packages can also specify a custom `repository_url`, for example `pkg:conda/custom-channel/my-package@0.1.0?build=pyh29332c3_1&platform=noarch&repository_url=prefix.dev`. |
If I'm not mistaken, repository_url should be an actual URL, not a domain name: https://github.com/package-url/purl-spec/blob/main/PURL-SPECIFICATION.rst#known-qualifiers-keyvalue-pairs
> repository_url is an extra URL for an alternative, non-default package repository or registry. When a package does not come from the default public package repository for its type a purl may be qualified with this extra URL. The default repository or registry of a type is documented in the "Known purl types" section.
hm, in their examples they use
pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io
pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?repository_url=repo.spring.io/release
|
|
||
| Conda packages itself should have a PURL as well. | ||
| This makes it possible for CVE Numbering Autorities (CNA) to publish vulnerabilities for conda packages. | ||
| A package `pinject-0.14.1-pyh29332c3_1.conda` published on `conda-forge` should have the PURL `pkg:conda/conda-forge/pinject@0.14.1?build=pyh29332c3_1&platform=noarch`. |
Should we worry that this diverges from https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#conda? This seems to be quite problematic IMO. In the spec, channel is a specifier (pkg:conda/pinject@0.14.1?channel=conda-forge), subdir is named subdir, not platform, etc.
Yep. Specs talk, and implementations walk: from a tooling perspective, we would likely need at least the first-party python and rust parser implementations to understand whatever key fields are proposed, and, similarly, coordinate with some current (if only partially) conda-supporting "leaf" packages that e.g. build SBOM.
Yes, for now we need to stick to the PURL spec (even if I'm not a big fan of how they are currently defining the conda type). Changes like the ones proposed here need to happen there first, if needed.
I asked the PURL team today about how to submit changes to their spec, and right now they are focused on some large rearrangements of the specification (a centralized schema for type definitions and tests), so it's bad timing to submit type-specific (e.g. conda) changes. But later this year this work will be done, so type-specific feedback will be welcome by then. I'll stay in touch with them to monitor the situation and report once we can submit some actionable items. In the meantime, we can brainstorm as a community what we'd like to propose to the standard.
Thanks @jaimergp. Should we maybe move this part out of this CEP to avoid blocking the CEP just for that?
Co-authored-by: Jean-Christophe Morin <38703886+JeanChristopheMorinPerso@users.noreply.github.com>
Happy to see more activity here! Some comments and clarifications.
I also wonder about the following. I think we are conflating two items:
- Allowing a recipe to specify which packages are included in a given conda artifact
- Standardizing the conventions to assign a PURL to each conda artifact
It appears that both would end up in the purls field, but that doesn't sound right to me. (1) is about enumerating components in a package, and (2) is about identifying a conda artifact. Mixing these two doesn't seem like a good idea.
I propose we split (2) from this CEP and create its own mini-CEP, which should simply recognize the PURL spec and how to use it in the conda ecosystem. After all, it just maps pkg:conda/numpy?channel=conda-forge to conda-forge::numpy.
Also please see #114 for some related discussions, I think part of the text there has some merit and I'd be happy to merge that PR here to centralize the conversation.
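The mapping mentioned in this comment (pkg:conda/numpy?channel=conda-forge to conda-forge::numpy) can be sketched in a few lines. This is a deliberately minimal illustration, not a full PURL parser: the function name is hypothetical, only the `channel` qualifier and an optional version are handled, and other qualifiers such as `subdir` or `build` are ignored.

```python
from urllib.parse import parse_qsl, unquote

CONDA_PREFIX = "pkg:conda/"


def conda_purl_to_spec(purl: str) -> str:
    """Turn a conda-type PURL into a 'channel::name' style spec string."""
    if not purl.startswith(CONDA_PREFIX):
        raise ValueError(f"not a conda PURL: {purl!r}")
    remainder = purl[len(CONDA_PREFIX):]
    # Strip an optional '#subpath', then split off '?qualifiers'.
    remainder, _, _subpath = remainder.partition("#")
    path, _, query = remainder.partition("?")
    name, _, version = path.partition("@")
    qualifiers = dict(parse_qsl(query))

    spec = unquote(name)
    channel = qualifiers.get("channel")
    if channel:
        spec = f"{channel}::{spec}"
    if version:
        spec += f"=={unquote(version)}"
    return spec


spec = conda_purl_to_spec("pkg:conda/numpy?channel=conda-forge")
```

A dedicated mini-CEP would still need to decide how the remaining qualifiers map onto the MatchSpec grammar.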
| @@ -0,0 +1,192 @@ | |||
| <table> | |||
| <tr><td> Title </td><td> Add package-urls to PackageRecord </td> | |||
| <tr><td> Title </td><td> Add package-urls to PackageRecord </td> | |
| <tr><td> Title </td><td> Add package URLs (PURLs) to PackageRecord </td> |
| <tr><td> Status </td><td> Draft </td></tr> | ||
| <tr><td> Author(s) </td><td> Bas Zalmstra <bas@prefix.dev>, Pavel Zwerschke <pavelzw@gmail.com> </td></tr> | ||
| <tr><td> Created </td><td> Nov 23, 2023</td></tr> | ||
| <tr><td> Updated </td><td> Nov 23, 2023</td></tr> |
| <tr><td> Updated </td><td> Nov 23, 2023</td></tr> | |
| <tr><td> Updated </td><td> Jul 15, 2025</td></tr> |
|
|
||
| ## Abstract | ||
|
|
||
| This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package Urls) of repackaged packages to identify packages across multiple ecosystems. |
| This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package Urls) of repackaged packages to identify packages across multiple ecosystems. | |
| This CEP describes a change to the `PackageRecord` schema and the corresponding `repodata.json` files to include a new field, `purls`, to list the package URLs (PURLs) of the packaged projects in each conda artifact. |
|
|
||
| ## Specification | ||
|
|
||
| We propose to add the optional `purls: [string]` field to `PackageRecord`. |
| We propose to add the optional `purls: [string]` field to `PackageRecord`. | |
| We propose to add the optional `purls: [string]` field to `PackageRecord`. |
I assume [string] is equivalent to Python's list[str]?
| ``` | ||
|
|
||
| PURL is already supported by dependency-related tooling like SPDX (see [External Repository Identifiers in the SPDX 2.3 spec](https://spdx.github.io/spdx-spec/v2.3/external-repository-identifiers/#f35-purl)), the [Open Source Vulnerability format](https://ossf.github.io/osv-schema/#affectedpackage-field), and the [Sonatype OSS Index](https://ossindex.sonatype.org/doc/coordinates); not having to wait years before support in such tooling arrives is valuable. | ||
| [PEP 725 (WIP)](https://peps.python.org/pep-0725) also proposes how to specify non-PyPi dependencies using PURLs. |
| [PEP 725 (WIP)](https://peps.python.org/pep-0725) also proposes how to specify non-PyPi dependencies using PURLs. | |
| [PEP 725 (draft)](https://peps.python.org/pep-0725) also proposes how to specify non-PyPI (external) dependencies in Python packages using PURLs. |
| ```bash | ||
| rattler-build build -r recipe/ \ | ||
| --append-purl-pattern \ | ||
| 'pkg:conda/conda-forge/${{ PACKAGE_NAME }}@${{ PACKAGE_VERSION }}?build=${{ BUILD_NUMBER }}' |
I don't see why we need to inject the PURL of the package being built into itself. All this information can be derived from the filename if we want to publish it somewhere else, but I don't see how publishing an easily buildable string adds value.
Also, the build qualifier is the build string not the number.
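To illustrate the point that the artifact's own PURL is derivable from the filename, here is a small sketch. The helper names are hypothetical; the qualifier names (`build`, `channel`, `subdir`) follow the purl-spec conda type as discussed elsewhere in this thread, and the channel and subdir have to be supplied by the caller because the filename does not contain them.

```python
from urllib.parse import quote


def parse_conda_filename(filename: str) -> tuple[str, str, str]:
    """Split '<name>-<version>-<build>.conda' (or .tar.bz2) into its parts."""
    for extension in (".conda", ".tar.bz2"):
        if filename.endswith(extension):
            stem = filename[: -len(extension)]
            break
    else:
        raise ValueError(f"unknown conda artifact extension: {filename!r}")
    # Names may contain '-', so split from the right: the last two fields
    # are always the version and the build string.
    name, version, build = stem.rsplit("-", 2)
    return name, version, build


def conda_purl(filename: str, channel: str, subdir: str) -> str:
    """Assemble a conda-type PURL with lexically sorted qualifiers."""
    name, version, build = parse_conda_filename(filename)
    # 'build' is the full build string, not the build number.
    qualifiers = {"build": build, "channel": channel, "subdir": subdir}
    query = "&".join(
        f"{key}={quote(value, safe='')}" for key, value in sorted(qualifiers.items())
    )
    return f"pkg:conda/{quote(name, safe='')}@{quote(version, safe='')}?{query}"


purl = conda_purl("pinject-0.14.1-pyh29332c3_1.conda", "conda-forge", "noarch")
```

Since every input is either in the filename or known to the channel serving it, a consumer can rebuild this string on demand instead of reading it from package metadata.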
| } | ||
| ``` | ||
|
|
||
| ### PURL of a conda packages |
I think this section deserves its own CEP.
| ## Motivation | ||
|
|
||
| Conda packages can mostly repackage packages from other ecosystems. | ||
| Conda-forge and other channels famously repackages a lot of PyPI packages. | ||
| However, without actually downloading the conda package and inspecting its contents there is no reliable way to know whether a certain conda package is a repackaged package and which package it repackages. | ||
|
|
||
| Tools like pixi or conda-lock try to combine conda and PyPI packages through heuristics. This doesn't work deterministically as package names between the two indices may differ. | ||
|
|
||
| Its hard to use open-source vulnerability databases because they often do not contain conda packages. | ||
| Using the PURL standard allows us to link vulnerabilities from other ecosystems to conda package. |
I would move this above Specification so it's clear why we are doing this. It would also need some more details about the intended usage and benefits, since there are many!
| "platform": "string", | ||
| "arch": "string", |
I think these are superfluous here.
| PURL is already supported by dependency-related tooling like SPDX (see [External Repository Identifiers in the SPDX 2.3 spec](https://spdx.github.io/spdx-spec/v2.3/external-repository-identifiers/#f35-purl)), the [Open Source Vulnerability format](https://ossf.github.io/osv-schema/#affectedpackage-field), and the [Sonatype OSS Index](https://ossindex.sonatype.org/doc/coordinates); not having to wait years before support in such tooling arrives is valuable. | ||
| [PEP 725 (WIP)](https://peps.python.org/pep-0725) also proposes how to specify non-PyPi dependencies using PURLs. | ||
|
|
||
| ### PURLs in recipes |
I'd like to read recommendations on which PURLs should be included, e.g.:
- Sources being built, seems obvious
- Vendored dependencies?
- Statically compiled libraries?
Yeah, the plain-old-list-of-strings probably isn't enough, and would benefit from reusing terms from e.g. SPDX or CycloneDX.
From a syntax perspective, perhaps consider:
- things that can be known in advance, defined directly as templated strings
  - e.g. "this {platform}/{pkg}-{version}-{build}.conda is an alternative distribution of that .tar.gz"
- things that will only be known after a build
  - e.g. "after a build, look on disk for this file, and add these things"
To use the slightly more concrete CycloneDX terms:
about:
  purl:
    - ancestor: pkg:pypi/pinject@${{ version }}?file_name=pinject-${{ version }}.tar.gz
But for the 90% case, indeed, it's going to be one ancestor:
about:
  purls: pkg:pypi/pinject@${{ version }}?file_name=pinject-${{ version }}.tar.gz
A more complex thing might be the worst-case scenario of a .conda that ships a statically compiled thing built in go, as well as some crazy npm garbage for a frontend. These would be generated at runtime, but ideally could be backed-out from e.g. go-licenses, etc.
about:
  purls:
    - ancestors:
        - purl: pkg:golang/github.com/jaegertracing/jaeger@${{ version }}
    - variants:
        - file: dumb-list-of-go-purls.txt
        - file: dumb-list-of-npm-purls.txt
Which could again be slightly compacted for the 80% case:
about:
  purls:
    - pkg:golang/github.com/jaegertracing/jaeger@${{ version }}
    - variants:
        - file:
            - dumb-list-of-go-purls.txt
            - dumb-list-of-npm-purls.txt
This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package URLs) of repackaged packages to identify packages across multiple ecosystems.