Add CEP for Repodata Wheel Support #145
Conversation
Co-authored-by: Travis Hathaway <travis.j.hathaway@gmail.com>
> ### Pixi Integrates with uv (Jan 2024)
>
> Pixi changes course to use uv directly instead of rip, which unlocks features like editable installations, and git and path dependencies.
These are all now available for conda-only workflows through pixi-build.
Hey @pavelzw, thanks so much for the feedback! Do you think we are missing a milestone in our brief history section? This pixi build feature is more about building path/git for conda packages than installing wheels, isn't it?
cep-XXXX.md (Outdated)
> This CEP introduces a new optional `artifact_url` field in package records to specify download locations for individual packages.
>
> > Note for this draft: The `artifact_url` field could also be added as a separate CEP to allow it for other record types.
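For concreteness, a repodata entry carrying the proposed field might look like the following sketch. All concrete names, versions, and URLs here are illustrative assumptions, not part of the CEP text:

```python
import json

# Hypothetical repodata entry: a pure-Python wheel listed alongside
# conda packages, with the proposed optional "artifact_url" field
# pointing at the wheel's download location. Every concrete value in
# this record is made up for illustration.
record = {
    "example_package-1.0.0-py3-none-any.whl": {
        "name": "example-package",
        "version": "1.0.0",
        "depends": ["python >=3.9"],
        "artifact_url": "https://files.pythonhosted.org/packages/.../example_package-1.0.0-py3-none-any.whl",
    }
}
print(json.dumps(record, indent=2))
```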
I think that would actually be a good idea to avoid asymmetries in package record specifications. Either that or we explicitly mention this new field is for all package record types.
I agree. If we have rough consensus that this is a good approach, we should probably split `artifact_url` into a separate CEP so that it can apply to all record types.
Co-authored-by: Travis Hathaway <travis.j.hathaway@gmail.com>
Thanks so much for all the updates, @travishathaway! I applied them locally and then pushed a commit 👍
I'm going to leave my opinions on the general goals/ideas/features of this CEP in an effort to help bring other perspectives to it. TL;DR: as a conda-forge/core developer, I personally would NOT recommend folks use this feature, nor would I enable or offer support for it on conda-forge. The CEP states
Here is a point-by-point explanation of why, in my estimation, this feature simply would not work for conda-forge.
In a non-trivial fraction of cases, even for pure-Python packages, the requirements in conda-forge have subtle differences from the upstream requirements. These changes include package renaming (e.g., …). These requirement differences will likely lead to some funky solves that either the conda or conda-forge developers will hear about.
Environment consistency is a tricky concept IMHO. For sure, with this CEP one can in some cases create an environment where all constraints are satisfied. However, if the repodata is wrong, due to the issues outlined above, the formal consistency of the requirements doesn't really matter.

Reproducibility is an even trickier concept. For an environment to be reproducible, one needs to have the same solver, with that solver run under the same conditions. Let's assume the conda version is fixed and the solver command is run on the same machine. Even then, the constraint of having the same conditions combined with this CEP in effect means that both the upstream wheel metadata and the conda channel metadata have to be the same. Given that the most likely source of wheels is PyPI, there is no way one can promise those same conditions.

Even if we restrict ourselves to environments built from lock files, the combination of the conda channel with PyPI as a source of wheels will also not always be reproducible. PyPI users can delete packages (as opposed to simply yanking packages), and those deletions will break even locked envs. We do not allow package deletions on conda-forge for this exact reason. Thus for conda-forge, we could not recommend using this feature for reproducible envs from lock files unless PyPI turns off the ability for users to delete files.
The vast majority of the cognitive burden is the differing and interacting repodata, not whether or not one types
For the reasons stated above, I am personally skeptical that injecting pure-Python wheel metadata into the repodata would consistently result in correct-enough environments to eliminate the need for new conda builds or the need to repackage pure-Python packages in conda-forge. I am not saying this doesn't work some of the time. Instead, I am saying that the solution proposed in this CEP is not so much better than the current "conda, then pip" solve that it can achieve the arguably difficult goals above. Stated another way, in a world where this feature existed instead of the feature to pip-install on top of conda environments, conda-forge likely would still need/want to repackage everything.

Other comments
This statement is a red flag for me on this CEP. First, conda-forge itself doesn't have an authoritative mapping of its own packages back to Python wheels. There are several approaches in the wild, and none of them is standardized into automated repodata that tools like conda/mamba/pixi/rattler can read and interact with. See the discussions of PURLs. Second, treating conda-forge as a special channel in a CEP (as opposed to simply using it as a motivating use case) is, IMHO, the antithesis of what a CEP is supposed to be. conda is a set of tools and standards and should not be singling out any one purveyor of conda packages.
I don't follow this comment. Can you clarify? For sure I have used this operator in the run section before (see, e.g., https://github.com/conda-forge/ngmix-feedstock/blob/main/recipe/meta.yaml#L30). At minimum, any requirement that is
Here is one other point that I think is worth considering. One way I can imagine conda-forge using this feature is through only injecting items into … This procedure has some advantages for conda-forge that might actually be worth considering more generally. These are …
One issue I have left unaddressed here is testing new repodata entries before they are added. We'd want to build at least one test environment and ensure the package, at minimum, imports before we pushed it out to the world. One thing I am noticing is that, as we add on these additional requirements and desires, it seems almost simpler for conda-forge to use its existing feedstock infrastructure. We'd likely have to build a new "staged-wheels" system to manage this kind of process.
Hi @beckermr, thanks so much for your time reviewing and responding to this draft, I really appreciate it! I would like to use your feedback to strengthen the draft.

Version exclusions
Thanks so much for pointing this out. My updated understanding is that … I'll update the CEP to make sure that is clear.

Mapping names centrally to conda-forge
Great points. We were probably trying too hard to make the community approach the standard, but as you point out, it isn't currently standardized. I think having the wheel index own the mapping is still the right approach; what if we have the index declare which channel it is mapping names to? For example, we could add an optional field called … What do you think about this idea?

PyPI can delete packages
Another great point that we could address in the CEP! There has been discussion from the PyPI community over the last year about standardizing around the deletion policy. For example there was a withdrawn PEP 763 and the Discuss Python.org topic about it. The consensus came down to:
This would be a tradeoff of directly using PyPI packages. Users would get access to thousands of packages with no extra hosting requirements, but they would also be subject to how PyPI currently works. Someone using a PyPI wheel repodata would have to decide if that is a good tradeoff for them. However, the lock file formats (rattler-lock-v6 and conda-lock-v1) already support a hybrid ecosystem with PyPI sections in the lockfiles. If someone wants to use a wheels channel directly from PyPI, it isn't better or worse than we have right now for reproducibility. In fact, channels that mirror/store wheels (as you suggest below) would actually improve reproducibility compared to the current "conda then pip" workflow.
As you point out, there are workflows where we could make the system more reproducible than we have now. Do you think that the CEP should recommend that production channels mirror wheels to ensure reproducibility, rather than relying directly on PyPI URLs?

Downstream patching ability for conda-forge (and other ecosystems)
I would love to help find the right solution to this! Thanks again for the really valuable perspective. I think there are two complementary approaches we should take:
You're absolutely right that there is a burden caused by conflicting metadata. Repodata patching is how we could address this: channels can correct metadata conflicts so users don't encounter them. However, I think the current client workflows are also a burden. Giving users seamless access to thousands of packages without requiring them to know whether they're from PyPI or conda channels would solve a huge pain point.
You're right to be skeptical about fully eliminating the need for feedstocks. This CEP won't replace conda-forge's packaging infrastructure: metadata differences mean many packages will still need proper conda recipes. This is about handling the simpler cases more efficiently. Think of it as an additional tool for easier pure-Python packages, not a replacement for feedstocks. I'll update the CEP to better capture this view.

Implementation plan for a wheel channel
I am really liking your thoughts on how we could implement this, nice! I am not sure if the implementation plan should be part of this CEP or not, so I would be grateful for everyone's thoughts on that. However, I really like where you are going with this plan, and I also envision some sort of phased approach. We could start with fully manual curation, but then move toward semi-automated as we learn from the manual process. This balances a lower barrier (no recipes needed for many packages) with quality control. Complex packages should still use feedstocks, but this handles the simpler pure-Python case more efficiently. It would be amazing to use a conda-forge wheel channel as a test case for this if the community is interested. Thanks again for all of the extremely valuable input, I'm looking forward to hearing more of your thoughts as we continue to refine this draft.
Not commenting on the whole CEP (I share many of @beckermr's reservations, but currently don't have a strong opinion), just one aspect that's important to get right IMO:
I think it would be a good idea to consider whether this can build on top of the proposed PEP 804 (discourse), which tries to standardize something usable around the whole name-mapping issue. It'd certainly be better if we can build on top of that (and help support that PEP) rather than inventing yet another scheme.
Updates include:

- Clarifications on naming standards
- Channel mapping
- Patching capabilities for dependency management
Hi @beckermr, I made updates to the relevant sections to fix version exclusions, remove reliance on conda-forge for channel mapping, clarify that we need downstream patching ability, add an implementation options section, and add recommendations for protecting against PyPI packages being deleted. Thanks again for all of the great feedback.

Hi @h-vetinari, thanks, that's a great point that we should build on this! I added a callout to PEP 804 in the "Naming standard and channel mapping" section.
JeanChristopheMorinPerso left a comment:
I left mostly questions.
> This CEP has the following known limitations:
>
> 1. **Pure Python only:** This CEP explicitly does not address wheels with binary extensions, which require platform-specific compatibility guarantees beyond the current scope. Conda’s strength is binary compatibility, so using conda packages may be the optimal solution.
> 2. **Environment markers:** Only Python version markers are converted to dependencies. Other environment markers (OS, platform, etc.) are ignored based on the pure Python assumption.
Won't this create problems for packages that actually use OS/platform markers to only depend on a package for one platform? How would the channel operator deal with that? For example, let's look at this example:

- package A: depends on `H; platform == linux`
- package H: only has a wheel for Linux

Now, if a user were to try to install A on Windows, they would get a solver error. This seems wrong if A would still work completely fine on Windows without H.
Repodata patching is possible, but in the case I just showed, a channel operator would likely be forced to publish a bad package first and then patch the repodata entry, if the repodata is generated on the server side purely based on the wheel metadata (like anaconda.org or conda-index does with conda packages, for example).
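For what it's worth, the behavior at issue can be checked with the `packaging` library, which an indexer would need to evaluate per target platform. The dependency here stands in for the hypothetical A→H example above (note the real marker syntax is `platform_system`, not the `platform == linux` shorthand):

```python
from packaging.markers import Marker

# Hypothetical package A declares a dependency on H that only
# applies on Linux.
marker = Marker('platform_system == "Linux"')

# Evaluating per target platform shows the dependency applies on
# Linux but not on Windows, so converting it into an unconditional
# "depends" entry would wrongly constrain Windows installs.
for system in ("Linux", "Windows"):
    print(system, marker.evaluate({"platform_system": system}))
```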
Hi @JeanChristopheMorinPerso, since conditional dependencies are covered by a separate CEP, what if we said that channel operators MAY patch and that they SHOULD support further conditional dependencies as they become available?
I think it's important that we describe how to convert those markers when conditional dependencies do exist, because not all markers are trivial to convert (e.g. `platform_machine`, `python_full_version` vs `python_version`). I think that should be in this CEP, but I can also imagine that a follow-up CEP could work.
> - `!=X.Y.Z` → Add to `constrains` field
> - **Multiple specifiers:** Combine with commas (e.g., `>=1.0,<2.0`)
> - **Python version requirements:** Convert `Requires-Python` to an explicit `python` dependency
> - **Environment markers:** Ignore markers other than Python version (pure Python assumption)
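As a rough sketch of the conversion rules quoted above, using the `packaging` library (the function name and output shape are my own; this is not the CEP's normative algorithm):

```python
from packaging.requirements import Requirement

def convert_requirement(req_str: str) -> tuple[list[str], list[str]]:
    """Sketch: route != exclusions to the `constrains` field and all
    other specifiers to `depends`, per the rules quoted above."""
    req = Requirement(req_str)
    depends, constrains = [], []
    for spec in req.specifier:
        target = constrains if spec.operator == "!=" else depends
        target.append(f"{req.name} {spec.operator}{spec.version}")
    # Sort for deterministic output; SpecifierSet order is unordered.
    return sorted(depends), sorted(constrains)

deps, cons = convert_requirement("somepkg>=1.0,<2.0,!=1.5.0")
print(deps)  # ['somepkg <2.0', 'somepkg >=1.0']
print(cons)  # ['somepkg !=1.5.0']
```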
Pure-Python packages can still depend on platform-specific packages, e.g. `tzdata; platform_system == "Windows"`.
Also, the Python implementation may be important (PyPy vs. CPython, etc.).
> - A shared relative or absolute `base_url` with all wheels in the same directory, by populating the `base_url` field and leaving the `artifact_url` field empty.
> - A manual PyPI repository with wheels in directories by the package name, by populating the absolute URL in the `artifact_url` field, or the `base_url` and a relative path in the `artifact_url` field.
> - External PyPI mirrors or CDNs using absolute URLs by populating the `artifact_url` field, for example to <https://files.pythonhosted.org/packages/.../package-1.0.0-py3-none-any.whl>
> - Mixed sources within the same repodata file
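Client-side resolution of these layout options might look like the following sketch; the precedence order (absolute `artifact_url` wins, then relative `artifact_url` against `base_url`, then filename against `base_url`) is my reading of the intent, not specified wording:

```python
from urllib.parse import urljoin

def resolve_download_url(channel_url: str, record: dict) -> str:
    """Sketch of resolving a record's download URL: an absolute
    artifact_url wins; a relative artifact_url joins onto base_url
    (or the channel URL); without artifact_url, the filename joins
    onto base_url as today. Record keys are hypothetical."""
    base = record.get("base_url") or channel_url
    if not base.endswith("/"):
        base += "/"
    artifact = record.get("artifact_url")
    if artifact:
        # urljoin leaves absolute URLs untouched and resolves
        # relative paths against the base.
        return urljoin(base, artifact)
    return urljoin(base, record["fn"])

# Absolute artifact_url (external mirror/CDN case):
print(resolve_download_url(
    "https://example.org/channel/noarch",
    {"artifact_url": "https://files.pythonhosted.org/packages/a/b/pkg-1.0-py3-none-any.whl"},
))
```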
How is the `artifact_url` populated for conda packages? Should the `index.json` file include that? Similarly for wheels. How are they indexed?
Hey @baszalmstra, I was thinking that if we want `artifact_url` support for conda packages, we should make it a separate CEP (it sounds like a good idea to me, and I can draft one). I don't think `index.json` currently contains URLs, so I don't see why we would add them for this new field.
Makes sense to do that in a separate CEP indeed. The information missing is how this field should be populated by conda-index/rattler-index.
The final URL cannot be part of index.json because the package doesn't "know" where it will be served from (same as sha256, it cannot predict its own hash). This information is only available to the indexing tool.
jaimergp left a comment:
I have read the CEP and added a few comments. I am supportive of having better PyPI support in the conda ecosystem so users don't have to try their luck with multiple overlapping tools, but I have reservations with the scope and direction of this proposal.
First, it doesn't tell the whole story. The current CEP focuses on pre-processing wheel metadata in a friendly way for the conda solver. It doesn't go into the details and nuance of name mapping (with the complications it brings!), or what to do with the solved records once a solution is found: how to install the wheel, how to cache it, how to populate its `conda-meta/*.json` metadata. Ideally, the CEP would cover the whole story, or at least contextualize where it sits in the pipeline and which other CEPs to consult to get the full story.
I have reservations about the metadata-only approach too, and I think this needs to be better discussed in the Rejected ideas section. Why this approach is more desirable than the others should go in a Rationale section. IIUC, this is because:
- Mirroring wheels is expensive and maybe unnecessary
- Metadata patching is desirable
- Easier to implement from the solver side of things
Also (unrelated, but worth exploring), if we do go this way, the proposal doesn't need to limit the strategy to wheels only. If we wanted to offer an agnostic field, like `packages.*`, where any package format could be allowed, it wouldn't take much more work:
- `url` shows where to download the artifact from
- `fn` informs of the format, and the client would know how to deal with it
- `depends` and others need to conform to conda conventions and name mappings
(This is to say that a big part of the specification right now doesn't seem to be very wheel-specific).
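The format-agnostic idea above could be sketched as a simple dispatch on the record's filename; the handler names and the dispatch itself are purely illustrative, not a proposed API:

```python
def handler_for(record: dict) -> str:
    """Illustrative dispatch: the `fn` extension identifies the
    artifact format, so one record shape could carry conda packages,
    wheels, or future formats."""
    fn = record["fn"]
    if fn.endswith((".conda", ".tar.bz2")):
        return "conda-installer"
    if fn.endswith(".whl"):
        return "wheel-installer"
    raise ValueError(f"no handler registered for {fn}")

print(handler_for({"fn": "numpy-2.0.0-py312_0.conda"}))
print(handler_for({"fn": "pkg-1.0-py3-none-any.whl"}))
```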
That aside, there are some editorial changes that would need to be made, but let's get there once we have agreed on a direction.
> ### Add more conda packages
>
> Create and maintain new conda packages for each PyPI dependency needed. Tools like [Grayskull] exist to make this conversion easier. However, this is a significant workload for the community, with over half of all conda-forge packages being pure Python. Even with more dedicated resources, creating recipes for over 400 thousand pure Python packages is not achievable.
I could see a conda-forge/wheels-index repository where the automation pipelines are maintained and folks add their requests to:
- Add a project to the watchlist so it is included in the index
- Yank certain wheels
- Repodata patch
- Import and archive pure wheel feedstocks
- etc
I don't know if the resulting repodata would be added to conda-forge proper, or maybe a separate conda-forge-wheels channel (to keep the production channel lighter).
Updated version to replace #144, developed with @travishathaway.
This CEP outlines how native support for pure Python wheel packages could be achieved by adding support for them in repodata. When implemented, conda clients will be able to seamlessly install conda packages and pure Python wheels from enabled channels.
Checklist for submitter

- A file based on `cep-0000.md`, named `cep-XXXX.md`, in the root level.

Checklist for CEP approvals

- The CEP number is `${greatest-number-in-main} + 1`.
- The `cep-XXXX.md` file has been renamed accordingly.
- The `# CEP XXXX -` header has been edited accordingly.
- `pre-commit` checks are passing.