Adding L1-L4 design to RFC-0050#93
Conversation
|
|
||
| #### L4: Always-On PR Checks | ||
|
|
||
| `L4` uses the same callback and synchronization behavior as `L3`, but it does not wait for a label. The `L4` scenario is effectively the same as `L3` Scenario 1 except that the downstream result is represented in upstream PR Checks by default, without requiring explicit labeling. |
There was a problem hiding this comment.
Was there a plan to allow out of tree testing to block PRs? It seems like we'd need a very tight agreement on availability if out of tree tests can halt merging on pytorch/pytorch.
There was a problem hiding this comment.
Hi @groenenboomj, we've discussed this in detail here: pytorch/pytorch#175022 (comment). For L4 level repos, they have the capabilities to gate PyTorch PR, but there will be a very high standard to achieve L4 (and the PyTorch core team remains the right to choose whether a repo is L4 even though it meets all the requirements).
There was a problem hiding this comment.
Okay, I see that was added in the last level breakdown expansion in March. Has there been more discussion on the requirements for gating?
There was a problem hiding this comment.
The main requirement is going to be this one:
Very few accelerators would reach this level. One only becomes L4 if pytorch maintainers are highly invested in it and find it worthwhile to add the signal to every PR.
It's intentionally left somewhat fuzzy, since it's a balance of "how much value will this provide vs how much pain does this inflict" since the idea itself is somewhat subjective. The automatic downgrade will need a firm reliability metric, but even that will likely need fine tuning based on experimentation, so hard numbers here in the short term are likely to be unreliable.
Do you have concerns about the approach @groenenboomj ?
There was a problem hiding this comment.
Okay, I see that was added in the last level breakdown expansion in March. Has there been more discussion on the requirements for gating?
Hey @groenenboomj, based on previous discussions with @albanD and @ZainRizvi, most accelerators will fall into L3. It’s difficult to set a well-defined standard for L4 because it is effectively equivalent to being in-tree, just with a different code location.
To me, a key requirement for L4 is broad community usage and official investment from core maintainers. Testing and feature coverage are necessary, but they aren't the only deciding factors.
There was a problem hiding this comment.
Less major concerns other than making sure we have enough formality documented so it's rigorous enough to not lead to as much debate when advancing through the levels. With that addressed it should be fine.
There was a problem hiding this comment.
Absolutely, I'm with you on that.
In my view, the CRCR consists of four levels. The formality of L1 and L2 should be clear and well-defined; I don't think many people would disagree with that. L3 isn't much of a concern either—we can progressively flesh it out and make it more concrete with input from @albanD and @ZainRizvi.
Regarding L4, I don't believe it's an urgent issue that requires immediate resolution. Since it is considered equivalent to "in-tree" accelerators like CUDA, and most accelerators won't be able to reach this level in the short term, we can afford to de-prioritize it for now.
Sure! I will discuss this with @jewelkm89 ASAP and update it to the RFC as well. |
There was a problem hiding this comment.
Hey @can-gaa-hou, the LF AI & Data Foundation is wrong, could you please correct this one in passing, it should be LF Pytorch Foundation
There was a problem hiding this comment.
Ok, let me make this change
| ``` | ||
|
|
||
| ## HUD Integration | ||
|
|
There was a problem hiding this comment.
@jewelkm89, as mentioned in https://docs.google.com/document/d/1w10Agz_YkSA1IDurobWjuNoYo4u9aykprjuymDFE6MU/edit?tab=t.5c09vnsaacwx#heading=h.t7zh9oz5xu0s, you can add the HUD section here and paste the Google Doc link for reference.
|
@fffrog Do we have signup instructions documented anywhere? E.g. If someone wants to join the L1 tier, they need to join the allowlist and add the github app to their repo. How does one find the link for the github app? |
@ZainRizvi, I added an instruction a few days ago in the PyTorch documents. Please refer to it. https://github.com/pytorch/pytorch/blob/main/docs/source/accelerator/ci.md |
Hi @ZainRizvi, Sure thing. We've added a dedicated documentation page for this. It's currently a preliminary version and will be updated as the CRCR move forward. As mentioned in the previous comments by @can-gaa-hou, the link is https://github.com/pytorch/pytorch/blob/main/docs/source/accelerator/ci.md. However, the workflow seems to be consistently failing due to timeouts. As a result, the rendered pages aren't being generated correctly, so users can currently only access the raw Markdown file rather than the properly formatted web pages. Could you please take a look at this workflow? @can-gaa-hou tried to fix it, but it seems to be beyond our scope/control |
Thanks for the pointer, will raise the problem internally. There was one bug causing that workflow to fail that was fixed already, but there seems to be another one causing the deployment step to fail as well: https://github.com/pytorch/docs/actions/workflows/pages/pages-build-deployment |
As @ZainRizvi mentioned here, I am opening this PR for adding L1-L4 design in our RFC. This may help us better develop L2-L4 implementation.
cc @fffrog @ZainRizvi @albanD