[Handshake] Postponing the connection of memory operations to LSQ or Memory Controller #254
Replies: 15 comments
-
I very strongly support this modification because I believe […] It would be extremely useful to hear from @lucas-rami about (1) whether the nonconnected annotated […]
-
While I appreciate the effort in explaining this proposal in some detail, I am deeply against this change. I think it is based on multiple misconceptions of how one works with memory interfaces in Dynamatic. Not only that, but it fails to actually explain in a meaningful way how this would work were it to be implemented. Before I address the actual proposal, I want to put forward one design decision that I made in the past and against which I think there is a good argument to be made (perhaps this will also clear up some things about how memory ports work). I decided at some point that load/store ports to MCs and LSQs should be different MLIR operations (for loads, `handshake::MCLoadOp` and `handshake::LSQLoadOp`). […]

Now onto the actual proposal. I will first address the misconceptions in the problem statement one by one, then comment on the proposed solution, and finally give my conclusion as to why this is not a desirable change.
To make this clear, the decision is made here, with the amount of information available at this stage of the compilation, because I want to produce a circuit with a valid memory network straight away. There is no loss of information in the placement process; on the contrary, most of that information is now more clearly laid out in the CDFG as opposed to attributes. Absolutely nothing prevents you from going back on this decision later on once more information is available; it is in fact fairly easy to do so thanks to memory interface creation/replacement helpers I have introduced over time, which are already used from both the FPGA'18 and the ISFPT'19 passes.
Incorrect, […]

Not really. We made the design decision to do it here, but the analysis part is perfectly doable pre-Handshake. In legacy Dynamatic, this analysis was done at what we would call the CF level today.
I think this distinction is very artificial and, in the end, arbitrary. Memory interfaces are part of Handshake and are required in a functional dataflow circuit, so instantiating functional ones can (and should, imo) be considered part of the translation. These interfaces also change what your control network looks like; by not placing them, you will also find yourself with a half-baked control network at the end of (1). One may put forth that we also do not place buffers or forks/sinks in […]
This is a straight-up non sequitur. Having to rewrite the IR is not indicative of a bad design; it is the entire reason we have an IR in the first place and a toolchain to transform it (MLIR). Modern compilers are based on the principle of iterative refinement of the IR, where each refinement (i.e., compiler pass) is targeted at a specific aspect of the IR to analyze/optimize/transform. Pushing your argument to its logical limit, we should just go from C to RTL without an IR because we do not want to do something that we will have to go back on later. (As a side note, we never split LSQs into multiple LSQs. We may, however, split one into a master MC and a slave LSQ.)
This is much too vague to have a meaningful discussion about. What are the operands and results of those new "abstract ports"? How do they connect to the rest of the circuit before you actually place memory interfaces (where do load/store addresses go, where does loaded data come from)? Who produces the Handshake function's memory end control output before memory interfaces exist? MLIR will not let you exit a pass with an IR that has bits and pieces hanging around. If it looks like these new operations do not fit in Handshake and also do not fit in CF, as you seem to suggest, then it is perhaps because they do not actually represent anything meaningful at any IR level. In addition, dialects are meant to group operations/types/attributes which live at roughly the same abstraction level, so creating a new dialect for this couple of operations seems undesirable given that there is no clear definition of the kind of abstraction this dialect would represent.

As is abundantly clear, I am very much against this proposal and will remain so until someone can actually showcase a use case where this redesign would allow someone to do something significantly more easily than with the current design. As I have mentioned, the system is made so that one can replace memory interfaces whenever one wants, and I am sure this could be made even easier with some lightweight code refactoring. Furthermore, I believe the proposed change does not increase modularity in any useful way. The IR coming out of a hypothetical […]

Note that any argument of the kind "nobody should want to run a pass between […]

I hope that this makes clear the implementation effort that would be associated with this change and puts its usefulness in perspective, considering that you can basically do whatever you want today with the existing system. Also note the following. […]
tl;dr: Modulo much clearer evidence of actual limitations in the current design and a more concrete implementation plan for its replacement, I am not going to approve of such a contribution if given the power to. I will try to end on a more positive note and point out that, if you want to go ahead regardless of my insights, you can do so in a non-destructive way for the existing flow.
-
@lucas-rami: thanks for the extensive and timely answer. Most certainly, we do not understand much of the entrails of Dynamatic and we appreciate the insights--in fact, this issue was entirely designed to elicit your reaction.... :-) Before I move to a more concrete part of this comment, let me cite this: […]
Indeed, we briefly discussed this among ourselves but I think we ultimately swept it under the carpet. I think you are somewhat right there: whoever messes with the memory connectivity (LSQs and MCs) most likely needs to mess also with the so-called control network.... Now, let me try to summarize the issue as succinctly as I can, and please be patient with the many stupidities I may say along the way:
I think the simplest use case of what @rpirayadi wants to do may help here: he does not care who has created a circuit and how (FPGA'18 or FPL'22+FPGA'23 or anything) but, instead of inserting an LSQ, he wants to create a virtual 0-bit data dependency between the result of a predecessor (in program order) access and the next one, so that the accesses are kept in order and the LSQ becomes unnecessary. Of course, this comes at a great cost to performance--and ultimately @rpirayadi's goal will be to do better--but it is a perfect example of a later pass (later than […]

If we sort of agree, is ISFPT'19 an example of a pass @rpirayadi should get direct inspiration from for his pass? If we do not agree, what would be the best way for @rpirayadi to go about the kind of (many different) passes he would like to develop?
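Purely to make the 0-bit "ordering" dependency above concrete, something like the following sketch captures the idea. The op name `handshake::JoinOp`, the existence of a control-only completion result on the first access, and a spare control operand on the second access are all assumptions for illustration, not Dynamatic's actual interface.

```cpp
// Illustration only: force `next` to wait for `prev` by joining a
// control-only completion token of `prev` into a control input of `next`.
// Op name and port indices are assumptions made for this sketch.
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"

void addOrderingEdge(mlir::OpBuilder &builder, mlir::Operation *prev,
                     mlir::Operation *next, unsigned doneResultIdx,
                     unsigned ctrlOperandIdx) {
  builder.setInsertionPoint(next);
  mlir::Value done = prev->getResult(doneResultIdx);   // 0-bit completion token
  mlir::Value ctrl = next->getOperand(ctrlOperandIdx); // existing control input
  // Merge the two tokens so `next` cannot fire before `prev` has completed.
  auto join = builder.create<handshake::JoinOp>(
      next->getLoc(), mlir::ValueRange{ctrl, done});
  next->setOperand(ctrlOperandIdx, join.getResult());
}
```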
-
Thanks for the quick reply and for clarifying your use case, it helps me see what the intended goal of the change was :) I think we generally agree!
Exactly. This is a largely mechanical change to implement, and it will especially have a positive impact on anyone wanting to change memory interfaces after some have already been placed. It would most likely take me 3 to 4 hours to implement, and I will do my best to schedule that at some point in the near future, before much of that implementation effort has started.
Yes, that is basically what I was trying to point out. I think the existing system is very workable for the needs you have (if I understand correctly), and changing it carries yet-undefined risks that I am not sure outweigh the benefits, which, as I have pointed out, I am not convinced of either.
Yes, replacing the memory interfaces is a matter of calling a few functions and there are clear examples of this in both the FPGA'18 and ISFPT'19 passes. I am sure there are opportunities for lightly refactoring the existing code that would make this even easier (e.g., some reusable parts of the interface replacement may still be private to a pass and so would need to be exposed publicly), and I can provide some guidance if needed at some point.
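Roughly, the shape this takes is the following sketch; the Dynamatic-specific names in it (`MemoryInterfaceBuilder`, `addMCPort`, `instantiateInterfaces`) are placeholders standing in for whatever the real helpers are called, and only the plain `Operation`/`OpBuilder` calls are core MLIR API.

```cpp
// Sketch of the general shape of "replace a memory interface": collect the
// access ports, rebuild the interface with a helper, reroute and erase.
// The helper type and its methods are placeholder names, not the real API.
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/ArrayRef.h"

void replaceInterface(mlir::OpBuilder &builder, mlir::Operation *oldIface,
                      llvm::ArrayRef<mlir::Operation *> accessPorts) {
  builder.setInsertionPoint(oldIface);

  // 1. Re-register every access port with a builder for the new interface
  //    (hypothetical helper); this is where MC vs. LSQ is decided per port.
  MemoryInterfaceBuilder ifaceBuilder(oldIface->getOperand(0) /*memref, assumed*/);
  for (mlir::Operation *port : accessPorts)
    ifaceBuilder.addMCPort(port);

  // 2. Materialize the new interface op next to the old one.
  mlir::Operation *newIface = ifaceBuilder.instantiateInterfaces(builder);

  // 3. Reroute users of the old interface's results (assuming matching
  //    result counts), then erase it.
  oldIface->replaceAllUsesWith(newIface);
  oldIface->erase();
}
```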
Yes, when taken together with the […]
These […]

To conclude, the ISFPT'19 pass and the memory replacement pass are the best places to look for inspiration on this. These are relatively short passes as well, so they should not be too hard to parse. I hope that clarifies my thoughts on all of this a bit more. Don't hesitate to let me know in case of questions.
-
@lucas-rami, thanks for the comments and the confirmation that we are essentially on the right track to understanding things.
Maybe we should wait for other stakeholders (@lana555 in particular) to approve, but from my side I would be grateful if you could enact this change as early as possible. I am sure that @rpirayadi would be happy to help, but I guess it is more efficient if you do it yourself--your choice. In hindsight, I have the feeling that we have been led into a misunderstanding by the […]

It seems to me that the only difference between @rpirayadi's suggestions and what you say is already there is minor: we had in mind a flow […]

whereas what is already implemented is a flow […]

The moment there is a […]

@lucas-rami: Please do tell me if I am missing the point somehow or if my representation of the situation is too simplistic in any way.

@rpirayadi: Do you agree? Can you start looking into […]

@lana555: Do you agree with all this? Do you agree on the removal of the […]
-
Yes, it makes sense to remove `handshake::MCLoadOp`/`handshake::LSQLoadOp`, as long as it is changed consistently across the flow (I suppose there are some minor changes to do in several passes down to RTL generation...).
-
Sorry for the pedantic question, but I am assuming @lucas-rami has offered to take care of this change in all of the released code base (I am not sure how to say this properly--I guess I mean whatever is in […]
-
I read most of the thread and I want to give another motivation for why we should unify `LSQLoadOp` and `MCLoadOp`: currently, `MCLoad` has a transparent buffer slot on the memory -> circuit path but `LSQLoad` does not; this implementation has a deadlock risk (documented in #136).
-
Thanks @Jiahui17! Could we agree that future memory operations will have the absolute minimum number of embedded buffer slots that makes sense functionally (perhaps zero) and that the responsibility for avoiding deadlocks lies somewhere else (e.g., automatically handled in the buffering pass, or in an appropriate pass right before implementation, or in any other place where this buffer is made explicit)? I think it would make for a cleaner design....
-
@paolo-ienne I agree. In particular, moving the buffer into the memory controller may make sense here. It is up to the memory controller to figure out how to safely share the load port (here I mean the memory controller -> BRAM port), and it is not the circuit's job to put buffers here and there to avoid deadlocks.
-
First, I want to thank @lucas-rami for his thoughtful responses. I also want to let you know that I am ready to help with the changes for unifying […]

Second, in answer to @paolo-ienne, I looked again at […]

There is still something unclear to me which I think may be irrelevant to the rest of the discussion, but I'll just mention it briefly; it is what raised this discussion in the first place. I need to run Fast Token Delivery (FPGA'22) after doing my changes. However, the current version @pcineverdies is implementing is implemented in the […] But, as @lucas-rami suggests, breaking this would result in a meaningless intermediate result. It's still unclear to me how we can reuse the functionality of Fast Token Delivery. I would appreciate any comments regarding this concern.
-
I would say that #177 is the right place to have a discussion around Fast Token Delivery! (in terms of requirements, necessities, issues...)
-
That sounds good to me!
Yes, I will merge this on […]
Thanks for bringing this up, this is super relevant and I had semi-forgotten about it. When I make this refactoring, we will end up with a single timing model for load ports and a single one for store ports, at least temporarily, which obviously doesn't align with our two different RTL implementations for each port type. Is this a good time to also unify these ports? Perhaps by making the implementation with a buffer inside the single one used for both MCs and LSQs (at least until someone figures out a cleaner solution, like putting the buffers in the memory interfaces themselves)?
Thanks for providing more context, it helps me understand what the end goal is. I agree with @pcineverdies that the bulk of this conversation should probably go on #177, but just to provide some insight now that I have a clearer idea of what you want to do: you don't have to break down […]
-
@lucas-rami: Thanks!
-
Apologies for the long comment, but I spent a little time trying to figure out the right way to go about all this (specifically, this issue and the related #185). I am not sure I have the conclusive answer, but here is my view. This is the way I look at the most generic and clean flow, ignoring for the moment the production of […]
In substance, I see (1) the completion output of stores as compulsory and not optional. Also, I believe (2) the correct implementation of a memory system is a combination of circuitry (deliberate or incidental; possibly using the completion signals of memory operations) and of correctly connected memory components. The job of […]

I guess we can discuss […]

Orthogonally to all of the above, I would like to add the following: the FPT'19 pass essentially analyzes the topology of the circuit to discover if memory accesses are intrinsically sequentialized; if they are, potential dependency annotations can be removed (at least this is how I understand it works). If passes by @rpirayadi create circuitry that explicitly sequentializes some memory accesses, a run of the FPT'19 pass (or of some conceptually small generalization of it) should be able to remove those dependencies that @rpirayadi was trying to take care of. This seems both elegant to me (i.e., the FPT'19 pass is simply a pass that removes unnecessary dependency annotations, whatever the reason they are unnecessary) and sound (i.e., one pass constructs a circuit with the intent of removing dependencies, but a different analysis pass decides if they are actually removable). I hope I make sense.

(I have a side comment to all this, in case we finally agree that stores need to have a completion signal: should not the load operation have this completion signal too? Yes, I understand that it is exactly a simple dataless fork of the load output, but would it not be more elegant to design the load with the very same signal inside instead of outside? We can check this in detail, but I am ready to bet that a dataless fork inside the load connected to a sink outside of it would be completely cancelled by logic synthesis.)
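For illustration, the topology analysis described above boils down to a reachability question over the dataflow graph. A minimal sketch of that idea (my own reading of it, not the actual FPT'19 implementation, and ignoring buffers holding initial tokens and other corner cases) could look like this:

```cpp
// Sketch: access B is already sequentialized after access A if one of B's
// inputs transitively depends on one of A's outputs, since B then cannot
// fire before A has produced a token. A dependency annotation A -> B would
// be redundant in that case. This is simplified reachability over def-use
// chains; a real analysis must also reason about buffers, cycles, etc.
#include "mlir/IR/Operation.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"

static bool reaches(mlir::Operation *from, mlir::Operation *to) {
  llvm::DenseSet<mlir::Operation *> visited;
  llvm::SmallVector<mlir::Operation *> worklist{from};
  while (!worklist.empty()) {
    mlir::Operation *op = worklist.pop_back_val();
    if (op == to)
      return true;
    if (!visited.insert(op).second)
      continue;
    // Follow all SSA users of this operation's results.
    for (mlir::Operation *user : op->getUsers())
      worklist.push_back(user);
  }
  return false;
}
```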
-
We believe there is an issue in how the memory operations are currently handled in Dynamatic. This issue is discussed in the following section. A possible solution is proposed afterward, and finally, the impact this change has on Dynamatic is discussed.
Problem definition
Firstly, keep in mind that the issue we would like to highlight exists for both types of memory operations. However, for simplicity, we focus on load operations in the description. Currently, in the `CfToHandshake` pass, memory operations are converted from `memref::LoadOp` to either `handshake::MCLoadOp` or `handshake::LSQLoadOp`. As a result, the decision to choose between the memory controller and the LSQ must be made in this pass. Moreover, `handshake::LSQLoadOp` has a port named `LSQLoadPort` which needs to be connected to an LSQ, which forces us to instantiate the LSQ at this point. This process combines two logically distinct actions: (1) translating from CF to the Handshake representation and (2) connecting the memory operations to the chosen memory structure. We believe the two actions (i.e., translation and connection) should ideally be separated into two different passes to maintain a clean and modular design.

One of the manifestations of this issue is the current implementation of Shrink it or Shred it! (ICFPT'19). This transformation is implemented at the Handshake level because it requires the Handshake structure. However, this pass also modifies memory connectivity (namely by removing LSQs or splitting a single LSQ into multiple LSQs). This highlights that these connections should ideally not have been established in the earlier transformation.
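As a purely illustrative sketch of this coupling (not the actual lowering code; the helpers `needsLSQ`, `connectToMC`, `connectToLSQ` and the operand/result handling are assumptions), the conversion pattern is forced to both pick the interface and wire the port to it in one step:

```cpp
// Illustration of the coupling: rewriting memref::LoadOp requires choosing
// MC vs. LSQ and connecting the resulting port right away. Helper functions
// are hypothetical and operand/result lists are simplified for this sketch.
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Transforms/DialectConversion.h"

struct ConvertLoad : public mlir::OpConversionPattern<mlir::memref::LoadOp> {
  using OpConversionPattern::OpConversionPattern;

  mlir::LogicalResult
  matchAndRewrite(mlir::memref::LoadOp loadOp, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    if (needsLSQ(loadOp)) {
      // The LSQ must already exist (or be created here) so that the port's
      // LSQ-facing operands/results have something to connect to.
      auto port = rewriter.replaceOpWithNewOp<handshake::LSQLoadOp>(
          loadOp, adaptor.getOperands());
      connectToLSQ(port);
    } else {
      auto port = rewriter.replaceOpWithNewOp<handshake::MCLoadOp>(
          loadOp, adaptor.getOperands());
      connectToMC(port);
    }
    return mlir::success();
  }
};
```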
In conclusion, it appears that the decision on how to connect memory operations to the memory controller or the LSQ is made prematurely. Thus, it makes more sense to postpone this connection until our knowledge is complete.
Possible solution
A possible solution is to introduce a new type of load (and also store) operation, possibly named `handshake::GeneralLoadOp`, which would eventually be translated into either `handshake::MCLoadOp` or `handshake::LSQLoadOp` based on the annotated information. Another approach could be to introduce an entirely new dialect between CF and Handshake which holds the general operation. This would solve the inconsistency of having two types of Handshake representation (i.e., one before the final translation and one after).
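To make the annotation idea concrete, here is a minimal sketch (our own illustration, not an existing implementation): a pass that decides memory connectivity would simply tag the hypothetical `handshake::GeneralLoadOp` with its target interface through an ordinary MLIR attribute; the attribute name `mem.interface` is made up for this example.

```cpp
// Sketch: record the connectivity decision as an attribute on the
// (hypothetical) general load/store op instead of rewiring ports now.
// The attribute name "mem.interface" is illustrative only.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

void markTargetInterface(mlir::Operation *generalMemOp, bool needsLSQ) {
  mlir::MLIRContext *ctx = generalMemOp->getContext();
  generalMemOp->setAttr("mem.interface",
                        mlir::StringAttr::get(ctx, needsLSQ ? "lsq" : "mc"));
}
```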
Impact

Implementing the proposed solution would have some impact. Every pass that modifies the connectivity of memory operations should now annotate all the required information in `handshake::GeneralLoadOp`, which would eventually be translated into `handshake::MCLoadOp` and `handshake::LSQLoadOp`. The advantage is that this transformation is done uniformly, regardless of the changes done by the previous passes. Previous passes can alter the result simply by modifying the annotation on the operation.
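As a sketch of what that uniform final translation could look like (again our own illustration with assumed op names, builders, and attribute name; the instantiation of the memory interface itself is elided behind comments):

```cpp
// Sketch: the late, uniform lowering reads the annotation left by earlier
// passes and materializes the concrete port op. GeneralLoadOp, MCLoadOp,
// LSQLoadOp builders and the "mem.interface" attribute are assumed names;
// operand/result wiring to the actual MC/LSQ is elided.
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"

void lowerGeneralLoad(mlir::OpBuilder &builder, mlir::Operation *generalLoad) {
  auto target = generalLoad->getAttrOfType<mlir::StringAttr>("mem.interface");
  builder.setInsertionPoint(generalLoad);

  mlir::Operation *concrete;
  if (target && target.getValue() == "lsq")
    concrete = builder.create<handshake::LSQLoadOp>(
        generalLoad->getLoc(), generalLoad->getOperands());
  else
    concrete = builder.create<handshake::MCLoadOp>(
        generalLoad->getLoc(), generalLoad->getOperands());

  // The MC/LSQ would be created or looked up here and the port's extra
  // operands/results wired to it; assuming matching result counts, the
  // general op can then be rerouted and erased.
  generalLoad->replaceAllUsesWith(concrete);
  generalLoad->erase();
}
```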
One of the affected passes is `--lower-cf-to-handshake`, since it needs to convert CF operations into `handshake::GeneralLoadOp`. To instantiate the LSQ we need dominance information, which in turn requires control-flow information, so the control-flow information is one of the things that should be preserved and carried through. Other required modifications are in `--handshake-analyze-lsq-usage` and `--handshake-replace-memory-interfaces`, which implement Shrink it or Shred it!. The modification in these two passes mainly consists of not connecting the load operation to different ports, but instead annotating inside the operation where it should be connected. Finally, we believe that the proposed solution leaves all the other passes unchanged.

Given this proposed approach and its potential impact, we would appreciate your thoughts and opinions on this solution.