[Handshake] Postponing the connection of memory operations to LSQ or Memory Controller #254
Replies: 15 comments
-
I very strongly support this modification because I believe […] It would be extremely useful to hear from @lucas-rami about (1) whether the nonconnected annotated […]
-
While I appreciate the effort in explaining this proposal in some detail, I am deeply against this change. I think it is based on multiple misconceptions of how one works with memory interfaces in Dynamatic. Not only that, but it fails to actually explain in a meaningful way how this would work were it to be implemented. Before I address the actual proposal, I want to put forward one design decision that I made in the past and against which I think there is a good argument to be made (perhaps this will also clear up some things about how memory ports work). I decided at some point that load/store ports to MCs and LSQs should be different MLIR operations (for loads, `handshake::MCLoadOp` and `handshake::LSQLoadOp`). […]

Now onto the actual proposal. I will first address the misconceptions in the problem statement one by one, then comment on the proposed solution, and finally give my conclusion as to why this is not a desirable change.
To make this clear, the decision is made here, with the amount of information available at this stage of the compilation, because I want to produce a circuit with a valid memory network straight away. There is no loss of information in the placement process; on the contrary, most of that information is now more clearly laid out in the CDFG as opposed to attributes. Absolutely nothing prevents you from going back on this decision later on once more information is available; it is in fact fairly easy to do so thanks to memory interface creation/replacement helpers I have introduced over time, which are already used from both the FPGA'18 and the ISFPT'19 passes.
Incorrect, […]

Not really. We made the design decision to do it here, but the analysis part is perfectly doable pre-Handshake. In legacy Dynamatic, this analysis was done at what we would call the CF level today.
I think this distinction is very artificial and, in the end, arbitrary. Memory interfaces are part of Handshake and are required in a functional dataflow circuit, so instantiating functional ones can (and should, imo) be considered part of the translation. These interfaces also change what your control network looks like; by not placing them, you will also find yourself with a half-baked control network at the end of (1). One may put forth that we also do not place buffers or forks/sinks in […]
This is a straight-up non sequitur. Having to rewrite the IR is not indicative of a bad design; it is the entire reason we have an IR in the first place and a toolchain to transform it (MLIR). Modern compilers are based on the principle of iterative refinement of the IR, where each refinement (i.e., compiler pass) is targeted at a specific aspect of the IR to analyze/optimize/transform. Pushing your argument to its logical limit, we should just go from C to RTL without an IR because we do not want to do something that we will have to go back on later. (As a side note, we never split LSQs into multiple LSQs. We may, however, split one into a master MC and a slave LSQ.)
This is much too vague to have a meaningful discussion about. What are the operands and results of those new "abstract ports"? How do they connect to the rest of the circuit before you actually place memory interfaces (where do load/store addresses go, where does loaded data come from)? Who produces the Handshake function's memory end control output before memory interfaces exist? MLIR will not let you exit a pass with an IR that has bits and pieces hanging around. If it looks like these new operations do not fit in Handshake and also do not fit in CF, as you seem to suggest, then it is perhaps because they do not actually represent anything meaningful at any IR level. In addition, dialects are meant to group operations/types/attributes which live at roughly the same abstraction level, so creating a new dialect for this couple of operations seems undesirable given that there is no clear definition of the kind of abstraction this dialect would represent.

As is abundantly clear, I am very much against this proposal and will remain so until someone can actually showcase a use case where this redesign would allow someone to do something significantly more easily than with the current design. As I have mentioned, the system is made so that one can replace memory interfaces whenever one wants, and I am sure this could be made even easier with some lightweight code refactoring. Furthermore, I believe the proposed change does not increase modularity in any useful way. The IR coming out of a hypothetical […]

Note that any argument of the kind "nobody should want to run a pass between […]

I hope that this makes clear the implementation effort that would be associated with this change and puts its usefulness in perspective, considering that you can basically do whatever you want today with the existing system. Also note the following. […]
tl;dr: Modulo much clearer evidence of actual limitations in the current design and a more concrete implementation plan for its replacement, I am not going to approve of such a contribution if given the power to. I will try to end on a more positive note and point out that, if you want to go ahead regardless of my insights, you can do so in a non-destructive way for the existing flow.
-
@lucas-rami: thanks for the extensive and timely answer. Most certainly, we do not understand much of the entrails of Dynamatic and we appreciate the insights--in fact, this issue was entirely designed to elicit your reaction.... :-) Before I move to a more concrete part of this comment, let me cite this: […]
Indeed, we briefly discussed this among ourselves but I think we ultimately swept it under the carpet. I think you are somewhat right there: whoever messes with the memory connectivity (LSQs and MCs) most likely needs to mess also with the so-called control network.... Now, let me try to summarize the issue as succinctly as I can, and please be patient with the many stupidities I may say along the way:
I think the simplest use case of what @rpirayadi wants to do may help here: he does not care who has created a circuit and how (FPGA'18 or FPL'22+FPGA'23 or anything) but, instead of inserting an LSQ, he wants to create a virtual 0-bit data dependency between the result of a predecessor (in program order) access and the next one, so that the accesses are kept in order and the LSQ becomes unnecessary. Of course, this comes at a great cost to performance--and ultimately @rpirayadi's goal will be to do better--but it is a perfect example of a later pass (later than […]

If we sort of agree, is ISFPT'19 an example of a pass @rpirayadi should get direct inspiration from for his pass? If we do not agree, what would be the best way for @rpirayadi to go about the kind of (many different) passes he would like to develop?
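Purely to make the 0-bit "ordering" dependency above concrete, something like the following sketch captures the idea. The op name `handshake::JoinOp`, the existence of a control-only completion result on the first access, and a spare control operand on the second access are all assumptions for illustration, not Dynamatic's actual interface.

```cpp
// Illustration only: force `next` to wait for `prev` by joining a
// control-only completion token of `prev` into a control input of `next`.
// Op name and port indices are assumptions made for this sketch.
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"

void addOrderingEdge(mlir::OpBuilder &builder, mlir::Operation *prev,
                     mlir::Operation *next, unsigned doneResultIdx,
                     unsigned ctrlOperandIdx) {
  builder.setInsertionPoint(next);
  mlir::Value done = prev->getResult(doneResultIdx);   // 0-bit completion token
  mlir::Value ctrl = next->getOperand(ctrlOperandIdx); // existing control input
  // Merge the two tokens so `next` cannot fire before `prev` has completed.
  auto join = builder.create<handshake::JoinOp>(
      next->getLoc(), mlir::ValueRange{ctrl, done});
  next->setOperand(ctrlOperandIdx, join.getResult());
}
```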
-
Thanks for the quick reply and for clarifying your use case, it helps me see what the intended goal of the change was :) I think we generally agree!
Exactly. This is a largely mechanical change to implement, and it will especially have a positive impact on anyone wanting to change memory interfaces after some have already been placed. It would most likely take me 3 to 4 hours to implement, and I will do my best to schedule that at some point in the near future, before much of that implementation effort has started.
Yes, that is basically what I was trying to point out. I think the existing system is very workable for the needs you have (if I understand correctly), and changing it carries yet-undefined risks that I am not sure outweigh the benefits, which, as I have pointed out, I am not convinced of either.
Yes, replacing the memory interfaces is a matter of calling a few functions and there are clear examples of this in both the FPGA'18 and ISFPT'19 passes. I am sure there are opportunities for lightly refactoring the existing code that would make this even easier (e.g., some reusable parts of the interface replacement may still be private to a pass and so would need to be exposed publicly), and I can provide some guidance if needed at some point.
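Roughly, the shape this takes is the following sketch; the Dynamatic-specific names in it (`MemoryInterfaceBuilder`, `addMCPort`, `instantiateInterfaces`) are placeholders standing in for whatever the real helpers are called, and only the plain `Operation`/`OpBuilder` calls are core MLIR API.

```cpp
// Sketch of the general shape of "replace a memory interface": collect the
// access ports, rebuild the interface with a helper, reroute and erase.
// The helper type and its methods are placeholder names, not the real API.
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/ArrayRef.h"

void replaceInterface(mlir::OpBuilder &builder, mlir::Operation *oldIface,
                      llvm::ArrayRef<mlir::Operation *> accessPorts) {
  builder.setInsertionPoint(oldIface);

  // 1. Re-register every access port with a builder for the new interface
  //    (hypothetical helper); this is where MC vs. LSQ is decided per port.
  MemoryInterfaceBuilder ifaceBuilder(oldIface->getOperand(0) /*memref, assumed*/);
  for (mlir::Operation *port : accessPorts)
    ifaceBuilder.addMCPort(port);

  // 2. Materialize the new interface op next to the old one.
  mlir::Operation *newIface = ifaceBuilder.instantiateInterfaces(builder);

  // 3. Reroute users of the old interface's results (assuming matching
  //    result counts), then erase it.
  oldIface->replaceAllUsesWith(newIface);
  oldIface->erase();
}
```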
Yes, when taken together with the […]
These […]

To conclude, the ISFPT'19 pass and the memory replacement pass are the best places to look for inspiration on this. These are relatively short passes as well, so they should not be too hard to parse. I hope that clarifies my thoughts on all of this a bit more. Don't hesitate to let me know in case of questions.
-
@lucas-rami, thanks for the comments and the confirmation that we are essentially on the right track to understanding things.
Maybe we should wait for other stakeholders (@lana555 in particular) to approve, but from my side I would be grateful if you could enact this change as early as possible. I am sure that @rpirayadi would be happy to help, but I guess it is more efficient if you do it yourself--your choice. In hindsight, I have the feeling that we have been led into a misunderstanding by the […]

It seems to me that the only difference between @rpirayadi's suggestions and what you say is already there is minor: we had in mind a flow […]

whereas what is already implemented is a flow […]

The moment there is a […]

@lucas-rami: Please do tell me if I am missing the point somehow or if my representation of the situation is too simplistic in any way.

@rpirayadi: Do you agree? Can you start looking into […]

@lana555: Do you agree with all this? Do you agree on the removal of the […]
-
Yes, it makes sense to remove `handshake::MCLoadOp`/`handshake::LSQLoadOp`, as long as it is changed consistently across the flow (I suppose there are some minor changes to do in several passes down to RTL generation...).
-
Sorry for the pedantic question, but I am assuming @lucas-rami has offered to take care of this change in all of the released code base (I am not sure how to say this properly--I guess I mean whatever is in […]
-
I read most of the thread and I want to give another motivation for why we should unify `LSQLoadOp` and `MCLoadOp`: currently, `MCLoad` has a transparent buffer slot on the memory -> circuit path but `LSQLoad` does not; this implementation has a deadlock risk (documented in #136).
-
Thanks @Jiahui17! Could we agree that future memory operations will have the absolute minimum number of embedded buffer slots that makes sense functionally (perhaps zero) and that the responsibility for avoiding deadlocks lies somewhere else (e.g., automatically handled in the buffering pass, or in an appropriate pass right before implementation, or in any other place where this buffer is made explicit)? I think it would make for a cleaner design....
-
@paolo-ienne I agree. In particular, moving the buffer into the memory controller may make sense here. It is up to the memory controller to figure out how to safely share the load port (here I mean the memory controller -> BRAM port), and it is not the circuit's job to put buffers here and there to avoid deadlocks.
-
First, I want to thank @lucas-rami for his thoughtful responses. I also want to let you know that I am ready to help with the changes for unifying […]

Second, in answer to @paolo-ienne, I looked again at […]

There is still something unclear to me which I think may be irrelevant to the rest of the discussion, but I'll just mention it briefly; it is what raised this discussion in the first place. I need to run Fast Token Delivery (FPGA'22) after doing my changes. However, the current version @pcineverdies is implementing is implemented in the […] But, as @lucas-rami suggests, breaking this would result in a meaningless intermediate result. It's still unclear to me how we can reuse the functionality of Fast Token Delivery. I would appreciate any comments regarding this concern.
-
I would say that #177 is the right place to have a discussion around Fast Token Delivery! (in terms of requirements, necessities, issues...)
-
That sounds good to me!
Yes, I will merge this on […]
Thanks for bringing this up, this is super relevant and I had semi-forgotten about it. When I make this refactoring, we will end up with a single timing model for load ports and a single one for store ports, at least temporarily, which obviously doesn't align with our two different RTL implementations for each port type. Is this a good time to also unify these ports? Perhaps by making the implementation with a buffer inside the single one used for both MCs and LSQs (at least until someone figures out a cleaner solution, like putting the buffers in the memory interfaces themselves)?
Thanks for providing more context, it helps me understand what the end goal is. I agree with @pcineverdies that the bulk of this conversation should probably go on #177, but just to provide some insight now that I have a clearer idea of what you want to do: you don't have to break down […]
-
@lucas-rami: Thanks!
-
Apologies for the long comment, but I spent a little time trying to figure out the right way to go about all this (specifically, this issue and the related #185). I am not sure I have the conclusive answer, but here is my view. This is the way I look at the most generic and clean flow, ignoring for the moment the production of […]
In substance, I see (1) the completion output of stores as compulsory and not optional. Also, I believe (2) the correct implementation of a memory system is a combination of circuitry (deliberate or incidental; possibly using the completion signals of memory operations) and of correctly connected memory components. The job of […]

I guess we can discuss […]

Orthogonally to all of the above, I would like to add the following: the FPT'19 pass essentially analyzes the topology of the circuit to discover if memory accesses are intrinsically sequentialized; if they are, potential dependency annotations can be removed (at least this is how I understand it works). If passes by @rpirayadi create circuitry that explicitly sequentializes some memory accesses, a run of the FPT'19 pass (or of some conceptually small generalization of it) should be able to remove those dependencies that @rpirayadi was trying to take care of. This seems both elegant to me (i.e., the FPT'19 pass is simply a pass that removes unnecessary dependency annotations, whatever the reason they are unnecessary) and sound (i.e., one pass constructs a circuit with the intent of removing dependencies, but a different analysis pass decides if they are actually removable). I hope I make sense.

(I have a side comment to all this, in case we finally agree that stores need to have a completion signal: should not the load operation have this completion signal too? Yes, I understand that it is exactly a simple dataless fork of the load output, but would it not be more elegant to design the load with the very same signal inside instead of outside? We can check this in detail, but I am ready to bet that a dataless fork inside the load connected to a sink outside of it would be completely cancelled by logic synthesis.)
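For illustration, the topology analysis described above boils down to a reachability question over the dataflow graph. A minimal sketch of that idea (my own reading of it, not the actual FPT'19 implementation, and ignoring buffers holding initial tokens and other corner cases) could look like this:

```cpp
// Sketch: access B is already sequentialized after access A if one of B's
// inputs transitively depends on one of A's outputs, since B then cannot
// fire before A has produced a token. A dependency annotation A -> B would
// be redundant in that case. This is simplified reachability over def-use
// chains; a real analysis must also reason about buffers, cycles, etc.
#include "mlir/IR/Operation.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"

static bool reaches(mlir::Operation *from, mlir::Operation *to) {
  llvm::DenseSet<mlir::Operation *> visited;
  llvm::SmallVector<mlir::Operation *> worklist{from};
  while (!worklist.empty()) {
    mlir::Operation *op = worklist.pop_back_val();
    if (op == to)
      return true;
    if (!visited.insert(op).second)
      continue;
    // Follow all SSA users of this operation's results.
    for (mlir::Operation *user : op->getUsers())
      worklist.push_back(user);
  }
  return false;
}
```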
-
We believe there is an issue in how the memory operations are currently handled in Dynamatic. This issue is discussed in the following section. A possible solution is proposed afterward, and finally, the impact this change has on Dynamatic is discussed.
Problem definition
Firstly, keep in mind that the issue we would like to highlight exists for both types of memory operations. However, for simplicity, we focus on load operations in the description. Currently, in the `CfToHandshake` pass, memory operations are converted from `memref::LoadOp` to either `handshake::MCLoadOp` or `handshake::LSQLoadOp`. As a result, the decision to choose between the memory controller and the LSQ must be made in this pass. Moreover, `handshake::LSQLoadOp` has a port named `LSQLoadPort` which needs to be connected to an LSQ, which forces us to instantiate the LSQ at this point. This process combines two logically distinct actions: (1) translating from CF to the Handshake representation and (2) connecting the memory operations to the chosen memory structure. We believe the two actions (i.e., translation and connection) should ideally be separated into two different passes to maintain a clean and modular design.

One of the manifestations of this issue is the current implementation of Shrink it or Shred it! (ICFPT'19). This transformation is implemented at the Handshake level because it requires the Handshake structure. However, this pass also modifies memory connectivity (namely by removing LSQs or splitting a single LSQ into multiple LSQs). This highlights that these connections should ideally not have been established in the earlier transformation.
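As a purely illustrative sketch of this coupling (not the actual lowering code; the helpers `needsLSQ`, `connectToMC`, `connectToLSQ` and the operand/result handling are assumptions), the conversion pattern is forced to both pick the interface and wire the port to it in one step:

```cpp
// Illustration of the coupling: rewriting memref::LoadOp requires choosing
// MC vs. LSQ and connecting the resulting port right away. Helper functions
// are hypothetical and operand/result lists are simplified for this sketch.
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Transforms/DialectConversion.h"

struct ConvertLoad : public mlir::OpConversionPattern<mlir::memref::LoadOp> {
  using OpConversionPattern::OpConversionPattern;

  mlir::LogicalResult
  matchAndRewrite(mlir::memref::LoadOp loadOp, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    if (needsLSQ(loadOp)) {
      // The LSQ must already exist (or be created here) so that the port's
      // LSQ-facing operands/results have something to connect to.
      auto port = rewriter.replaceOpWithNewOp<handshake::LSQLoadOp>(
          loadOp, adaptor.getOperands());
      connectToLSQ(port);
    } else {
      auto port = rewriter.replaceOpWithNewOp<handshake::MCLoadOp>(
          loadOp, adaptor.getOperands());
      connectToMC(port);
    }
    return mlir::success();
  }
};
```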
In conclusion, it appears that the decision on how to connect memory operations to the memory controller or the LSQ is made prematurely. Thus, it makes more sense to postpone this connection until our knowledge is complete.
Possible solution
A possible solution is to introduce a new type of load (and also store) operation, possibly named `handshake::GeneralLoadOp`, which would eventually be translated into either `handshake::MCLoadOp` or `handshake::LSQLoadOp` based on the annotated information. Another approach could be to introduce an entirely new dialect between CF and Handshake which holds the general operation. This would solve the inconsistency of having two types of Handshake representation (i.e., one before the final translation and one after).
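To make the annotation idea concrete, here is a minimal sketch (our own illustration, not an existing implementation): a pass that decides memory connectivity would simply tag the hypothetical `handshake::GeneralLoadOp` with its target interface through an ordinary MLIR attribute; the attribute name `mem.interface` is made up for this example.

```cpp
// Sketch: record the connectivity decision as an attribute on the
// (hypothetical) general load/store op instead of rewiring ports now.
// The attribute name "mem.interface" is illustrative only.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

void markTargetInterface(mlir::Operation *generalMemOp, bool needsLSQ) {
  mlir::MLIRContext *ctx = generalMemOp->getContext();
  generalMemOp->setAttr("mem.interface",
                        mlir::StringAttr::get(ctx, needsLSQ ? "lsq" : "mc"));
}
```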
Impact

Implementing the proposed solution would have some impact. Every pass that modifies the connectivity of memory operations should now annotate all the required information in `handshake::GeneralLoadOp`, which would eventually be translated into `handshake::MCLoadOp` and `handshake::LSQLoadOp`. The advantage is that this transformation is done uniformly, regardless of the changes done by the previous passes. Previous passes can alter the result simply by modifying the annotation on the operation.
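As a sketch of what that uniform final translation could look like (again our own illustration with assumed op names, builders, and attribute name; the instantiation of the memory interface itself is elided behind comments):

```cpp
// Sketch: the late, uniform lowering reads the annotation left by earlier
// passes and materializes the concrete port op. GeneralLoadOp, MCLoadOp,
// LSQLoadOp builders and the "mem.interface" attribute are assumed names;
// operand/result wiring to the actual MC/LSQ is elided.
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"

void lowerGeneralLoad(mlir::OpBuilder &builder, mlir::Operation *generalLoad) {
  auto target = generalLoad->getAttrOfType<mlir::StringAttr>("mem.interface");
  builder.setInsertionPoint(generalLoad);

  mlir::Operation *concrete;
  if (target && target.getValue() == "lsq")
    concrete = builder.create<handshake::LSQLoadOp>(
        generalLoad->getLoc(), generalLoad->getOperands());
  else
    concrete = builder.create<handshake::MCLoadOp>(
        generalLoad->getLoc(), generalLoad->getOperands());

  // The MC/LSQ would be created or looked up here and the port's extra
  // operands/results wired to it; assuming matching result counts, the
  // general op can then be rerouted and erased.
  generalLoad->replaceAllUsesWith(concrete);
  generalLoad->erase();
}
```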
One of the affected passes is `--lower-cf-to-handshake`, since it needs to convert CF operations into `handshake::GeneralLoadOp`. To instantiate the LSQ we need dominance information, which in turn requires control-flow information, so the control-flow information is one of the things that should be preserved and carried through. Other required modifications are in `--handshake-analyze-lsq-usage` and `--handshake-replace-memory-interfaces`, which implement Shrink it or Shred it!. The modification in these two passes mainly consists of not connecting the load operation to different ports, but instead annotating inside the operation where it should be connected. Finally, we believe that the proposed solution leaves all the other passes unchanged.

Given this proposed approach and its potential impact, we would appreciate your thoughts and opinions on this solution.