Implementation of Fast Token Delivery flow within Dynamatic #253
Replies: 25 comments 3 replies
-
@pcineverdies Thanks for formatting the PR! I will forward the issue to experts on this (TBH I am not familiar with the compiler things that you are doing here). In any case, here are some thoughts on the plan:

Comment on step 1:
I think the goal is to make the code introduced in every PR clean (which is the whole point of having a plan like this); now, since we all agree that the Boolean library is redundant, it is better to clean it up in the first place.

Comment on steps 2 and 3:
Isn't the plan to do a conversion pass from SSA to GSA, and then do something like

Update after talking to others:
-
Thanks a lot for doing this, it is great to be able to see a plan like this before PRs start pouring in. A few comments below.
Just to be clear, I (and everybody else I assume) have 0 emotional attachment to
Maybe we already all agree on this, but I don't think that MLIR can truly support a GSA representation in the IR, so it seems difficult to implement an "SSA to GSA" transformation pass (unless you are relying on special operation attributes, perhaps?). To me, an analysis pass seems like the right choice for this, as it allows you to maintain whatever "GSA information" you need in memory instead of in the IR itself.
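For readers less familiar with MLIR's analysis infrastructure, this is roughly what the in-memory option looks like (every name below is a placeholder, not Dynamatic's actual code): the GSA information lives in a plain C++ structure owned by an analysis, and the conversion pass queries it on demand.

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"

namespace {

// Hypothetical description of one GSA gate: the kind of gate a given block
// argument (i.e., an implicit phi) maps to, its data inputs, and the
// condition driving it (meaningful for gamma gates only).
struct GsaGateInfo {
  enum Kind { Mu, Gamma } kind;
  llvm::SmallVector<mlir::Value, 2> inputs;
  mlir::Value condition;
};

// MLIR analyses are plain classes constructed from the operation they are
// attached to; nothing GSA-related ever has to appear in the IR itself.
struct GsaAnalysis {
  GsaAnalysis(mlir::Operation *op) {
    // Walk `op` (e.g., a function) and populate `gates`.
  }
  llvm::DenseMap<mlir::Value, GsaGateInfo> gates;
};

} // namespace

// Inside the conversion pass, the information is then retrieved with:
//   GsaAnalysis &gsa = getAnalysis<GsaAnalysis>();
```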
-
Regarding the Boolean library, I agree that it's important to come up with a clean implementation. Right now, GSA is implemented as an analysis pass which provides its information to the conversion pass. Here is an example just for clarity.

C code

```c
unsigned loop_sum(in_int_t a[N]) {
  int x = 0;
  for (int i = 0; i < N; i++) x += a[i];
  return x;
}
```

cf dialect

```mlir
module {
func.func @loop_sum(%arg0: memref<8xi32> {handshake.arg_name = "a"}) -> i32 {
%c0 = arith.constant {handshake.name = "constant2"} 0 : index
%c0_i32 = arith.constant {handshake.name = "constant3"} 0 : i32
cf.br ^bb1(%c0, %c0_i32 : index, i32) {handshake.name = "br0"}
^bb1(%0: index, %1: i32): // 2 preds: ^bb0, ^bb1
%c8 = arith.constant {handshake.name = "constant4"} 8 : index
%c1 = arith.constant {handshake.name = "constant5"} 1 : index
%2 = memref.load %arg0[%0] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load1"} : memref<8xi32>
%3 = arith.addi %1, %2 {handshake.name = "addi0"} : i32
%4 = arith.addi %0, %c1 {handshake.name = "addi1"} : index
%5 = arith.cmpi ult, %4, %c8 {handshake.name = "cmpi0"} : index
cf.cond_br %5, ^bb1(%4, %3 : index, i32), ^bb2 {handshake.name = "cond_br0"}
^bb2: // pred: ^bb1
return {handshake.name = "return0"} %3 : i32
}
}
```

gsa dialect (explicit phi without block arguments, see [2] for semantics)

```mlir
module {
func.func @loop_sum(%arg0: memref<8xi32> {handshake.arg_name = "a"}) -> i32 {
%c0 = arith.constant {handshake.name = "constant2"} 0 : index
%c0_i32 = arith.constant {handshake.name = "constant3"} 0 : i32
cf.br ^bb1 {handshake.name = "br0"}
^bb1(): // 2 preds: ^bb0, ^bb1
%0 = gsa.mu(%c0, %4) : index
%1 = gsa.mu(%c0_i32, %3) : i32
%c8 = arith.constant {handshake.name = "constant4"} 8 : index
%c1 = arith.constant {handshake.name = "constant5"} 1 : index
%2 = memref.load %arg0[%0] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load1"} : memref<8xi32>
%3 = arith.addi %1, %2 {handshake.name = "addi0"} : i32
%4 = arith.addi %0, %c1 {handshake.name = "addi1"} : index
%5 = arith.cmpi ult, %4, %c8 {handshake.name = "cmpi0"} : index
cf.cond_br %5, ^bb1, ^bb2 {handshake.name = "cond_br0"}
^bb2: // pred: ^bb1
return {handshake.name = "return0"} %3 : i32
}
}
```

On the one hand, this latter approach seems cleaner, and it would allow handling GSA gates as normal operations in the IR (rather than having a custom data structure with the GSA information). On the other hand, as Lucas mentioned, MLIR might not truly support such a representation in the IR. Currently - in our working branch - we are relying on the analysis pass, but it would also be immediate to switch to the explicit gsa representation later on.
-
That's what I meant, I am 99% sure MLIR won't let you do what you do with those gsa.mu operations.

FYI, there is another kind of region called a graph region which allows you to do such things as using an SSA value before it is defined in the IR (the body of a non-external handshake function is one example).

Therefore I strongly suggest going for an analysis pass. I am not saying you could not dirtily hack around a semi-explicit version of GSA in the IR itself, but I'm saying it won't be practical to use, assuming it is even doable without losing one's sanity.
-
@pcineverdies @lucas-rami thank you for the clarifications and discussions! The current plan sounds good to me. I will take a look at #147.
-
Following the discussion on issue #185 and specifically this comment, I would like to mention that I am working on making the Fast Token Delivery code more modular, so I can run it later on over the handshake IR.
-
I am currently working in this direction. From an abstract point of view, this would end up in two functions which look like this:
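The original signatures did not survive in this thread, so the following is only a placeholder illustration of the split being described (one entry point used during the cf-to-handshake conversion, one usable later over an already-converted handshake function); every name below is hypothetical:

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"

// Hypothetical entry point invoked from within the FTD conversion pass, while
// the cf-level block structure is still present in the IR.
mlir::LogicalResult runFtdDuringConversion(mlir::Operation *funcOp);

// Hypothetical entry point invoked later in the flow, over an already-lowered
// handshake function, after the CFG information has been reconstructed from
// annotations.
mlir::LogicalResult runFtdOnHandshakeFunc(mlir::Operation *funcOp);
```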
The possibility of using FTD outside of the conversion pass however introduces a few problems. I am going to list all of them, maybe this helps us in finding a solution :)

```mlir
%98 = not %109 {ftd.skip, handshake.bb = 7 : ui32, handshake.name = "not23"} : <i1>
```

EDIT: this requirement for FTD does not compromise #186 and #187.
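For reference, a later FTD invocation could honor this marker with something as simple as the following (assuming `ftd.skip` is a plain unit attribute, which is what the printed form above suggests):

```cpp
#include "mlir/IR/Operation.h"

// Skip operations that were tagged during a previous FTD run.
static bool shouldSkipForFtd(mlir::Operation *op) {
  return op->hasAttr("ftd.skip");
}
```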
-
Thanks for the insights on your current/future API @pcineverdies, that's helpful. That's what I meant when I said that sometimes things are much easier conceptually than in the implementation ;)
I'm actually quite in favor of doing that. I never got around to doing it because there has never been a direct need so far, but I think it makes a lot of sense and will almost surely be necessary to do other things in the future (e.g., for adapting CFDFC identification to work with FTD circuits). Possibly something to add on this tracking issue and implement at some point.

Not a fan either, especially if it's only useful in the "I want to call this function from outside the pass" case. Would it be possible to somehow expose the analysis that derives which operations should be skipped?
-
Thanks for your suggestions! For the former point, not only do we need to move the CFG information around, but we also need a way to use MLIR/LLVM APIs over it (for instance, to use the dominance analysis).

EDIT: I was thinking that what's really convenient to have is the block structure as it is just before the end of the conversion pass.

For the latter: right now the operations are marked according to different categories as they are created ("you should always be skipped", "you should be skipped only in this case", "you should be skipped in this other case", ...). Maybe I'll end up finding a way to rescue the information out of the IR topology itself. Might be back on this point! :)
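Assuming dominance queries are indeed one of the APIs meant here (the specific analysis name was cut off above), this is the kind of call that only works while the block structure is materialized in a region; a minimal sketch:

```cpp
#include "mlir/IR/Block.h"
#include "mlir/IR/Dominance.h"
#include "mlir/IR/Operation.h"

// Only meaningful while `funcLikeOp` still contains a multi-block region.
static bool blockDominates(mlir::Operation *funcLikeOp, mlir::Block *a,
                           mlir::Block *b) {
  mlir::DominanceInfo domInfo(funcLikeOp);
  return domInfo.dominates(a, b);
}
```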
-
Not sure I understand your idea there. Do you want to maintain a "shell CF function" in the IR post conversion that basically just has the original control flow? If that's it then I am against keeping around IR operations that represent no actual hardware in the end just for the sake of future analysis. That's the job of attributes, which is why I liked your idea of annotating the original CFG on Handshake functions during the conversion. At some point I looked at the
-
As I have dived a bit deeper into the topic, I'd like to show you my current proposal for the problem "we need to maintain the CFG structure after the conversion from cf to handshake".

As I mentioned earlier, it is really convenient to have the block structure in the IR itself, as it happens in the cf dialect. However (and we are absolutely on the same page here), after the conversion is done, nothing except hardware should be in the handshake version of the circuit. This holds for the final outputs of the transformation pass. However, I believe that it might be beneficial to temporarily build the block structure again throughout a transformation pass, exploit the functionalities we need and then flatten the IR again. I'll show you what I have in mind with an example (the fir kernel).

We start with the following function in the cf dialect:

```mlir
module {
func.func @fir(%arg0: memref<1000xi32> {handshake.arg_name = "di"}, %arg1: memref<1000xi32> {handshake.arg_name = "idx"}) -> i32 {
%c0 = arith.constant {handshake.name = "constant2"} 0 : index
%c0_i32 = arith.constant {handshake.name = "constant3"} 0 : i32
cf.br ^bb1(%c0, %c0_i32 : index, i32) {handshake.name = "br0"}
^bb1(%0: index, %1: i32): // 2 preds: ^bb0, ^bb1
%c999 = arith.constant {handshake.name = "constant5"} 999 : index
%c1000 = arith.constant {handshake.name = "constant6"} 1000 : index
%c1 = arith.constant {handshake.name = "constant7"} 1 : index
%2 = memref.load %arg1[%0] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load2"} : memref<1000xi32>
%3 = arith.subi %c999, %0 {handshake.name = "subi0"} : index
%4 = memref.load %arg0[%3] {handshake.mem_interface = #handshake.mem_interface<MC>, handshake.name = "load3"} : memref<1000xi32>
%5 = arith.muli %2, %4 {handshake.name = "muli0"} : i32
%6 = arith.addi %1, %5 {handshake.name = "addi0"} : i32
%7 = arith.addi %0, %c1 {handshake.name = "addi2"} : index
%8 = arith.cmpi ult, %7, %c1000 {handshake.name = "cmpi0"} : index
cf.cond_br %8, ^bb1(%7, %6 : index, i32), ^bb2 {handshake.name = "cond_br0"}
^bb2: // pred: ^bb1
return {handshake.name = "return0"} %6 : i32
}
}
```

While converting to handshake, we don't do anything special. However, we keep track of the "edges" between basic blocks. In this case, it's important to remember that:

- block 0 always jumps to block 1;
- block 1 jumps either back to block 1 or to block 2, depending on the result of cmpi0.

As an annotation, we might store something like this (syntax is just as a reference):

```
{[
  {"src": 0, "dstT":1, "dstF": 1, "cond": ""},
  {"src": 1, "dstT":1, "dstF": 2, "cond": "cmpi0"},
]}
```

We would end up with a handshake IR containing nothing but handshake operations, together with the edge information (annotated somewhere in the IR):

```mlir
module {
handshake.func @fir(%arg0: memref<1000xi32>, %arg1: memref<1000xi32>, %arg2: !handshake.control<>, %arg3: !handshake.control<>, %arg4: !handshake.control<>, ...) -> (!handshake.channel<i32>, !handshake.control<>, !handshake.control<>, !handshake.control<>) attributes {argNames = ["di", "idx", "di_start", "idx_start", "start"], resNames = ["out0", "di_end", "idx_end", "end"]} {
%outputs, %memEnd = mem_controller[%arg1 : memref<1000xi32>] %arg3 (%addressResult) %result_16 {connectedBlocks = [1 : i32], handshake.name = "mem_controller0"} : (!handshake.channel<i32>) -> !handshake.channel<i32>
%1 = constant %arg4 {handshake.bb = 0 : ui32, handshake.name = "constant3", value = 0 : i32} : <i32>
[...]
%2 = br %arg4 {handshake.bb = 0 : ui32, handshake.name = "br1"} : <>
%trueResult, %falseResult = cond_br %3, %30 {handshake.bb = 1 : ui32, handshake.name = "cond_br1"} : <i1>, <i32>
[...]
%trueResult_14, %falseResult_15 = cond_br %31, %result {handshake.bb = 1 : ui32, handshake.name = "cond_br7"} : <i1>, <>
%result_16, %index_17 = control_merge %falseResult_15 {handshake.bb = 2 : ui32, handshake.name = "control_merge1"} : <>, <i1>
end {handshake.bb = 2 : ui32, handshake.name = "end0"} %trueResult_2, %memEnd_1, %memEnd, %arg4 : <i32>, <>, <>, <>
}
}{[{"src": 0, "dstT":1, "dstF": 1, "cond": ""},{"src": 1, "dstT":1, "dstF": 2, "cond": "cmpi0"},]}Now, each of the current transformation passes is not affected by this new setup. What I can do is to parse the edge information annotated, and temporary create some blocks with module {
handshake.func @fir(%arg0: memref<1000xi32>, %arg1: memref<1000xi32>, %arg2: !handshake.control<>, %arg3: !handshake.control<>, %arg4: !handshake.control<>, ...) -> (!handshake.channel<i32>, !handshake.control<>, !handshake.control<>, !handshake.control<>) attributes {argNames = ["di", "idx", "di_start", "idx_start", "start"], resNames = ["out0", "di_end", "idx_end", "end"]} {
%1 = constant %arg4 {handshake.bb = 0 : ui32, handshake.name = "constant3", value = 0 : i32} : <i32>
[...]
%4 = br %arg4 {handshake.bb = 0 : ui32, handshake.name = "br1"} : <>
cf.br ^bb1(%c0, %c0_i32 : index, i32) {handshake.name = "br0"}
^bb1(): // 2 preds: ^bb0, ^bb1
%trueResult, %falseResult = cond_br %3, %30 {handshake.bb = 1 : ui32, handshake.name = "cond_br1"} : <i1>, <i32>
[...]
%31 = cmpi ult, %30, %25 {handshake.bb = 1 : ui32, handshake.name = "cmpi0"} : <i32>
%trueResult_14, %falseResult_15 = cond_br %31, %result {handshake.bb = 1 : ui32, handshake.name = "cond_br7"} : <i1>, <>
cf.cond_br %31, ^bb1(), ^bb2 {handshake.name = "cond_br0"}
^bb2: // pred: ^bb1
%result_16, %index_17 = control_merge %falseResult_15 {handshake.bb = 2 : ui32, handshake.name = "control_merge1"} : <>, <i1>
end {handshake.bb = 2 : ui32, handshake.name = "end0"} %trueResult_2, %memEnd_1, %memEnd, %arg4 : <i32>, <>, <>, <>
}
}{[{"src": 0, "dstT":1, "dstF": 1, "cond": ""},{"src": 1, "dstT":1, "dstF": 2, "cond": "cmpi0"},]}This clearly makes no sense from a circuit perspective, but it allows us to have all the features we used to have in the conversion pass. module {
handshake.func @fir(%arg0: memref<1000xi32>, %arg1: memref<1000xi32>, %arg2: !handshake.control<>, %arg3: !handshake.control<>, %arg4: !handshake.control<>, ...) -> (!handshake.channel<i32>, !handshake.control<>, !handshake.control<>, !handshake.control<>) attributes {argNames = ["di", "idx", "di_start", "idx_start", "start"], resNames = ["out0", "di_end", "idx_end", "end"]} {
%outputs, %memEnd = mem_controller[%arg1 : memref<1000xi32>] %arg3 (%addressResult) %result_16 {connectedBlocks = [1 : i32], handshake.name = "mem_controller0"} : (!handshake.channel<i32>) -> !handshake.channel<i32>
%1 = constant %arg4 {handshake.bb = 0 : ui32, handshake.name = "constant3", value = 0 : i32} : <i32>
[...]
%2 = br %arg4 {handshake.bb = 0 : ui32, handshake.name = "br1"} : <>
%trueResult, %falseResult = cond_br %3, %30 {handshake.bb = 1 : ui32, handshake.name = "cond_br1"} : <i1>, <i32>
[...]
%trueResult_14, %falseResult_15 = cond_br %31, %result {handshake.bb = 1 : ui32, handshake.name = "cond_br7"} : <i1>, <>
%result_16, %index_17 = control_merge %falseResult_15 {handshake.bb = 2 : ui32, handshake.name = "control_merge1"} : <>, <i1>
end {handshake.bb = 2 : ui32, handshake.name = "end0"} %trueResult_2, %memEnd_1, %memEnd, %arg4 : <i32>, <>, <>, <>
}
}{[{"src": 0, "dstT":1, "dstF": 1, "cond": ""},{"src": 1, "dstT":1, "dstF": 2, "cond": "cmpi0"},]}In this way, the sketchy representation is only intermediate and the final output is always consistent with the handshake expectations. What do you think? So far I have made some experiments while still in the conversion pass, and it seems to work without any issue. After the call to |
-
Thanks for elaborating on your proposal. If that is the only way to go (I can't think of another) then I think it is ok, even though it feels like kind of a hack. A few comments below.

I don't think you need to store the name of the condition's producer if you just care about the CFG. You will anyway create a "dummy" structure, so it doesn't need to have proper conditionals on the terminators.

One last thought: have you considered creating a full func-level function with your desired CFG instead of modifying the Handshake function itself, then just deleting the func-level function entirely? I am thinking it may be easier because you won't have any "real" operation inside to care about in the process. Just create the function, insert your blocks and terminators as desired, run the analysis, then delete the entire thing. Feel free to ignore this if it is actually harder to implement than what you have, just wanted to share the idea in case you didn't already have it.
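Roughly what this suggestion could look like, as a sketch only (assumptions: the dummy CFG needs nothing but blocks and terminators, an i1 function argument can stand in for every branch condition, and all names are placeholders); the CFG recreated below is the one of the fir example above:

```cpp
#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Dominance.h"

static void analyzeDummyCfg(mlir::ModuleOp module, mlir::Location loc) {
  mlir::OpBuilder builder(module.getContext());
  builder.setInsertionPointToEnd(module.getBody());

  // 1. Create a throwaway func-level function whose only job is to hold the CFG.
  auto funcType = builder.getFunctionType({builder.getI1Type()}, {});
  auto shell =
      builder.create<mlir::func::FuncOp>(loc, "__ftd_cfg_shell", funcType);

  // 2. Recreate the blocks and terminators of the original CFG
  //    (bb0 -> bb1, bb1 -> bb1/bb2); the i1 argument is a dummy condition.
  mlir::Block *bb0 = new mlir::Block();
  bb0->addArgument(builder.getI1Type(), loc);
  mlir::Block *bb1 = new mlir::Block();
  mlir::Block *bb2 = new mlir::Block();
  shell.getBody().push_back(bb0);
  shell.getBody().push_back(bb1);
  shell.getBody().push_back(bb2);
  mlir::Value cond = bb0->getArgument(0);

  builder.setInsertionPointToEnd(bb0);
  builder.create<mlir::cf::BranchOp>(loc, bb1);
  builder.setInsertionPointToEnd(bb1);
  builder.create<mlir::cf::CondBranchOp>(loc, cond, bb1, bb2);
  builder.setInsertionPointToEnd(bb2);
  builder.create<mlir::func::ReturnOp>(loc);

  // 3. Run whatever CFG analyses are needed on the shell...
  mlir::DominanceInfo domInfo(shell);
  (void)domInfo.dominates(bb0, bb2);

  // 4. ...then delete the whole thing, leaving the handshake function untouched.
  shell->erase();
}
```

Since the shell never feeds any rewrite of the handshake function, deleting it at the end leaves the rest of the module exactly as it was.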
-
Thanks! I am finding my way around through the implementation (I had one minor problem yesterday, I hope it's just a matter of modifying some dependencies in the pass declaration).

I see the problem of the naming. Having the operation determining the edge to be taken might (70% sure) be useful; however - while optimizing the working code before a PR - we might end up stating that such information is not necessary. On the contrary, having operations inside the blocks (thus, moving them back and forth across them) is necessary to easily access information related to the relationship between the operations themselves and the CFG. Just to give you an idea, it's about

With respect to the "hacky" way, I also see it as a dirty approach. However, after these few months working on the project, I realized that the handshake level - being really in between the software and the hardware - always requires going back and forth between the two abstractions. My idea comes out of this thought: "It's so simple to manipulate the IR in the cf form".

In the end, if I'm not wrong, the main contributions of such a proposal to the current codebase are the revisits of both ICFPT'19 and buffer placement (CFDFC?), getting rid of the
-
You may need to add the cf dialect as a dependent dialect in the pass declaration in
The relationship between operations and the CFG is currently handled by the handshake.bb attribute.
I think I need to clarify a few things here. Major information does get lost in the process, and this is unavoidable considering that we are moving from a CFG-based software-like IR to a pure graph (a CDFG in our jargon). From Handshake forward, the original CFG no longer exists---and I mean that from a conceptual point of view, not just as in "it's no longer visible in the IR"---even if we want to believe really hard that it still does. Some of our downstream algorithms rely on us remembering what that CFG looked like (hence the handshake.bb attributes).

A good example is FTD circuits and the CFDFC identification logic for buffer placement. These two are currently incompatible because FTD circuits "mess up the CDFG to CFG correspondence" (for good reasons), enough so that the current approach fails to identify to which CFG cycles some edges belong. Most likely this is fixable because FTD circuits mess up the CFG/CDFG in a predictable/limited way that we can probably recover from (though I don't think anybody ever bothered to make a formal argument for this, it looks feasible in practice). However, tomorrow someone will invent a new, even more aggressive optimization that transforms the circuit to such an extent that we won't be able to semantically link some operations to any BB that used to exist, and we should be able to handle that, i.e., we should not rely on the original CFG.

All of this to say that, every time we rely on the CFG being recoverable in Handshake, we are a little bit lying to ourselves. We cross our fingers and hope that the circuit has not been transformed too hard since conversion from CF, but this is obviously unreliable. Nobody is talking of getting rid of the handshake.bb attributes.

Pragmatically, I still think it makes sense to maintain the original CFG as an annotation on the function because we will most likely need it in the near future, but I only personally care about the list of BBs and edges between them.
-
I'm back on this topic. The API for this
A couple of thoughts on what you have mentioned in your past few messages. The whole need for the control flow structure is to run fast token delivery later in the flow. If we add new operations and new connections after the conversion pass itself, we need to make sure that those new operations undergo the same flow adopted in the conversion stage.
This is clearly the main issue in this requirement. I already encountered a couple of cases in which, due to a renaming, it was not possible to restore the CFG structure.
However, each of these solutions has some issues:

As you said, the best way to go would be to live without this information, but, for our purposes, this seems unavoidable.
-
What I mentioned above is sort of orthogonal to the conversion pass using fast token delivery, which, on the contrary, is now functional (the integration tests working without FTD are also working with FTD - clearly this is not a metric of perfection, but I consider it a significant milestone). This is the code structure:
You can get an idea of what is going on in my working branch. If it looks okay to you, in the next few weeks I plan to go for 5/6 different PRs, each of them containing the above points (I also include the analysis passes, as I needed some modifications since the last PRs). The core of the code is in the conversion pass.

FTD will be run from the compilation flow using a custom flag (compile --fast-token-delivery). The conversion pass runs through a dedicated class.

FTD cannot currently use the ICFPT'19 functionalities; also, only the "on-merges" buffer placement is available.
-
Thanks for the summary, I was starting to lose track of some things so it is very helpful.
Thanks for thinking this through so well. The best solution I can think of is a mix of all three. First, and on your alternative 3, we should try to make our existing passes preserve operation names when an operation is replaced with "something equivalent". Many passes do not do that yet. (Initially we designed the "namer" so that it would explicitly disallow name reusing, but many people care about these names being consistent so at some point I relaxed this constraint and created a method on the
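The method being referred to got cut off above; as a rough illustration of the idea only (not Dynamatic's actual naming API), reusing a replaced operation's name boils down to carrying over the attribute visible in the IR dumps earlier in this thread:

```cpp
#include "mlir/IR/Operation.h"

// Copy the unique name of `oldOp` onto its equivalent replacement `newOp`,
// so that later CFG reconstruction can still find the condition producer.
static void inheritName(mlir::Operation *oldOp, mlir::Operation *newOp) {
  if (mlir::Attribute name = oldOp->getAttr("handshake.name"))
    newOp->setAttr("handshake.name", name);
}
```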
Awesome, thanks for thinking of breaking these things up into multiple contributions, and for the detailed plan :) As we approach the end-of-year holidays I may not be very responsive depending on when you make your PRs, but I will do my best to keep an eye on the project.

Perfectly fine.
-
Thanks guys for all the discussions!
I wonder if this issue can be solved by using attributes to label the operations that calculate the conditions of basic blocks, and by parsing these new attributes in the CFG reconstruction, instead of relying on operation names? We could then enforce the propagation of such attributes across the different passes in the same way we propagate the handshake.bb attribute.

Here is one way of applying this on @pcineverdies's earlier example (a placeholder name is used for the new attribute in the sketch below).

Any thoughts?
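The concrete example originally attached here was lost, so the following only illustrates the mechanism with a hypothetical attribute name, `ftd.cfg_cond` (assumption: a ui32 holding the index of the basic block whose condition the operation produces, mirroring how handshake.bb is typed):

```cpp
#include <optional>

#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"

// Hypothetical attribute name; the one proposed in the discussion was lost.
static constexpr const char *kCfgCondAttr = "ftd.cfg_cond";

// Tag the operation producing the condition of basic block `bb`.
static void tagConditionProducer(mlir::Operation *condOp, unsigned bb) {
  mlir::Builder builder(condOp->getContext());
  condOp->setAttr(kCfgCondAttr, builder.getUI32IntegerAttr(bb));
}

// During CFG reconstruction, recover the block whose condition `op` computes.
static std::optional<unsigned> getConditionBlock(mlir::Operation *op) {
  if (auto attr = op->getAttrOfType<mlir::IntegerAttr>(kCfgCondAttr))
    return attr.getUInt();
  return std::nullopt;
}
```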
-
Operation names are already attributes, just like handshake.bb. The only case in which relying on operation names would probably fail is if we start having passes delete the operations responsible for producing the conditions.

Now that I think of this more, however: have you considered not relying on the operation producing the condition at all?
-
It looks like the problem is currently caused by the pass
-
The maintenance problem remains equally for all attributes, but with
This is not general enough because, if I understand it correctly, it requires that any branch in a particular BB has an
-
To slightly correct my own phrasing there, since we would technically still be relying on an operation attribute, namely
-
I guess I see the benefit over using names (maintaining names remains as much as possible best-effort, while maintaining this new attribute is more important for functionality), despite the fact that it is an additional burden to manage down the pipeline. Just to give some direction for implementing this, the way to manage such an attribute would be through an analysis that is queried and maintained valid by most (all?) Dynamatic passes. Using analyses for this kind of cross-pass concern is something we should do more often, including on things that already exist (handshake.bb, for example).

We cannot really force this, but with an analysis you can make it relatively easy for a third-party pass to honor your requirement. Without going into long details, the general idea is the same as with the "namer" mentioned earlier.
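A minimal sketch of that pattern (all names are placeholders, not Dynamatic's actual classes): MLIR constructs an analysis on demand, caches it, and discards it after any pass that did not declare it preserved, which gives exactly the "queried and maintained valid" behavior described above.

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/Pass/AnalysisManager.h"

// Hypothetical analysis holding the reconstructed list of BBs and edges.
struct CfgEdgeAnalysis {
  CfgEdgeAnalysis(mlir::Operation *op) {
    // Rebuild the block/edge list, e.g., from the annotation attached to the
    // handshake function.
  }

  // Keep the cached result alive only across passes that preserved it.
  bool isInvalidated(const mlir::AnalysisManager::PreservedAnalyses &pa) {
    return !pa.isPreserved<CfgEdgeAnalysis>();
  }
};

// Inside a pass that uses (and keeps up to date) the CFG information:
//   CfgEdgeAnalysis &edges = getAnalysis<CfgEdgeAnalysis>();
//   // ...transform the IR, updating `edges` in place...
//   markAnalysesPreserved<CfgEdgeAnalysis>();
```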
Sounds good, thanks for clarifying why that would not work as is.
-
Thanks!
-
I was told that you're using an "Init" operation (buffer with an initial token) for your implementation. Do you have any guidance on how I can use such an operation myself? Thanks!
-
Description
The work from Unleashing Parallelism in Elastic Circuits with Faster Token Delivery [1] is still missing in the codebase. The goal of this issue is to outline a roadmap for its implementation, highlighting what needs to be done.
The Fast Token Delivery (FTD) methodology is a different algorithm to build the handshake IR from the cf IR. It gets rid of the concept of basic blocks in the original CFG, relying only on the relationship between producers and consumers among different operations. In the paper, it was shown that it can provide a significant execution time advantage, together with a reduction in area utilization.
The methodology aims at getting rid of anything related to control dependencies between basic blocks. However, in the first stage of the implementation, the network of control merges is maintained. This allows allocating the memory operations to the LSQ, and indicating the termination of the kernel's execution to the memory interfaces.
The idea of the issue is to design an alternative version of the current CfToHandshake pass (let's call it FtdCfToHandshake) which can be enabled through a flag at compile time (compile --fast-token-delivery). In this way, the current version of Dynamatic requires no change.

Also, since this methodology is about the conversion from the cf dialect to the handshake dialect, the rest of the Dynamatic flow remains unchanged (considering both the transformations to the handshake IR and the following lowering to the hw dialect).

Steps
1. Finish a Boolean Logic library in Dynamatic.
The FTD methodology requires many analyses of boolean conditions within the circuit (it is mainly about understanding which conditions must be satisfied to go from node A to node B following a given path). A draft of this library is already present in the Experimental folder. However, some new features must be added in order for it to be used in the algorithm. We are aware that having a custom library might be redundant in the codebase - taking into account that Espresso is already among the dependencies of the project. Once FTD is implemented, we aim at replacing the current custom library with something else.

2. Implement an analysis pass to obtain control dependency information on the CFG.
This analysis pass (based on [4]) is a fundamental step of the FTD algorithm and the GSA conversion.
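As a reminder of what this boils down to (a sketch only; the actual pass may be organized differently), the control-dependence condition from [4] can be phrased directly with MLIR's post-dominance utilities: block b is control dependent on block a iff a has a successor that b post-dominates, while b does not properly post-dominate a itself.

```cpp
#include "mlir/IR/Block.h"
#include "mlir/IR/Dominance.h"

// Classic control-dependence test from Ferrante, Ottenstein, and Warren [4].
static bool isControlDependentOn(mlir::Block *b, mlir::Block *a,
                                 mlir::PostDominanceInfo &postDomInfo) {
  if (postDomInfo.properlyPostDominates(b, a))
    return false;
  for (mlir::Block *succ : a->getSuccessors())
    if (postDomInfo.postDominates(b, succ))
      return true;
  return false;
}
```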
3. Implement an analysis pass which extracts the GSA information out of MLIR's SSA representation.
The FTD methodology is based on GSA, an alternative form of SSA which employs boolean conditions together with phi nodes (phi nodes are not explicit in MLIR, but encoded through block arguments in the cf dialect). There are many papers showing how to approach such a problem, such as [2] and [3]. The main consequence of having GSA is that, instead of using merge nodes for phi functions, we can go for multiplexers.
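A conceptual illustration of that last point (no Dynamatic API involved, just the software analogue of a phi versus a GSA gamma gate):

```cpp
// SSA view: which input reaches x is implicit in the control flow, so the
// hardware needs a control-driven merge.
int phi_style(bool c, int a, int b) {
  int x;
  if (c)
    x = a;   // x1
  else
    x = b;   // x2
  return x;  // x3 = phi(x1, x2)
}

// GSA view: the selection condition is explicit, so the phi becomes a
// multiplexer driven by c.
int gamma_style(bool c, int a, int b) {
  return c ? a : b;  // x3 = gamma(c, x1, x2)
}
```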
4. Implement the FTD methodology according to [1].
This step is the largest among the ones presented in the issue, as it requires fully rewriting the CfToHandshake pass. In this first phase, the way memories are handled remains identical to what is done in the current Dynamatic (thus without SQ, using the network of control merges to allocate blocks in the LSQ). This can be further divided into the following steps:
- use the start signal to trigger constants and undefined values in the circuit (intermediate, non-complete step);

Additional comments
As I mentioned above, the rest of the flow is not modified.
Some issues might arise from the interaction with other passes.
Since this is something not strictly related to the FTD algorithm (but rather about the consistency of the whole flow), some new issues will be opened later on.
[1] A. Elakhras, A. Guerrieri, L. Josipović, and P. Ienne, “Unleashing parallelism in elastic circuits with faster token delivery,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), IEEE, 2022, pp. 253–261. Accessed: Oct. 14, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10035134/
[2] P. Havlak, “Construction of thinned gated single-assignment form,” in Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds., Berlin, Heidelberg: Springer, 1994, pp. 477–499. doi: 10.1007/3-540-57659-2_28.
[3] P. Tu and D. Padua, “Efficient Building and Placing of Gating Functions,” ACM SIGPLAN Notices, vol. 30, no. 6, pp. 47–55, Jan. 1995, doi: 10.1145/223428.207115.
[4] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program dependence graph and its use in optimization,” ACM Trans. Program. Lang. Syst., vol. 9, no. 3, pp. 319–349, Jul. 1987, doi: 10.1145/24039.24041.