Elastic Channels #348
Replies: 10 comments 24 replies
-
|
Thank you for the nice document explaining your point of view! @AyaElAkhras has organized a meeting today so I will not write too much beforehand, in case I change what I think, but for a brief immediate addition-
This point I agree with 100%, and should be something we heavily emphasize when dealing with any new project that deals with the type system.
I am not convinced here on the details of how types should be altered (but do not yet have a concrete counter-proposal, I would not 100% stand behind what we have currently implemented as the long-term solution), but I can write up something more helpful on this in the near future.
This I also agree with 100% (and discussed with @AyaElAkhras @shundroid and @DanaKossaybati last week)- we are currently operating under a system where operations are fully type-agnostic. A simple example is that adding a speculator to a circuit, and adding the "spec" annotation to the type system, are two fully independent steps. We then rely on the separate "type verification" system to check whether the IR has sensible type annotations- this has been built as a fully modular, customizable system, with rules that can be as complex or as simple as needed. One final (complex) thing to throw in the ring:
I think we need to have a conversation on what actually defines an operation: when I talk about this with @lana555, the recurring comment that comes up is: I want to be able to annotate this floating point adder to be variant A, with small area and poor performance, or variant B, with large area and high performance. And this considers an operation to be a "logical" construct: an operation defines some relationship between input and output, but does not actually define any implementation characteristics at all. But one primary thing we use operations for is timing analysis, and driving the buffering passes, which need to know how each operation combines delays from the various inputs. From a generation perspective, we can drop the concept of an operation entirely: we could have all computational units share the same operation type, with an implementation parameter that says "add the data payloads together using architecture variant 7" etc. If the operation type does not define the implementation, but does define something, we should try to be very clear what that something is. |
-
Duplicate Signals in the Payload
One aspect I completely overlooked before came to mind while discussing with @AyaElAkhras and @rpirayadi: if I Join, for instance, two payloads containing the signal
Who Processes the Extra Signals?
When an Add receives two operands with a
Renaming and Steering
Some may have noticed a small glitch above: I suppressed |
-
|
@emmet-murphy: thanks for all these lively exchanges! Call me dumb, though, but I am not clear what is the counterproposal you are putting forward--I just understand that you disagree with mine on a few specific points.... 😉 Can I read what you suggest somewhere? Could you give a comprehensive description, even only by difference, that I could relate to? I am truly sorry, but if I were to repeat what, overall, you are suggesting to do instead (or how you see the information flow) I would not know where to start. |
-
|
I am answering primarily #348 (reply in thread) and #348 (reply in thread). I am not totally clear but I
This does not seem unreasonable to me but I wonder if the advantages outweigh the shortcomings. There are some serious disadvantages I can think of: (i) If I am inventing some new way of constructing circuits, you are forcing me to go down to RTL generation to implement what I need, even if it is fundamentally unneeded (it is clearly needed neither for processing tags nor for speculative bits, as I tried to argue). (ii) It is not clear how many generators we will have, for I suppose the metadata processing logic is different for different components (it may be the same for Add and Mul, but is it the same for Branch?); I think it would be worth expressing more explicitly how many generators we will need to have and what parameters they will need to achieve generality--and yet such generality will be anyway fundamentally limited to what we can imagine today, whereas my proposal carries the whole payload and I cannot think of any limitation that will eventually push people to change existing RTL generators (can you?). (iii) I do not quite seem to understand how others see timing models; my view is essentially modelled on classic EDA modelling strategies (a library of characterized datapoints and interpolation between them) but if I start being essentially hierarchical (e.g., composing an arithmetic operator and a metadata processing block on the fly in a single RTL component), how do I get the timing behaviour? Does the RTL generator also take care of composing the timing models and build them dynamically? Or maybe is it the only option to say what someone has essentially said ("let's forget about it", #348 (reply in thread))?
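For reference, the "library of characterized datapoints and interpolation between them" view in (iii) can be sketched as follows. This is only a minimal illustration: the function name `interpolate_delay`, the clamping at the ends of the table, and the numbers in `ADDER_POINTS` are all assumptions made up for the example, not values or APIs taken from Dynamatic.

```python
from bisect import bisect_left

# Hypothetical characterization table: (bitwidth, delay in ns).
# The numbers are invented purely for illustration.
ADDER_POINTS = [(8, 1.0), (16, 1.5), (32, 2.5)]


def interpolate_delay(points, width):
    """Linearly interpolate a delay between characterized bitwidths.

    Assumption: outside the characterized range the nearest point is used.
    """
    widths = [w for w, _ in points]
    if width <= widths[0]:
        return points[0][1]
    if width >= widths[-1]:
        return points[-1][1]
    i = bisect_left(widths, width)
    (w0, d0), (w1, d1) = points[i - 1], points[i]
    return d0 + (d1 - d0) * (width - w0) / (w1 - w0)


print(interpolate_delay(ADDER_POINTS, 12))  # 1.25
print(interpolate_delay(ADDER_POINTS, 24))  # 2.0
```

The open question raised above is precisely what this sketch does not answer: a table like this characterizes one flat component, and it says nothing about how the delay of an operator composed on the fly with a metadata processing block would be obtained.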
I think a large part of the perception on how lightweight these disadvantages are is based on the assumption that we know a couple of cases of extra bits (speculation and tagging) and we assume that nobody will ever invent something fundamentally different. I find this not to be future proof (who knows what metadata tomorrow someone will want to invent?) and this seems to me terrible for an experimental compiler that we would want as open and flexible as possible. I hear these advantages for the generators/wrappers: (i) The circuitry will be more efficient. This may be true in some cases (see #348 (reply in thread)) if implemented naively as I was suggesting, but it is certainly completely avoidable for all cases we know of today (e.g., nothing is needed for tagging because a single 1-output Fork, clearly stateless, is sufficient and naturally zero-delay; for speculation, my Fork-Or-Join can be implemented as a single stateless component identical to your logic in the wrapper and placed after or before the operator). Although perfectly identical circuitwise, a new component would be a separate component and, in many cases, an optional one: if I truly need or want it (and maybe I do not), much better to implement one more component/generator than touching a wrapper that serves already N purposes and that I now need to tweak to serve the (N+1)-th (we know how great it is to touch something tricky that someone now gone wrote five years ago...). I think the advantage is simply not there. (ii) Otherwise, the Handshake IR will be more cluttered and some analyses will be made hard by the larger number of components. I see that and it is an objective difficulty. I would think it wiser to find other, maybe more general, ways to handle this, if it is truly a serious problem (compilers handle many thousands of instructions at once and people can live with that).
One idea could be to have an attribute, possible for all components, called
I think we should weigh pros and cons very carefully. In an experimental compiler, to me the most valuable feature is generality and openness to things that, today, we do not have the slightest idea of. |
-
|
@murphe67: On a different note, I am not clear (maybe there is a description somewhere I have not read?) how you identify different parts of the extra bits for metadata processing. How would @AyaElAkhras instantiate in Handshake an Aligner to say that it aligns over |
-
|
Some of the premises of this discussion are slightly at odds with the Type System Specs; the difference is mostly in a clean separation between data and metadata. Largely, whatever is written in this discussion can be very cosmetically adapted to that. |
-
|
I close this discussion as it did not seem too successful in eliciting a discussion of the pros and cons of the various decisions, especially on reviewing how decisions at this level would impact RTL generation and timing analysis. We will probably need to come back to these issues at another point in time and in a more constructive setting. |
-
|
@paolo-ienne It seems you just closed this discussion, but I was writing a comment before so please let me post it! @paolo-ienne Thanks for opening this discussion! I haven’t had time to read everything here, so apologies if I’ve misunderstood something. First, I want to clarify that this is purely an engineering issue, not a research question.
Unfortunately, this is incorrect. Generality in the current implementation does not imply openness to future changes—if anything, it often makes modifications harder. We’ve struggled with this over the past months. I believe this is a key reason behind the differing attitudes toward implementation, so I’ll explain in detail. Dynamatic’s codebase is inflexible and hard to understand precisely because it was written in an overly generalized way—especially for hypothetical features that never materialized. Generalization for vague future possibilities just adds complexity without real benefit even when those cases finally arise. For example, export-rtl abstracts RTL matching through A worse case is Handshake IR. A former engineer decided:
This abstraction, while perhaps conceptually valid (e.g., memory controller doesn't belong to any basic block), made it significantly harder to implement optimization passes requiring CFG representation (e.g., speculation, fast token delivery). The lesson here is that generalization in one direction often imposes restrictions in another. This concern applies to your proposal as well. The idea of "adding and removing payload" feels unnatural for speculation signals. Aside from out-of-order tags, the spec bit is never an independent channel: we don’t “add” or “remove” it. Instead, it’s simply used as an accompanying state at
Now, to address your points:
We should generalize only for current needs.
"we should do only once"- I strongly disagree. That mindset leads to rigid, inflexible code. A healthy codebase allows for change when needed.
I’m not against conceptual clarity. But code should be optimized for the present, not for hypothetical future use cases. In conclusion, the codebase is not a place for brainstorming. It should reflect what is most useful and ideal for current needs. This clarity makes future generalization easier when truly necessary. I generally welcome discussions about future directions, but they should be based on or serve as counterproposals to concrete implementations. |
-
|
I guess it is time to close this again.... 😉 |
-
|
I wasn’t sure whether to post this, but I think it could be useful. Personally, I'm not even sure whether the concept of extra signals (or "payload with multiple signals," using @paolo-ienne's term) is permanent or not. I’ve been talking about this with @murphe67 since last December. While working on speculation, I realized the spec bit is more like a single shared state across the loop. It doesn’t need to accompany every channel; it can be wired directly to relevant units like save-commits and commits with proper buffering. I’m not familiar with out-of-order execution, but since most units (e.g., arithmetic ones) only accept aligned tags across inputs, I’m unsure why the tag needs to be tied to the data (wouldn’t a control-token-like tag accompanying control flow decisions be enough?). The only practical advantage of extra signals is that they share handshake logic with data. But something like the rigidifier can allow different channels to share handshake logic instead:
Here, the bold lines represent data channels, while the narrow lines represent extra signal (independent) channels. The shared_handshake op reorganizes the handshake logic internally, assigning the same valid/ready signals to all received channels. I mentioned this to @murphe67 last year, but I remember @gioelegott also shared the same idea with a clearer figure in the last meeting; feel free to share that figure if you don’t mind. So why did I implement speculation using extra signals? After discussing with Emmet, we agreed that our priority was getting speculation working in Dynamatic with our current understanding. It’s common to find better solutions during implementation, but delaying completion wasn’t an option; it’s already late since speculation is published, seemingly mostly due to the unmodifiable codebase. Altering the approach would require additional engineering effort (including investigation and testing). (Also, another reason is Emmet didn’t seem fully convinced by this idea.) To be clear, I’m not arguing whether extra signals will be needed in the future or whether the signal manager should be implemented. My point is that future possibility is broader and unpredictable, and right now, the signal manager is essential for speculation and out-of-order execution. If, after speculation and out-of-order execution are fully implemented, someone sees a clear benefit in removing extra signals or replacing them with a payload system, they’re free to do so, but that should be based on their needs at that time, and on the actual implementation. This again ties back to the importance of code flexibility: future changes should be driven by actual implementation needs at that time, not by our present intentions. Finally, I want to stress that the ongoing backend refactoring is more than just handling extra signals; it’s a broader effort. |






-
Preliminaries
I decided to write this after reading #336. I do not disagree with the content there but (1) I wonder if it is general enough or just part of a bigger issue and (2) it is one of those significant changes that we should do only once and forever hold our peace.
Remark no. 1: This may end up being only noise. I apologize and invite you to close it quickly if it is. It is perfectly possible that this is a nonproblem (but some discussions I had suggest it is not) or that someone has already thought of the perfect solution. Nobody will get offended if you ignore this! 😄
Remark no. 2: This is certainly related to various existing issues and discussions. I read some and got lost in others. I am afraid it is easier for me to write my thoughts on the problem clearly than to disseminate comments here and there. I think it is also more efficient for someone competent on the topic to read this and dismiss it as garbage than to respond to various piecemeal comments in various places. I believe this is related to #336, to #330, and to #172, at the very least.
Remark no. 3: I am a total ignoramus of modern C++ and of MLIR. Take this as my view on how conceptually things should be represented and how information should flow. In my view, the actual result should be the most elegant and natural implementation of this in the programming environment we are in--I am no judge of that.
Remark no. 4: I define names for the entities I use to try to be clear and consistent. They are bold. Of course, my names may be perfectly suboptimal and can be replaced with anything better without affecting this discussion.
Elastic Channels
I think that anything connecting two components in Dynamatic is an elastic channel or, for simplicity, a channel. If for weird reasons there should be nonelastic connections anywhere, they should be completely different objects that in no way mix and match with channels; they may be called wires, maybe, but I will not discuss them here. I think the fact that every connection in a well formed Handshake circuit/graph is elastic is, among others, essential for buffering passes to be sound--and buffering passes are the heart of Dynamatic and, at least in principle, should always work on any Handshake circuit/graph. I understand that this is one of the elements reproached to the bundle/unbundle system.
Control and Payload
Channels being elastic, they always contain handshake signals (control) and contain a payload. Of course, control represents the usual two `Valid` and `Ready` signals. The payload is a set of named ordinary signals (or group of wires, if you prefer). As said, the set may be empty, indicating what I think today is known as a `ControlType`. Each signal in the payload has at least the following attributes: a signal name (e.g., `data` or `speculative_bit_region_3` or `tag_region_17`) and a signal type of any appropriate sort (essentially the number of individual wires is what matters most, but then it could be something like `i10` to mean a ten-bit integer).
Signal names are completely arbitrary except for one special name (`data`). The payload of a particular channel may contain a `data` signal composed of 16 bits, a `speculative_bit_region_3` of 1 bit, and a `tag_region_17` of 6 bits. The C frontend will most likely generate circuits whose channels have a payload that contains exclusively a `data` signal (the C language value, probably corresponding to what is in the `ChannelType` today) or nothing.
Adding and Removing Payload
To add and remove new signals in the payload, one uses the ordinary Join and Fork components with minimal adaptation. A Join takes two inputs with two payloads and produces an output whose payload is the union of the two payloads. A Fork does not necessarily send the complete payload to both outputs (this may be the default implementation, of course) but only the specified subset to each output (more on the backend implementation later). A degenerate 1-output Fork may be used to simply suppress (i.e., leave disconnected) some signals.
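A minimal sketch of these generalized Join and Fork semantics, written in Python only for brevity. A payload is modelled as a plain mapping from signal name to bit width; the function names `join` and `fork`, and the choice to reject duplicate signal names in a Join, are assumptions of this sketch and not part of the proposal or of Dynamatic.

```python
def join(*payloads):
    """Join: the output payload is the union of the input payloads."""
    merged = {}
    for payload in payloads:
        for name, width in payload.items():
            if name in merged:
                # Assumption: duplicate names are rejected rather than merged.
                raise ValueError(f"duplicate signal name: {name}")
            merged[name] = width
    return merged


def fork(payload, *outputs):
    """Fork: each output receives only the listed subset of the payload."""
    result = []
    for names in outputs:
        unknown = set(names) - set(payload)
        if unknown:
            raise KeyError(f"unknown signals: {sorted(unknown)}")
        result.append({n: w for n, w in payload.items() if n in names})
    return result


# Add a 3-bit tag to a 16-bit data channel, then suppress it again
# with a degenerate 1-output Fork.
tagged = join({"data": 16}, {"tag": 3})
(untagged,) = fork(tagged, ["data"])

# Fork a full payload and an empty payload (a control token).
full, control = fork(tagged, ["data", "tag"], [])
print(tagged, untagged, full, control)
```

Only the payload bookkeeping is modelled here; the `Valid`/`Ready` handshake of the underlying components is unchanged and is therefore left out.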
The idea is that such generalized `Join`s and `Fork`s serve effectively the same purpose as the deprecated bundle/unbundle operations but do it in a sound, fully elastic way. If @AyaElAkhras wants to add a `tag` of 3 bits, a Join will do; to remove it, she will use a Fork. @rpirayadi will extract control tokens from memory addresses by forking a full payload (the unmodified address token) and an empty payload (the control token); he will gate an operation with a simple Join between a control token and an operand. @shundroid and @emmet-murphy will handle speculative bits much as mentioned before for tags.
Managing Payload in the Backend
Firstly, please let's ignore for this discussion how the backend is implemented, whether we need an intermediate step between Handshake and Verilog, whether it is good to have it, etc. I know there are discussions about this, but here I am only concerned with what information is passed to what--not how.
Secondly, I assume that every component will now be generated. Maybe it is not always necessary, but I prefer to assume that RTL languages are sufficiently horrible for not allowing any reasonable parametrization; if they are good enough, some generators may not be needed, but this changes nothing in the following discussion.
I think we need to consider three types of components, loosely defined as steering components, compute components, and the special case of `Join` and `Fork`.
Steering Components
Steering components do not care what the payload is: a `Branch` (let's ignore for now the selection input) needs only to know the number of bits to steer. At hardware generation, Dynamatic must add the number of bits in all signals in the payload and ask the `Branch` generator to generate a component with so many bits in the datapath. Similarly for any type of `Buffer` (which makes perhaps the steering name not brilliant, but it is a nominalistic detail).
Compute Components
Compute components will need to identify an operand out of the payload; by default this is `data`. The generator of an `Add`, for instance, may be told to create an adder for a 6-bit operand (`data` of the first signal) and a 9-bit operand (`data` of the second signal). It would be told also that the first operand is accompanied by 17 extra bits and the second operand by 3 extra bits. Of course, the extra bits are counted by adding the number of bits in all signals in the payload, except for the operands. In the normal case, this is trivial and there is nothing peculiar. The generated `Add` will naturally output a sum signal of 6 + 9 bits and 20 extra bits. By some simple conventions on the ordering of the extra bits, Dynamatic will know which extra bits from the hardware will correspond to which signals in the payload; the sum would normally be the `data` signal. This is easy to implement in hardware; maybe the best would have the actual hardware components with each channel not only composed of the ordinary `Valid`, `Ready`, and `Value` components but also of an `Extra` set of wires irrelevant for the computation but needing to be propagated appropriately--and this will be a universal generator for all payloads.
Thanks to `data` being the default operand/result of any compute components, no special annotations are needed: whatever code works for a pure payload containing only `data` (essentially, today's plain vanilla Dynamatic) will work for any speculative, tagged, etc. circuit. But of course people may want to do funny stuff with the payload. Suppose that I wanted to add 1 to `tag_74` in a particular channel (do not ask me why, though...): The pass creating this circuit would connect the channel to a normal `Add` but would specify an attribute that says `op1 = tag_74` and `res = tag_74` (probably `op2` would be still `data` coming from a constant and would need no annotation). This seems perfectly trivial for Dynamatic to handle.
Join and Fork
Finally, `Join` seems perfectly trivial to generate. `Fork` maybe a little less, but it seems to me that it should take `Valid`, `Ready`, and three sets of wires of arbitrary size: one going to the first output, one to the second output, and one to be left disconnected (this is easy to generalize to an N-output `Fork`). It seems to me that Dynamatic has all it needs to connect these ports correctly based on annotations on the input and output signals (something like `out1 = [ data, tag_74 ]` and `out2 = [ tag_35 ]`, where implicitly whatever is not mentioned is disconnected). The default is, of course, that the whole payload goes everywhere. Finally, it may be handy to have more expressive ways to identify in the annotations which parts of the payload are to be used: something like `out1 = [ *, !tag_74 ]` and `out2 = [ tag_74 ]` may be a way to say "split me out `tag_74`".
Use Cases
I had in mind a few use-cases that I think I know; specifically tagging, speculation, and the memory dependence networks. If the stuff above is to be taken vaguely seriously, the first thing is to convince ourselves that all use-cases map on this effortlessly and make the least changes to the plain vanilla Dynamatic once implemented with these notions (e.g., buffering would continue to work). Are there other relevant use cases?
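To make the backend bookkeeping from the Steering Components, Compute Components, and Join and Fork sections concrete, here is a small Python sketch. The helper names (`steering_width`, `extra_bits`, `select`) are hypothetical, the example payloads are chosen only to reproduce the 6/17 and 9/3 bit counts used in the text, and the left-to-right reading of `*` and `!name` in a Fork annotation is one possible interpretation, not a defined syntax.

```python
def steering_width(payload):
    """Datapath width a steering component (Branch, Buffer) must carry."""
    return sum(payload.values())


def extra_bits(payload, operand="data"):
    """Bits a compute component only propagates: everything but the operand."""
    return sum(width for name, width in payload.items() if name != operand)


def select(payload, pattern):
    """Expand a Fork annotation such as ['*', '!tag_74'] into signal names.

    Assumption: items are applied left to right; '*' selects everything
    and '!name' removes a previously selected signal.
    """
    chosen = set()
    for item in pattern:
        if item == "*":
            chosen.update(payload)
        elif item.startswith("!"):
            chosen.discard(item[1:])
        elif item in payload:
            chosen.add(item)
        else:
            raise KeyError(f"unknown signal: {item}")
    # Keep the payload's own ordering so the wire order stays predictable.
    return [name for name in payload if name in chosen]


# Payloads matching the Add example: 6-bit and 9-bit operands,
# accompanied by 17 and 3 extra bits respectively.
first = {"data": 6, "tag_74": 16, "speculative_bit_region_3": 1}
second = {"data": 9, "tag_35": 3}

print(steering_width(first))                    # 23
print(extra_bits(first) + extra_bits(second))   # 20
print(select(first, ["*", "!tag_74"]))          # ['data', 'speculative_bit_region_3']
print(select(first, ["tag_74"]))                # ['tag_74']
print(extra_bits(first, operand="tag_74"))      # 7, for the op1 = tag_74 case
```

Keeping the selection in payload order is one way to realize the "simple conventions on the ordering of the extra bits" mentioned above; any other fixed ordering would do as well.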