Elastic Channels #348

paolo-ienne · 2025-03-14T12:31:16Z

paolo-ienne
Mar 14, 2025
Maintainer

Preliminaries

I decided to write this after reading #336. I do not disagree with the content there but (1) I wonder if it is general enough or just part of a bigger issue and (2) it is one of those significant changes that we should do only once and forever hold our peace.

Remark no. 1: This may end up being only noise. I apologize and invite you to close it quickly if it is. It is perfectly possible that this is a nonproblem (but some discussions I had suggest it is not) or that someone has already thought of the perfect solution. Nobody will get offended if you ignore this! 😄

Remark no. 2: This is certainly related to various existing issues and discussions. I read some and got lost in others. I am afraid it is easier for me to write down my thoughts on the problem clearly in one place than to scatter comments here and there. I think it is also more efficient for someone competent on the topic to read this and dismiss it as garbage than to respond to piecemeal comments in various places. I believe this is related to #336, to #330, and to #172, at the very least.

Remark no. 3: I am a total ignoramus in modern C++ and MLIR. Take this as my view on how conceptually things should be represented and how information should flow. In my view, the actual result should be the most elegant and natural implementation of this in the programming environment we are in--I am no judge of that.

Remark no. 4: I define names for the entities I use to try to be clear and consistent. They are bold. Of course, my names may be perfectly suboptimal and can be replaced with anything better without affecting this discussion.

Elastic Channels

I think that anything connecting two components in Dynamatic is an elastic channel or, for simplicity, a channel. If for weird reasons there should be nonelastic connections anywhere, they should be completely different objects that in no way mix and match with channels; they may be called wires, maybe, but I will not discuss them here. I think the fact that every connection in a well-formed Handshake circuit/graph is elastic is, among other things, essential for buffering passes to be sound--and buffering passes are the heart of Dynamatic and, at least in principle, should always work on any Handshake circuit/graph. I understand that this is one of the criticisms levelled at the bundle/unbundle system.

Control and Payload

Channels being elastic, they always carry handshake signals (control) and a payload. Control is the usual pair of Valid and Ready signals. The payload is a set of named ordinary signals (or groups of wires, if you prefer). The set may be empty, which corresponds to what I think is known today as a ControlType. Each signal in the payload has at least the following attributes: a signal name (e.g., data or speculative_bit_region_3 or tag_region_17) and a signal type of any appropriate sort (essentially the number of individual wires is what matters most, but then it could be something like i10 to mean a ten-bit integer).

Signal names are completely arbitrary except for one special name (data). The payload of a particular channel may contain a data signal composed of 16 bits, a speculative_bit_region_3 of 1 bit, and a tag_region_17 of 6 bits. The C frontend will most likely generate circuits whose channels have a payload that contains exclusively a data signal (the C language value, probably corresponding to what is in the ChannelType today) or nothing.
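To make the payload notion concrete, here is a minimal Python sketch (not Dynamatic code). The `Channel` class and its methods are hypothetical names introduced only for illustration; Valid and Ready are left implicit because every channel is assumed to have them.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """An elastic channel: implicit Valid/Ready control plus a named payload."""

    # Maps each signal name to its width in bits; insertion order is preserved.
    payload: dict = field(default_factory=dict)

    def is_control_only(self) -> bool:
        # An empty payload plays the role of a pure control token.
        return not self.payload

    def width(self) -> int:
        # Total number of payload wires, excluding Valid and Ready.
        return sum(self.payload.values())


# The example payload from the text: 16 data bits, 1 speculative bit, 6 tag bits.
example = Channel({"data": 16, "speculative_bit_region_3": 1, "tag_region_17": 6})
control = Channel()
```

Here `example.width()` counts 16 + 1 + 6 payload wires, and `control` carries only the handshake.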

Adding and Removing Payload

To add and remove new signals in the payload, one uses the ordinary Join and Fork components with minimal adaptation. A Join takes two inputs with two payloads and produces an output whose payload is the union of the two payloads. A Fork does not necessarily send the complete payload to both outputs (this may be the default implementation, of course) but only the specified subset to each output (more on the backend implementation later). A degenerate 1-output Fork may be used to simply suppress (i.e., leave disconnected) some signals.

The idea is that such generalized Joins and Forks serve effectively the same purpose as the deprecated bundle/unbundle operations but do it in a sound, fully elastic way. If @AyaElAkhras wants to add a tag of 3 bits, a Join will do; to remove it, she will use a Fork. @rpirayadi will extract control tokens from memory addresses by forking a full payload (the unmodified address token) and an empty payload (the control token); he will gate an operation with a simple Join between a control token and an operand. @shundroid and @emmet-murphy will handle speculative bits much as mentioned before for tags.
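The payload bookkeeping of such generalized Joins and Forks can be sketched in a few lines of Python. This is only an illustration of the type-level effect, under assumptions: the function names are hypothetical, payloads are plain dictionaries from signal name to bit width, and a Join with clashing names is simply rejected to keep the sketch short.

```python
def join(a: dict, b: dict) -> dict:
    """Payload of a Join output: the union of the two input payloads."""
    clash = a.keys() & b.keys()
    if clash:
        # Assumption of this sketch: duplicate names are not handled here.
        raise ValueError(f"duplicate signal names: {sorted(clash)}")
    return {**a, **b}


def fork(payload: dict, *outputs) -> list:
    """Payload of each Fork output.

    Each entry of `outputs` lists the signal names sent to that output;
    None stands for the default behaviour (the whole payload).
    """
    result = []
    for names in outputs:
        if names is None:
            result.append(dict(payload))
        else:
            result.append({name: payload[name] for name in names})
    return result


# Adding a 3-bit tag with a Join, then removing it with a 1-output Fork.
tagged = join({"data": 32}, {"tag": 3})
untagged = fork(tagged, ["data"])

# Extracting a control token from an address: full payload plus empty payload.
address_and_control = fork({"data": 32}, None, [])

# Gating an operand with a control token is a Join with an empty payload.
gated = join({"data": 32}, {})
```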

Managing Payload in the Backend

Firstly, please let's ignore for this discussion how the backend is implemented, whether we need an intermediate step between Handshake and Verilog, whether it is good to have it, etc. I know there are discussions about this, but here I am only concerned with what information is passed to what--not how.

Secondly, I assume that every component will be generated. Maybe it is not always necessary, but I prefer to assume that RTL languages are too horrible to allow any reasonable parametrization; if they are good enough, some generators may not be needed, but this changes nothing in the following discussion.

I think we need to consider three types of components, loosely defined as steering components, compute components, and the special case of Join and Fork.

Steering Components

Steering components do not care what the payload is: a Branch (let's ignore for now the selection input) needs only to know the number of bits to steer. At hardware generation, Dynamatic must add up the number of bits in all signals in the payload and ask the Branch generator to generate a component with that many bits in the datapath. The same holds for any type of Buffer (which perhaps makes the name "steering" not brilliant, but it is a nominalistic detail).
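As a tiny illustration of the only quantity a steering-component generator would need, here is a Python sketch; `steering_width` is a hypothetical helper, not an existing Dynamatic function.

```python
def steering_width(payload: dict) -> int:
    """Number of datapath bits a Branch or Buffer must carry for this payload."""
    # Signal names are irrelevant here; only the total wire count matters.
    return sum(payload.values())


branch_bits = steering_width(
    {"data": 16, "speculative_bit_region_3": 1, "tag_region_17": 6}
)
control_bits = steering_width({})  # a pure control channel steers no data bits
```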

Compute Components

Compute components will need to identify an operand out of the payload; by default this is data. The generator of an Add, for instance, may be told to create an adder for a 6-bit operand (data of the first input) and a 9-bit operand (data of the second input). It would be told also that the first operand is accompanied by 17 extra bits and the second operand by 3 extra bits. Of course, the extra bits are counted by adding the number of bits in all signals in the payload, except for the operands. In the normal case, this is trivial and there is nothing peculiar. The generated Add will naturally output a sum signal of 6 + 9 bits and 20 extra bits. By some simple conventions on the ordering of the extra bits, Dynamatic will know which extra bits from the hardware will correspond to which signals in the payload; the sum would normally be the data signal. This is easy to implement in hardware; perhaps the best option is for each channel of the actual hardware components to be composed not only of the ordinary Valid, Ready, and Value signals but also of an Extra set of wires, irrelevant for the computation but needing to be propagated appropriately--and this would give a universal generator for all payloads.

Thanks to data being the default operand/result of any compute component, no special annotations are needed: whatever code works for a pure payload containing only data (essentially, today's plain vanilla Dynamatic) will work for any speculative, tagged, etc. circuit. But of course people may want to do funny stuff with the payload. Suppose that I wanted to add 1 to tag_74 in a particular channel (do not ask me why, though...): The pass creating this circuit would connect the channel to a normal Add but would specify an attribute that says op1 = tag_74 and res = tag_74 (probably op2 would be still data coming from a constant and would need no annotation). This seems perfectly trivial for Dynamatic to handle.
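The split between operand bits and extra bits that a compute-component generator would be told about can be sketched as follows. This is a hedged Python illustration: `split_operand` is a hypothetical helper, and the signal names and widths of the extra signals are made up so that the counts match the 17 and 3 extra bits used in the text.

```python
def split_operand(payload: dict, operand: str = "data") -> tuple:
    """Return (operand width, number of extra bits) for one input channel."""
    if operand not in payload:
        raise KeyError(f"payload has no signal named {operand!r}")
    extra = sum(width for name, width in payload.items() if name != operand)
    return payload[operand], extra


# Default case: the operand is data, everything else is carried along.
first = split_operand({"data": 6, "tag_region_17": 16, "speculative_bit_region_3": 1})
second = split_operand({"data": 9, "tag_74": 3})
total_extra = first[1] + second[1]

# The "funny stuff" case: an attribute op1 = tag_74 selects another operand,
# so data itself becomes part of the extra bits.
on_tag = split_operand({"data": 32, "tag_74": 8}, operand="tag_74")
```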

Join and Fork

Finally, Join seems perfectly trivial to generate. Fork maybe a little less, but it seems to me that it should take Valid, Ready, and three sets of wires of arbitrary size: one going to the first output, one to the second output, and one to be left disconnected (this is easy to generalize to an N-output Fork). It seems to me that Dynamatic has all it needs to connect these ports correctly based on annotations on the input and output signals (something like out1 = [ data, tag_74 ] and out2 = [ tag_35 ], where implicitly whatever is not mentioned is disconnected). The default is, of course, that the whole payload goes everywhere. Finally, it may be handy to have more expressive ways to identify in the annotations which parts of the payload are to be used: something like out1 = [ *, !tag_74 ] and out2 = [ tag_74 ] may be a way to say "split me out tag_74".
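One possible reading of these annotations can be sketched in Python. Everything here is an assumption made for illustration: annotations are lists of strings, "*" means every signal, a leading "!" excludes a signal, and `select` and `fork_outputs` are hypothetical names.

```python
def select(payload: dict, pattern: list) -> dict:
    """Resolve one annotation such as ["*", "!tag_74"] against a payload."""
    excluded = {p[1:] for p in pattern if p.startswith("!")}
    included = [p for p in pattern if not p.startswith("!")]
    names = list(payload) if "*" in included else included
    # Naming a signal that is absent from the payload raises KeyError.
    return {name: payload[name] for name in names if name not in excluded}


def fork_outputs(payload: dict, annotations: list) -> tuple:
    """Return (payload of each output, signals left disconnected)."""
    outs = [select(payload, pattern) for pattern in annotations]
    used = set().union(*outs)
    disconnected = {n: w for n, w in payload.items() if n not in used}
    return outs, disconnected


payload = {"data": 32, "tag_74": 3, "tag_35": 2}

explicit = fork_outputs(payload, [["data", "tag_74"], ["tag_35"]])
split_tag = fork_outputs(payload, [["*", "!tag_74"], ["tag_74"]])
suppress = fork_outputs(payload, [["data"]])  # degenerate 1-output Fork
```

In the last call the two tags end up in the disconnected set, which is the degenerate 1-output Fork used to suppress signals.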

Use Cases

I had in mind a few use-cases that I think I know; specifically tagging, speculation, and the memory dependence networks. If the stuff above is to be taken vaguely seriously, the first thing is to convince ourselves that all use-cases map on this effortlessly and make the least changes to the plain vanilla Dynamatic once implemented with these notions (e.g., buffering would continue to work). Are there other relevant use cases?

murphe67 · 2025-03-14T13:03:14Z

murphe67
Mar 14, 2025
Maintainer

Thank you for the nice document explaining your point of view! @AyaElAkhras has organized a meeting today so I will not write too much beforehand, in case I change what I think, but for a brief immediate addition-

I think that anything connecting two components in Dynamatic is an elastic channel or, for simplicity, a channel. If for weird reasons there should be nonelastic connections anywhere, they should be completely different objects that in no way mix and match with channels; they may be called wires, maybe, but I will not discuss them here. I think the fact that every connection in a well-formed Handshake circuit/graph is elastic is, among other things, essential for buffering passes to be sound--and buffering passes are the heart of Dynamatic and, at least in principle, should always work on any Handshake circuit/graph. I understand that this is one of the criticisms levelled at the bundle/unbundle system.

This point I agree with 100%, and should be something we heavily emphasize when dealing with any new project that deals with the type system.

To add and remove new signals in the payload, one uses the ordinary Join and Fork components with minimal adaptation.

I am not convinced here on the details of how types should be altered (but do not yet have a concrete counter-proposal, I would not 100% stand behind what we have currently implemented as the long-term solution), but I can write up something more helpful on this in the near future.

It seems to me that Dynamatic has all it needs to connect these ports correctly based on annotations on the input and output signals.

This I also agree with 100% (and discussed with @AyaElAkhras @shundroid and @DanaKossaybati last week)- we are currently operating under a system where operations are fully type-agnostic. A simple example is that adding a speculator to a circuit, and adding the "spec" annotation to the type system, are two fully independent steps.

We then rely on the separate "type verification" system to check whether the IR has sensible type annotations- this has been built as a fully modular, customizable system, with rules that can be as complex or as simple as needed.

One final (complex) thing to throw in the ring:

Suppose that I wanted to add 1 to tag_74 in a particular channel (do not ask me why, though...): The pass creating this circuit would connect the channel to a normal Add

I think we need to have a conversation on what actually defines an operation: when I talk about this with @lana555, the recurring comment that comes up is: I want to be able to annotate this floating point adder to be variant A, with small area and poor performance, or variant B, with large area and high performance. And this considers an operation to be a "logical" construct, that an operation defines some relationship between input and output, but does not actually define any implementation characteristics at all.

But one primary thing we use operations for is timing analysis, and to drive the buffering passes, which need to know how each operation combines delays from the various inputs.

From a generation perspective, we can drop the concept of an operation entirely- we could have all computational units share the same operation type, with an implementation parameter that says "add the data payloads together using architecture variant 7" etc. etc.

If the operation type does not define the implementation, but does define something, we should try to be very clear what that something is.

6 replies

murphe67 Mar 14, 2025
Maintainer

if this discussion is not related to the notion of elastic channels in Dynamatic

It comes from your point-

Suppose that I wanted to add 1 to tag_74 in a particular channel (do not ask me why, though...): The pass creating this circuit would connect the channel to a normal Add but would specify an attribute that says op1 = tag_74 and res = tag_74 (probably op2 would be still data coming from a constant and would need no annotation)

Does operation type fully determine the logical behaviour of the "node"? Or can the logical behaviour be different based on the channel type + attributes?

Your suggestion includes an attribute + channel type combo that switches which values the logical behaviour applies to- this is quite different from any previous definition we've had of what an operation is.

I do agree this is probably a different conversation than elastic channels as a narrower focus, for sure.

Extra note-

will return the timing model for that particular implementation of the component

So you don't think an operation fully defines a particular timing model either? So from your view, it seems operations do not really exist at all....

paolo-ienne Mar 14, 2025
Maintainer Author

Does operation type fully determine the logical behaviour of the "node"? Or can the logical behaviour be different based on the channel type + attributes?

An operation fully determines the logical behaviour of a node: an Add implements 2's complement addition, whatever the attributes. Attributes impact the type of the adder (which does not affect the logic function) and how many wires of a larger bundle at the input are to be ignored and carried to the output unmodified.

paolo-ienne Mar 14, 2025
Maintainer Author

will return the timing model for that particular implementation of the component

So you don't think an operation fully defines a particular timing model either? So from your view, it seems operations do not really exist at all....

I am not sure I follow. An Add completely defines a logic function but most certainly not a univocal timing model. A particular implementation of an adder in a particular technology defines a logic function and a timing behaviour.

Am I missing something? I cannot imagine that we disagree....

lana555 Mar 15, 2025
Maintainer

I read this three times and am failing to understand the problem :D I can only repeat what @paolo-ienne said:

I do not understand what is the issue with timing (and exactly how it relates to this discussion)

I wonder if we are mixing up two separate discussions into one? Why and how is timing related to how we decide to treat and represent tags? I understand that the timing might be impacted by tags (I need to account for the delays of ORing tags, etc.), but this seems a separate issue from how we represent the tags in the compiler, is it not? I think we need a concrete example of this dependency to understand the problem and start discussing how to address it...

murphe67 Mar 15, 2025
Maintainer

If it is distracting from "how we represent the tags in the compiler" as a conversation, it is not super important immediately.

For a (hopefully short) summary-

An Add completely defines a logic function but most certainly not a univocal timing model.

Dynamatic currently (in several places throughout the code) assumes a 1-to-1 mapping between "operation type" and "all specifics of the RTL code". With the addition of tags, this assumption is (weakly) broken in ways that are very low impact.

Some aspects of @paolo-ienne's proposal break this assumption in more consequential ways. If we decide this assumption no longer serves us, we need to agree on what any new assumption might be. For the impact on the codebase as a whole, we should be very thoughtful about how "configurable" any unit should be.

paolo-ienne · 2025-03-14T17:21:12Z

paolo-ienne
Mar 14, 2025
Maintainer Author

Duplicate Signals in the Payload

One aspect I completely overlooked before came to mind while discussing with @AyaElAkhras and @rpirayadi: if I Join, for instance, two payloads that both contain a signal called tag, the description above says both should be part of the output payload. I think this is correct, but there is a naming conflict, for clearly there cannot be two tag signals in the output payload. I think it is essentially a cosmetic issue (i.e., both signals must be there) but I cannot think of a satisfactory solution, only hacks: tag_1 and tag_2, or a single tag with the union of all the wires in the original tags. Maybe someone has a more elegant idea.
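The two hacks can be written down as a short Python sketch, purely to show what each does to the output payload. The function names are hypothetical, payloads are dictionaries from signal name to bit width, and the sketch does not guard against a suffixed name colliding with a signal that already exists.

```python
def join_with_suffixes(a: dict, b: dict) -> dict:
    """Hack 1: clashing names become name_1 (first input) and name_2 (second)."""
    clash = a.keys() & b.keys()
    out = {}
    for name, width in a.items():
        out[f"{name}_1" if name in clash else name] = width
    for name, width in b.items():
        out[f"{name}_2" if name in clash else name] = width
    return out


def join_with_widening(a: dict, b: dict) -> dict:
    """Hack 2: clashing names merge into one signal holding all the wires."""
    out = dict(a)
    for name, width in b.items():
        out[name] = out.get(name, 0) + width
    return out


lhs = {"data": 32, "tag": 3}
rhs = {"tag": 3}

suffixed = join_with_suffixes(lhs, rhs)
widened = join_with_widening(lhs, rhs)
```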

Who Processes the Extra Signals?

When an Add receives two operands with a tag, we probably would expect the output to have a single tag equal to any one of the input tags (someone should have probably guaranteed them to be the same in a correct circuit). When an Add receives two potentially speculative operands, possibly the output is speculative if at least one of the inputs is speculative (inclusive OR). If the Add propagates both tags or both speculative bits (with appropriate naming, see above), who combines them? My opinion is that this is an operation that should NOT be delegated to the operator, or we would either get an infinity of different Add components or the RTL generators would become unbearably cluttered. This processing of the tag and of the speculative bit should be implemented by the pass that is adding these signals, either with standard components or with ad-hoc ones. A simple 1-output Fork with an annotation out1 = [ *, !tag_2 ] suffices for the tag. For the speculation, one would need a Fork to split everything from the speculative bits (out1 = [ *, !speculative_1, !speculative_2 ], out2 = [ speculative_1 ], and out3 = [speculative_2 ]), an elastic Or to process out2 and out3, and a Join to recombine the whole; if this happens all the time in speculative circuits, a dedicated component could be easily introduced. I think it is nice and clean.
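The Fork, Or, Join combination for speculative bits can be mimicked at the level of a single token. This Python sketch is only a behavioural illustration under assumptions: a token is a dictionary from signal name to integer value, the function name is hypothetical, and the combined bit is called speculative here although the text does not fix a name for it.

```python
def combine_speculative_bits(token: dict) -> dict:
    """Behaviour of Fork -> Or -> Join on one token carrying two spec bits."""
    spec_names = ("speculative_1", "speculative_2")

    # Fork: out1 = [ *, !speculative_1, !speculative_2 ]
    rest = {name: value for name, value in token.items() if name not in spec_names}
    # Fork: out2 = [ speculative_1 ] and out3 = [ speculative_2 ]
    spec_1 = token["speculative_1"]
    spec_2 = token["speculative_2"]

    # Elastic Or: the result is speculative if at least one input was.
    combined = spec_1 | spec_2

    # Join: recombine the untouched signals with the single speculative bit.
    return {**rest, "speculative": combined}


result = combine_speculative_bits(
    {"data": 42, "speculative_1": 0, "speculative_2": 1}
)
```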

Renaming and Steering

Some may have noticed a small glitch above: I suppressed tag_2 but what should be called tag is actually called tag_1 (assuming one of the hacks above for the duplicate names). Clearly there would soon be a need for a Rewire operation that renames or exchanges signals in a payload. Annotations like rename tag_1 tag or exchange tag_1 tag_2 are probably self-explanatory. The beauty is that they are pseudocomponents that would not be instantiated in the RTL but would simply be used to determine how to generate the RTL (in other words, they correspond only to wires, their area and timing model are trivial and independent from the annotations, etc.).
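Since Rewire only relabels wires, its effect on a payload is easy to sketch in Python. The two functions below are hypothetical illustrations of the rename and exchange annotations, with payloads as dictionaries from signal name to bit width whose order stands for the wire order.

```python
def rename(payload: dict, old: str, new: str) -> dict:
    """rename old new: same wires, same position, new signal name."""
    if old not in payload:
        raise KeyError(f"no signal named {old!r}")
    if new in payload:
        raise ValueError(f"signal {new!r} already exists")
    return {(new if name == old else name): width for name, width in payload.items()}


def exchange(payload: dict, a: str, b: str) -> dict:
    """exchange a b: the wires called a are now called b and vice versa."""
    if a not in payload or b not in payload:
        raise KeyError("both signals must exist in the payload")
    swap = {a: b, b: a}
    return {swap.get(name, name): width for name, width in payload.items()}


renamed = rename({"data": 32, "tag_1": 3}, "tag_1", "tag")
exchanged = exchange({"data": 32, "tag_1": 3, "tag_2": 5}, "tag_1", "tag_2")
```

No hardware is implied by either function: the widths and the wire order are unchanged, only the names attached to them move.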

10 replies

lana555 Mar 15, 2025
Maintainer

I totally agree with @murphe67: something that might be a problem in RTL gen should not be a reason to make our IR overly complicated (harder to analyze, slower to optimize with all our ILPs) and our circuits worse (bigger and with a longer CP).

I don't know how exactly RTL gen is done now, but it seems that everything discussed above could be easily supported.
There are two independent pieces of unit generation:
1. Generating the wrapper (containing the tag manipulation logic, in addition to the usual stuff). The code doing this should be largely independent of unit type (most units treat tags the same way, with some very specific exceptions like merge/mux) and independent of the underlying arithmetic generator (whether the unit comes from XLS or FloPoCo has nothing to do with the tags).
2. Instantiating the appropriate arithmetic/logic unit (adder from FloPoCo, Vitis, etc.).

Adding support for a new adder should not require changing anything in 1., and adding support for a new type of tag should not require changing anything in 2. I suppose this is more or less in line with Emmet's claim that adding Aya's tags does not take more than a line of code--the number of edits to do for tags should not be related to the size of the unit library/number of unit generators.

lana555 Mar 15, 2025
Maintainer

Another thing: I don't think that "hiding" the tag manipulation logic inside the units is complicating the timing analysis and optimization. The tags are always independent of each other and largely uniform across units: therefore, they can be easily precharacterized (if my tag manipulation is an OR, I measure the delay of the OR) and easily added to the timing model of the unit (if my unit has 3 spec tags and 5 Aya tags, I add 8 parallel nodes with the corresponding input-output delay to my unit's timing model).

That being said, while it is good that we think about this and can support such fine-grain modeling, I think we are overdoing it--the delay of the tag manipulation will in most (all?) practical cases be smaller than the delay of the accompanying units (we are talking of a single OR gate next to an adder), so I would probably put the implementation of this very very low on the priority list... :)
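The precharacterization idea can be sketched as follows in Python. This is a rough illustration only: the function name is hypothetical, the delay values are made up, and a real timing model would have more structure than one number per path.

```python
def unit_timing_paths(unit_delay_ns: float, extra_signal_delays_ns: dict) -> dict:
    """Input-to-output delay of every parallel path through a wrapped unit."""
    paths = {"data": unit_delay_ns}
    # Each extra signal is an independent path with its own measured delay.
    for name, delay in extra_signal_delays_ns.items():
        paths[name] = delay
    return paths


OR_DELAY_NS = 0.1  # made-up precharacterized delay of a single OR gate

# A unit with 3 spec tags and 5 other tags: 8 parallel nodes next to the adder.
extras = {f"spec_{i}": OR_DELAY_NS for i in range(3)}
extras.update({f"tag_{i}": OR_DELAY_NS for i in range(5)})

paths = unit_timing_paths(2.5, extras)  # 2.5 ns is a made-up adder delay
worst_delay_ns = max(paths.values())
```

With these made-up numbers the arithmetic path dominates, which is the situation described above where the tag logic does not change the unit's critical delay.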

murphe67 Mar 15, 2025
Maintainer

Yes-

With the addition of tags, this assumption is (weakly) broken in ways that are very low impact.

I think we can ignore the tags for this kind of analysis.

lana555 Mar 15, 2025
Maintainer

Yes there are a couple of real Fork and Join but I wonder (i) if the logic synthesizer will be able to simplify them (because the two operands come from the same Fork)

Note that the fork has registers inside, so this is a sequential logic optimization problem--much more difficult for optimizers than a simple combinational optimization (I am not sure that this would get optimized away by Vivado, for instance).

murphe67 Mar 15, 2025
Maintainer

Also the buffering pass could insert something on any of these edges, theoretically

paolo-ienne · 2025-03-15T10:33:24Z

paolo-ienne
Mar 15, 2025
Maintainer Author

@emmet-murphy: thanks for all these lively exchanges! Call me dumb, though, but I am not clear on what counterproposal you are putting forward--I just understand that you disagree with mine on a few specific points.... 😉

Can I read what you suggest somewhere? Could you give a comprehensive description, even only by difference, that I could relate to? I am truly sorry, but if I were to repeat what, overall, you are suggesting to do instead (or how you see the information flow) I would not know where to start.

2 replies

murphe67 Mar 15, 2025
Maintainer

For sure! We've changed things as little as possible from what was happening before (we really had no intention of implementing any of this, and were just prioritizing making speculation functional ASAP), with the goal of simple automation of what was previously "hardcoded".

I'm going to first copy-paste the VHDL for the "muli" op from our documentation-

-- Note: Headers and entity are generated by `generate_entity`.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.types.all;

-- Entity of signal manager
entity handshake_muli_0 is
  port(
    clk : in std_logic;
    rst : in std_logic;
    lhs : in std_logic_vector(32 - 1 downto 0);
    lhs_valid : in std_logic;
    lhs_ready : out std_logic;
    lhs_spec : in std_logic_vector(1 - 1 downto 0);
    rhs : in std_logic_vector(32 - 1 downto 0);
    rhs_valid : in std_logic;
    rhs_ready : out std_logic;
    rhs_spec : in std_logic_vector(1 - 1 downto 0);
    result : out std_logic_vector(32 - 1 downto 0);
    result_valid : out std_logic;
    result_ready : in std_logic;
    result_spec : out std_logic_vector(1 - 1 downto 0)
  );
end entity;

-- Architecture of signal manager (buffered)
architecture arch of handshake_muli_0 is
  signal buff_in, buff_out : std_logic_vector(1 - 1 downto 0);
  signal transfer_in, transfer_out : std_logic;
begin
  -- Transfer signal assignments
  transfer_in <= lhs_valid and lhs_ready;
  transfer_out <= result_valid and result_ready;

  -- Note: `forwarded_extra_signals` is {"spec": "lhs_spec or rhs_spec"}.
  -- Concat/split extra signals for buffer input/output
  buff_in(0 downto 0) <= lhs_spec or rhs_spec;
  result_spec <= buff_out(0 downto 0);

  inner : entity work.handshake_muli_0_inner(arch)
    port map(
      clk => clk,
      rst => rst,
      -- Note: these port forwardings are generated by `generate_inner_port_forwarding`.
      lhs => lhs,
      lhs_valid => lhs_valid,
      lhs_ready => lhs_ready,
      rhs => rhs,
      rhs_valid => rhs_valid,
      rhs_ready => rhs_ready,
      result => result,
      result_valid => result_valid,
      result_ready => result_ready
    );

  -- Generate ofifo to store extra signals
  -- num_slots = 4, bitwidth = 1
  buff : entity work.handshake_muli_0_buff(arch)
    port map(
      clk => clk,
      rst => rst,
      ins => buff_in,
      ins_valid => transfer_in,
      ins_ready => open,
      outs => buff_out,
      outs_valid => open,
      outs_ready => transfer_out
    );
end architecture;

And then draw this visually-

So on a netlist level, we just see this-

murphe67 Mar 15, 2025
Maintainer

If we add one of @AyaElAkhras's tags to this, it becomes-

(The comments on the VHDL also show this, though we don't have a written out example)

These "signal manager" wrappers come in 3 or 4 variants (we are still exploring exact configurations and writing up documentation 😁), and mean the extra signals are independent of the underlying unit, as @lana555 mentions.

paolo-ienne · 2025-03-15T12:37:34Z

paolo-ienne
Mar 15, 2025
Maintainer Author

I am answering primarily #348 (reply in thread) and #348 (reply in thread). I am not totally clear but I understand that there is a single generator for a broad set of operators and the generator itself will essentially instantiate hierarchically in a wrapper the component itself and the metadata processing circuitry (not sure whether such a wrapper is fixed, chosen out of a few cases, or generated for the specific situation). I assume the presented examples come from that generation process (although it is not quite said) and that the wrapper can be generated for any possible combination of speculation and tagging known today.

This does not seem unreasonable to me but I wonder if the advantages outweigh the shortcomings. There are some serious disadvantages I can think of: (i) If I am inventing some new way of constructing circuits, you are forcing me to go down to RTL generation to implement what I need, even if it is fundamentally unneeded (it is clearly needed neither for processing tags nor for speculative bits, as I tried to argue). (ii) It is not clear how many generators we will have, for I suppose the metadata processing logic is different for different components (it may be the same for Add and Mul, but is it the same for Branch?); I think it would be worth expressing more explicitly how many generators we will need to have and what parameters they will need to achieve generality--and yet such generality will be anyway fundamentally limited to what we can imagine today, whereas my proposal carries the whole payload and I cannot think of any limitation that will eventually push people to change existing RTL generators (can you?). (iii) I do not quite seem to understand how others see timing models; my view is essentially modelled on classic EDA modelling strategies (a library of characterized datapoints and interpolation between them) but if I start being essentially hierarchical (e.g., composing an arithmetic operator and a metadata processing block on the fly in a single RTL component), how do I get the timing behaviour? Does the RTL generator take care also of the composition of the timing models and build them dynamically? Or maybe is it the only option to say what someone has essentially said ("let's forget about it", #348 (reply in thread))?

I think a large part of the perception on how lightweight these disadvantages are is based on the assumption that we know a couple of cases of extra bits (speculation and tagging) and we assume that nobody will ever invent something fundamentally different. I find this not to be future proof (who knows what metadata tomorrow someone will want to invent?) and this seems to me terrible for an experimental compiler that we would want as open and flexible as possible.

I hear these advantages for the generators/wrappers: (i) The circuitry will be more efficient. This may be true in some cases (see #348 (reply in thread)) if implemented naively as I was suggesting, but it is certainly completely avoidable for all cases we know of today (e.g., nothing is needed for tagging because a single 1-output Fork, clearly stateless, is sufficient and naturally zero-delay; for speculation, my Fork-Or-Join can be implemented as a single stateless component identical to your logic in the wrapper and placed after or before the operator). Although perfectly identical circuitwise, a new component would be a separate component and, in many cases, an optional one: if I truly need or want it (and maybe I do not), much better to implement one more component/generator than touching a wrapper that serves already N purposes and that I now need to tweak to serve the (N+1)-th (we know how great it is to touch something tricky that someone now gone wrote five years ago...). I think the advantage is simply not there. (ii) Otherwise, the Handshake IR will be more cluttered and some analyses will be made hard by the larger number of components. I see that and it is an objective difficulty. I would think it wiser to find other, maybe more general, ways to handle this, if it is truly a serious problem (compilers handle many thousands of instructions at once and people can live with that). One idea could be to have an attribute, possible for all components, called hide; it may do a couple of things such as make the component disappear gracefully (with some decoration?) in Dotty output, not to clutter things, and forbid the placement of buffers right before it by collapsing the timing model of the preceding component with this one (thus achieving the very same result you guys achieve with the wrapper but without making any assumption on the wrapper timing nor complexifying the timing modelling); maybe this would serve in other places to help the buffering algorithms if they struggle (e.g., invent a pass that hides all components whose timing is below some threshold assuming that they offer an irrelevant buffer placement resolution at the price of creating a more complex optimization problem).

I think we should weigh pros and cons very carefully. In an experimental compiler, to me the most valuable feature is generality and openness to things that, today, we do not have the slightest idea of.

1 reply

murphe67 Mar 15, 2025
Maintainer

Thanks for the consideration plus very detailed reply! It is very nice to properly discuss these things.

For specific points-

more explicitly how many generators we will need to have and what parameters they will need to achieve generality

The goal of the RTL generation process is to implement something simple, comprehensible, and extensible. We pre-implement common use-cases, to reduce code duplication, but there are no rules for how a component should be generated. You could pass an implementation parameter called "tags-paolo" to the backend (per operation) and selectively handle meta-data completely differently on two different add operations.

If you want to do something completely different with the "meta-data" for a specific component inside our backend, you just have to write the VHDL for it.

If I am inventing some new way of constructing circuits, you are forcing me to go down to RTL generation to implement what I need, even if it is fundamentally unneeded

I don't think anyone is forced to do anything? There is nothing stopping someone from doing exactly as you describe with the elastic OR and generating that with our tag-aware backend. If the adder does not receive extra signals in the IR, it will generate a VHDL unit without any extra signals.

whereas my proposal carries the whole payload and I cannot think of any limitation that will eventually push people to change existing RTL generators (can you?).

I'm actually not sure what you mean by "your proposal" here- placing an elastic OR in the IR? In your original post you write in the compute unit section-

It would be told also that the first operand is accompanied by 17 extra bits and the second operand by 3 extra bits. Of course, the extra bits are counted by adding the number of bits in all signals in the payload, except for the operands. In the normal case, this is trivial and there is nothing peculiar. The generated Add will naturally output a sum signal of 6 + 9 bits and 20 extra bits.

This is actually not enough info to pass to the compute units, as different tags should be combined differently based on their names. Concatenating all the meta-data into a single N-bit signal is not possible; we need to maintain their identities properly from when they enter a compute unit to when they exit it.
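
To make the point about identities concrete, here is a minimal Python sketch of name-dependent combination. The specific rules are assumptions chosen for illustration (a speculative bit is ORed, a tag must match on both operands), and `combine_extra_signals` is a hypothetical helper, not a backend function.

```python
def combine_one(name, a, b):
    """Combine one named extra signal that is present on both operands."""
    if name.startswith("spec"):
        # Assumed rule: the result is speculative if either operand is.
        return a | b
    if name.startswith("tag"):
        # Assumed rule: tags must already be aligned across the inputs.
        if a != b:
            raise ValueError(f"misaligned {name}: {a} != {b}")
        return a
    raise KeyError(f"no combination rule for extra signal {name!r}")


def combine_extra_signals(lhs, rhs):
    """Merge the extra signals of two operands, keyed by name."""
    out = {}
    for name in sorted(set(lhs) | set(rhs)):
        if name in lhs and name in rhs:
            out[name] = combine_one(name, lhs[name], rhs[name])
        else:
            # Present on one operand only: forward it unchanged.
            out[name] = lhs.get(name, rhs.get(name))
    return out


result = combine_extra_signals({"spec": 0, "tag_73": 5}, {"spec": 1, "tag_73": 5})
```

A flat count of extra bits per operand would not let the unit tell which bits to OR and which to compare, which is why the names have to survive through the compute unit.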

how do I get the timing behaviour? Does the RTL generator take care also of the composition of the timing models and build them dynamically? Or maybe is it the only option to say what someone has essentially said ("let's forget about it", #348 (reply in thread))?

I wouldn't summarize this conversation as "let's forget about it", I would summarize it as "forwarding extra signals in parallel beside an arithmetic operation does not affect the timing behaviour of the arithmetic unit". If you want variable timing behaviour based on implementation parameters (as @lana555 has asked for several times), then yes, you would need to add dynamic timing modelling. This is totally separate from generation, though.

this seems to me terrible for an experimental compiler that we would want as open and flexible as possible.

I think you have slightly mistaken my (paraphrased) comment of "we have made it very easy to add simple additional functionality to the speculation RTL" for "you must use our speculation RTL wrapper for everything you do".

Our backend was designed with the idea that we have no idea how anyone will want to generate RTL. It supports as much flexibility as possible, and in the general case, trades safety for customizability at each opportunity. There are no rules for how to use it, apart from those enforced by our reliance on the current, very awkward netlist generator. (nightmarish was the word @shundroid used for it yesterday 😁 😅 )

paolo-ienne · 2025-03-15T12:39:10Z

paolo-ienne
Mar 15, 2025
Maintainer Author

@murphe67: On a different note, I am not clear (maybe there is a description somewhere I have not read?) how you identify different parts of the extra bits for metadata processing. How would @AyaElAkhras instantiate in Handshake an Aligner to say that it aligns over tag_73, ignoring all other metadata/extra bits? How would you instantiate a Commit unit or a Speculator to express that they handle some particular speculative_22 extra bit? I suppose they would be annotations to the Aligner component: can you give an example?

1 reply

murphe67 Mar 15, 2025
Maintainer

Yes, it is quite simple to add an additional implementation param to hw.params and then serialize it using the JSON file- this is unchanged from the previous versions of the backend and works very well.

paolo-ienne · 2025-03-17T20:17:01Z

paolo-ienne
Mar 17, 2025
Maintainer Author

Some of the premises of this discussion are slightly at odds with the Type System Specs; the difference is mostly in a clean separation between data and metadata. Largely, whatever is written in this discussion can be very cosmetically adapted to that.

0 replies

paolo-ienne · 2025-03-17T20:19:25Z

paolo-ienne
Mar 17, 2025
Maintainer Author

I close this discussion as it did not seem too successful in eliciting a discussion of the pros and cons of the various decisions, especially on reviewing how decisions at this level would impact RTL generation and timing analysis. We will probably need to come back to these issues at another point in time and in a more constructive setting.

0 replies

shundroid · 2025-03-17T20:34:59Z

shundroid
Mar 17, 2025
Collaborator

@paolo-ienne It seems you just closed this discussion, but I was writing a comment before so please let me post it!

@paolo-ienne Thanks for opening this discussion! I haven’t had time to read everything here, so apologies if I’ve misunderstood something.

First, I want to clarify that this is purely an engineering issue, not a research question.

In an experimental compiler, to me the most valuable feature is generality and openness to things that, today, we do not have the slightest idea of.

Unfortunately, this is incorrect. Generality in the current implementation does not imply openness to future changes—if anything, it often makes modifications harder. We’ve struggled with this over the past months.

I believe this is a key reason behind the differing attitudes toward implementation, so I’ll explain in detail.

Dynamatic’s codebase is inflexible and hard to understand precisely because it was written in an overly generalized way—especially for hypothetical features that never materialized. Generalization for vague future possibilities just adds complexity without real benefit even when those cases finally arise.

For example, export-rtl abstracts RTL matching through RTLRequest, supposedly allowing for higher-level requests beyond simple dependencies. But when we needed it for the signal manager, it was useless. Similarly, RTLWriter is designed to generalize over VerilogWriter and VHDLWriter, yet I’ve heard it’s nearly impossible to extend it for SMV writing. This shows that abstraction without concrete use cases rarely pays off.

A worse case is Handshake IR. A former engineer decided:

From Handshake forward, the original CFG no longer exists

This abstraction, while it may be conceptually valid (e.g., the memory controller doesn't belong to any basic block), made it significantly harder to implement optimization passes requiring a CFG representation (e.g., speculation, fast token delivery). The lesson here is that generalization in one direction often imposes restrictions in another.

This concern applies to your proposal as well. The idea of "adding and removing payload" feels unnatural for speculation signals. Aside from out-of-order tags, the spec bit is never an independent channel—we don’t “add” or “remove” it. Instead, it’s simply used as an accompanying state at CommitOp, for example.

Now, to address your points:

I do not disagree with the content there but (1) I wonder if it is general enough or just part of a bigger issue

We should generalize only for current needs.

and (2) it is one of those significant changes that we should do only once and forever hold our peace.

"we should do only once"- I strongly disagree. That mindset leads to rigid, inflexible code. A healthy codebase allows for change when needed.

Take this as my view on how conceptually things should be represented and how information should flow. In my view, the actual result should be the most elegant and natural implementation of this in the programming environment we are into--I am no judge of that.

I’m not against conceptual clarity. But code should be optimized for the present, not for hypothetical future use cases.

In conclusion, the codebase is not a place for brainstorming. It should reflect what is most useful and ideal for current needs. This clarity makes future generalization easier when truly necessary.

I generally welcome discussions about future directions, but they should be based on or serve as counterproposals to concrete implementations.

0 replies

paolo-ienne · 2025-03-17T20:40:21Z

paolo-ienne
Mar 17, 2025
Maintainer Author

I guess it is time to close this again.... 😉

0 replies

shundroid · 2025-03-19T17:11:32Z

shundroid
Mar 19, 2025
Collaborator

I wasn’t sure whether to post this, but I think it could be useful.

Personally, I'm not even sure whether the concept of extra signals (or "payload with multiple signals," using @paolo-ienne's term) is permanent. I’ve been discussing this with @murphe67 since last December.

While working on speculation, I realized the spec bit is more like a single shared state across the loop. It doesn’t need to accompany every channel—it can be wired directly to relevant units like save-commits and commits with proper buffering. I’m not familiar with out-of-order execution, but since most units (e.g., arithmetic ones) only accept aligned tags across inputs, I’m unsure why the tag needs to be tied to the data (wouldn’t a control-token-like tag accompanying control flow decisions be enough?).

The only practical advantage of extra signals is that they share handshake logic with data. But something like the rigidifier can allow different channels to share handshake logic instead:

Here, the bold lines represent data channels, while the narrow lines represent extra signal (independent) channels. The shared_handshake op reorganizes the handshake logic internally, assigning the same valid/ready signals to all received channels. I mentioned this to @murphe67 last year, but I remember @gioelegott also shared the same idea with a clearer figure in the last meeting—feel free to share that figure if you don’t mind.

So why did I implement speculation using extra signals? After discussing with Emmet, we agreed that our priority was getting speculation working in Dynamatic with our current understanding. It’s common to find better solutions during implementation, but delaying completion wasn’t an option: it’s already late, since speculation is published, seemingly mostly due to the unmodifiable codebase. Altering the approach requires additional engineering effort (including investigation and testing). (Also, another reason is that Emmet didn’t seem fully convinced by this idea.)

To be clear, I’m not arguing whether extra signals will be needed in the future or whether the signal manager should be implemented. My point is that future possibilities are broad and unpredictable, and right now, the signal manager is essential for speculation and out-of-order execution.

If, after speculation and out-of-order execution are fully implemented, someone sees a clear benefit in removing extra signals or replacing them with a payload system, they’re free to do so—but that should be based on their needs at that time, and on the actual implementation. This again ties back to the importance of code flexibility: future changes should be driven by actual implementation needs at that time, not by our present intentions.

Finally, I want to stress that the ongoing backend refactoring is more than just handling extra signals—it’s a broader effort.

4 replies

murphe67 Mar 19, 2025
Maintainer

Yeah my general response to "we should propagate the extra info separately" is "seems high effort low reward"

The best case example is this-

If f(x) and g(x) are both high latency, we buffer the tag information twice despite it being the same.

So there is definitely some area redundancy happening with our current approach.
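
As a back-of-the-envelope illustration of that redundancy, the sketch below counts buffered bits for two parallel paths carrying the same tag. All widths and slot counts are hypothetical numbers picked for the example, and the "tag stored once" figure is only a lower bound that ignores the extra handshake logic an independent tag channel would need.

```python
def buffered_bits(data_bits, tag_bits, slots):
    """Storage bits when each buffer slot holds the data together with its tag."""
    return (data_bits + tag_bits) * slots


# Hypothetical numbers: 32-bit data, 4-bit tag, 6 buffer slots on each path.
data_bits, tag_bits = 32, 4
slots_f, slots_g = 6, 6

# Current approach: the tag travels with the data through f(x) and through g(x).
bundled = buffered_bits(data_bits, tag_bits, slots_f) + buffered_bits(data_bits, tag_bits, slots_g)

# Lower bound if the tag were buffered only once, alongside the deeper path.
tag_once = data_bits * (slots_f + slots_g) + tag_bits * max(slots_f, slots_g)

redundant = bundled - tag_once
```

Under these assumptions the duplicated storage is `tag_bits * min(slots_f, slots_g)`, i.e. 24 bits out of 432, which is the kind of small saving that has to be weighed against the added handshaking.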

But, if we propagate the info independently, I don't think rigidification helps much (for my version of the proposal), because it is genuinely independent: the data will transfer at a different time.

Is the increased handshaking worth the removal of the slight over-buffering? It is not a super promising avenue of exploration, imo, compared to the other (horrific) over-buffering that also happens in speculation, which I think is a better effort-reward situation.

If we instead look at Shun's proposal, I think we can only represent it this way post-buffering? since we have handshake-less values/signal-level representation. So implementation-wise it is high-effort, for non-conceptual reasons.

shundroid Mar 19, 2025
Collaborator

Is the increased handshaking worth the removal of the slight over-buffering? It is not a super promising avenue of exploration, imo,

Yes, I agree with this. I wasn’t suggesting a change here. Just noting that the spec bit is 1-bit, while tags aren’t necessarily (though still relatively small, I suppose).

But, if we propagate the info independently, I don't think rigidification helps much (for my version of the proposal), because it is genuinely independent: the data will transfer at a different time.

Yes, I think there’s a tradeoff between redundant handshakes and buffering. Handshake-sharing works well only if f(x) and g(x) are deterministic (e.g., no memory operations).

If we instead look at Shun's proposal, I think we can only represent it this way post-buffering?

That’s certainly true. Handshake-sharing optimizations should be applied post-buffering.

since we have handshake-less values/signal-level representation.

Yes, I realized this is essentially a signal-level representation. But I don’t see much difference from a rigidifier—can’t a rigidifier also only be placed post-buffering and serve as a signal-level representation? (Maybe a separate discussion.)

murphe67 Mar 19, 2025
Maintainer

Maybe I have misinterpreted your last point, let me write out my understanding:

We add speculation to the circuit using the current extra signals representation

We run speculation-aware buffering to buffer the circuit

We then transfer extra signals to a verbose signal-level representation where the signal handling is not encapsulated inside the operation

On this representation, we can write a complex optimization pass which reduces the resource consumption of the extra signal?

Then we need a different (but simple) backend to generate RTL from the signal level rep

shundroid Mar 21, 2025
Collaborator

I didn't intend to mention signal-level representation.

My figure above is entirely at handshake-interface granularity, including the OR.
It uses the shared_handshake op to virtually pick the handshake of one channel and discard the other.

I think this works in general, but I agree it seems difficult. Maybe I will elaborate on the idea later.
Then I'd like to open another issue or discussion.

P.S. this is a more detailed figure of shared_handshake
