Introduction
Performance optimization algorithms like the one in FPGA20 aim to optimize dataflow subcircuits called CFDFCs, which correspond to one or multiple loops in the control-flow graph. Since then, many papers have built on top of this idea in the following ways:
- New performance optimization algorithms: FPL22 (detailed timing model), MapBuf (even more detailed timing model).
- Algorithms that need information from buffering: FCCM22 (resource sharing needs the buffering decisions), CRUSH (also resource sharing), LSQ sizing (needs the CFDFC II and the buffer positions to estimate the lifetime of memory access entries in the LSQ).
- New CF-to-handshake conversion algorithms: fast token delivery (Implementation of Fast Token Delivery flow within Dynamatic #177: requires a different way of mapping loops in the original program to the CFDFC subgraph), multi-threading (same issue as fast token delivery).
The Requirements
Limitations of the current design. So far, the CFDFC data structure suits (1), but it is not designed with (2) and (3) in mind. I would like to use this issue to start the discussion on a new implementation.
Better Caching Mechanism
Currently, the buffer placement pass logs the performance optimization results directly inside the MLIR file as function attributes. These include:
- For each CFDFC, a list of BBs.
- For each CFDFC, the achieved II result.
Note
Duplicated logic. As seen in the LSQ sizing pass: the CFDFC class in the buffering pass is not the same as the CFDFC class in the LSQ sizing pass. Thus, the LSQ sizing pass needs its own logic for recreating the CFDFCs from the function attributes.
Note
Complex instrumentation. As seen in the sharing pass: the buffer placement passes are instrumented so that the sharing pass can retrieve the performance optimization decisions, and the buffer passes are called internally by the sharing pass (instead of the sharing pass reading a cached result from somewhere).
This approach goes against the modularity principle, but it is fundamentally due to the fact that caching the performance information is hard.
Note
Reliance on BB organization. As I will discuss later, some handshake circuits omit the BB organization. Therefore, a list of BBs is not enough to recover the CFDFC from this kind of circuit.
Thus, I suggest creating a simple Graphviz DOT format for caching the performance optimization decision across the optimization passes. Here is an example:
digraph cfdfc0 {
  // graph attribute
  graph [throughput=0.50];
  // node attribute
  merge1 [BB=1];
  merge1 -> buffer1;
  buffer1 -> branch1;
  ...
}

For each CFDFC, a .DOT file is created, and it encodes the following information:
- The nodes and edges in the CFDFC.
- The throughput reported by the buffer placement pass.
- The BB of each node.
The CFDFC class would be extended to:
- Export the dot graph.
- Parse the dot graph into the CFDFC data structure.
- Check if the dot graph is legal.
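As a rough sketch of what this interface could look like (the names and signatures below are hypothetical and only meant to illustrate the idea, not Dynamatic's current API):

```cpp
// Hypothetical sketch only: names and signatures are illustrative.
#include <optional>
#include <string>

struct CFDFC {
  // ... existing members: units, channels, BB of each unit, achieved II, ...

  /// Serialize this CFDFC (nodes, edges, BB of each node, throughput)
  /// into a Graphviz DOT string like the example above.
  std::string exportDot() const;
};

/// Rebuild a CFDFC from a cached DOT graph; returns std::nullopt if the graph
/// cannot be matched against the units/channels of the circuit.
std::optional<CFDFC> parseCfdfcFromDot(const std::string &dotContents);

/// Check that a cached DOT graph is legal, e.g. every edge corresponds to an
/// actual channel and the throughput/BB attributes are present.
bool verifyCfdfcDot(const std::string &dotContents);
```

With something along these lines, any later pass (sharing, LSQ sizing, ...) could read the cached .DOT files instead of re-running or instrumenting the buffer placement pass.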
General CFDFC Extraction Method
More recent CF-to-handshake conversion passes (#177) omit the BB organization for performance reasons. The current CFDFC extraction logic would not work here because it relies on BB information. Yet, if we can figure out a new way to extract CFDFCs, the performance optimization algorithms should work out of the box.
Current CFDFC extraction logic:
Software profiling returns a list of BB transitions. These transitions are then used to construct a list of "BB cycles". For example (the matvec benchmark):
// An important inner loop
2 -> 2
// A less important outer loop
1 -> 2 -> 3 -> 1
Based on these two BB cycles, we identify:
- the nodes that are in any of the BBs of the cycle,
- the edges that connect two nodes in the same BB, or two nodes whose BBs correspond to a BB transition in the cycle.
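In simplified form, this selection rule looks roughly like the following (the data types are made up for illustration; the real pass operates on MLIR operations and channels):

```cpp
// Simplified sketch of the current node/edge selection rule.
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Node { unsigned id; unsigned bb; };            // unit and its basic block
struct Edge { unsigned src, dst; };                   // channel between two units
struct CFDFC { std::set<unsigned> nodes; std::vector<Edge> edges; };

// cycle:       BBs visited by one BB cycle, e.g. {1, 2, 3} for 1 -> 2 -> 3 -> 1
// transitions: BB-to-BB arcs of that cycle, e.g. {(1,2), (2,3), (3,1)}
CFDFC extractCFDFC(const std::vector<Node> &allNodes,
                   const std::vector<Edge> &allEdges,
                   const std::set<unsigned> &cycle,
                   const std::set<std::pair<unsigned, unsigned>> &transitions) {
  CFDFC cfdfc;
  std::map<unsigned, unsigned> bbOf;
  // 1. Keep every node whose BB belongs to the cycle.
  for (const Node &n : allNodes) {
    bbOf[n.id] = n.bb;
    if (cycle.count(n.bb))
      cfdfc.nodes.insert(n.id);
  }
  // 2. Keep an edge if both endpoints were kept and the edge either stays
  //    inside one BB or follows a BB transition that is part of the cycle.
  for (const Edge &e : allEdges) {
    if (!cfdfc.nodes.count(e.src) || !cfdfc.nodes.count(e.dst))
      continue;
    unsigned srcBB = bbOf[e.src], dstBB = bbOf[e.dst];
    if (srcBB == dstBB || transitions.count({srcBB, dstBB}))
      cfdfc.edges.push_back(e);
  }
  return cfdfc;
}
```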
A possible extraction flow. Ideally, we could:
- extract the BB cycles at the cf level,
- find the predicates (i.e., arith.cmpi, arith.cmpf, etc.) at the cf level that keep the BB cycles running,
- retain the names of these predicates when doing FtdCfToHandshake,
- later, when doing the buffering, use this relation to extract the CFDFCs.
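For the second step, here is a rough sketch of how the loop-controlling predicates could be collected at the cf level (the helper name and the back-edge test are my assumptions; the MLIR accessors are standard ones):

```cpp
// Sketch: collect the operations (typically arith.cmpi/arith.cmpf) whose
// result decides whether a back edge, and thus a BB cycle, is taken.
#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Dominance.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

static SmallVector<Operation *> findLoopPredicates(func::FuncOp funcOp) {
  SmallVector<Operation *> predicates;
  DominanceInfo domInfo(funcOp);
  funcOp.walk([&](cf::CondBranchOp condBr) {
    Block *block = condBr->getBlock();
    // A successor that dominates the branch's own block closes a back edge
    // (this also covers self-loops like 2 -> 2).
    bool closesCycle = domInfo.dominates(condBr.getTrueDest(), block) ||
                       domInfo.dominates(condBr.getFalseDest(), block);
    if (closesCycle)
      if (Operation *def = condBr.getCondition().getDefiningOp())
        predicates.push_back(def); // e.g. the arith.cmpi feeding the branch
  });
  return predicates;
}
```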
Note
Effect of omitting BB organization. Yet, the fast token delivery method would create "weird connections", like 1 -> 3 for the second loop. These connections would be completely ignored by the current logic, which might cause a performance penalty due to the lack of token balancing.
Proposed solution:
(here I am improvising and definitely need your help @AyaElAkhras @paolo-ienne @lana555)
Instead of identifying a list of BB loops, we identify the set of conditions that will "keep the loop running".
For instance, for the same loop 2 -> 2, we might get a hypothetical set of conditions:
// The loop keeps going as long as cmpi0 outputs true
FORMULA := cmpi0
From here, we can propagate this information to the rest of the circuit to remove inactive parts when the loop is running. For instance:
- recursively remove the branch output that is not selected by the condition
- recursively remove the mux input that is not selected by the condition
- (maybe there are more?)
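To make the pruning idea a bit more concrete, here is one possible worklist formulation for the downstream direction (removing everything that is only reachable through an unselected branch output); the mux case would walk upstream symmetrically. The data model is made up purely for illustration:

```cpp
// Rough worklist sketch of the pruning idea; not Dynamatic's API.
#include <deque>
#include <set>
#include <vector>

struct Unit {
  std::vector<int> preds; // ids of units feeding this unit
  std::vector<int> succs; // ids of units fed by this unit
};

// inactiveSeeds: units that directly consume the branch output which the loop
// condition (e.g. FORMULA := cmpi0) never selects while the loop is running.
std::set<int> pruneInactive(const std::vector<Unit> &units,
                            const std::set<int> &inactiveSeeds) {
  std::set<int> removed(inactiveSeeds);
  std::deque<int> worklist(inactiveSeeds.begin(), inactiveSeeds.end());
  while (!worklist.empty()) {
    int id = worklist.front();
    worklist.pop_front();
    for (int succ : units[id].succs) {
      if (removed.count(succ))
        continue;
      // A successor becomes inactive once all of its inputs are inactive.
      bool allPredsRemoved = true;
      for (int pred : units[succ].preds)
        allPredsRemoved = allPredsRemoved && removed.count(pred);
      if (allPredsRemoved) {
        removed.insert(succ);
        worklist.push_back(succ);
      }
    }
  }
  return removed;
}
```

Whatever survives this pruning could then serve as the CFDFC candidate handed to the performance optimization algorithm.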
Remarks
What do you think? I'd like to hear your thoughts on this :D