@ziadomalik ziadomalik commented Jan 17, 2026

Status: 17.01.25

This PR is not mergeable yet. In trying to latency- and occupancy-balance fir.c, I only got:

  • ~3000 cycles normally
  • ~2000 cycles with this hack

Context on the hack that gets me to ~2000 cycles

The multiply latency seems to be 5, so the LP accordingly places 5 DV buffers.
When I did the manual placement, I used 6 DV buffers (I think we debugged this together).
As I remember it, for II=1 with a 5-cycle component we need 6 DV buffers because:

  • Within the first 5 cycles, 5 tokens enter while the first token traverses the multiplier.
  • We then need one extra buffer while the first token's output propagates to the join.

Is this reasoning correct? If so, which constraint in the paper accounts for it? Or is my mistake somewhere else?
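
The counting argument above can be checked on a toy cycle-level model. The sketch below is an illustration under stated assumptions, not Dynamatic's handshake semantics: a fork feeds a fully pipelined unit of latency `latency` on one branch and a buffer with `slots` entries on the other, and a join consumes one token from each. The function `simulate` and the flag `free_same_cycle` are hypothetical names; the flag chooses whether a slot released by the join can be refilled in the same cycle.

```python
from collections import deque


def simulate(latency, slots, cycles, free_same_cycle):
    """Count join firings in a toy fork/join with one pipelined branch."""
    exit_times = deque()  # cycle at which each in-flight token leaves the long branch
    buffered = 0          # tokens waiting in the short-branch buffer
    fired_total = 0
    for t in range(cycles):
        # The join fires when the oldest token has finished the long branch;
        # its partner is always waiting in the short-branch buffer.
        fired = bool(exit_times) and exit_times[0] <= t
        if fired:
            exit_times.popleft()
            fired_total += 1
        # Occupancy as seen by the fork when it decides whether to issue.
        seen = buffered - 1 if (fired and free_same_cycle) else buffered
        if seen < slots:
            exit_times.append(t + latency)  # token enters the long branch
            buffered += 1                   # its copy enters the short branch
        if fired:
            buffered -= 1                   # slot is released at the end of the cycle
    return fired_total


if __name__ == "__main__":
    for slots, same in [(5, False), (6, False), (5, True)]:
        print(slots, same, simulate(5, slots, 605, same))
```

In this toy model, five slots sustain II=1 only if a slot can be refilled in the cycle it is released; otherwise a sixth slot is needed, which matches the counting argument. Whether DV buffers behave like the second case is exactly the open question.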

Another Problem with Section 5, Equation 11:

The paper says:

For each CFC CFC_i and each pair of reconvergent paths p_1, p_2 in CFC_i we enforce:

$$ Occupancy_{\text{CFC}_i}(p_1) = Occupancy_{\text{CFC}_i}(p_2) $$

When this constraint is enabled, the solver concentrates occupancy on the wrong channels.

  • Wanted: fork5 → N=6, fork3 → N=6 (matching the L values)
  • Got: mux1 → N=5, fork5 → N=4 (4 FIFO + 1 DV instead of 6 DV)

The solver satisfies the path sum cheaply by putting occupancy on narrow channels (the mux output) instead of where LP1 added latency (the fork outputs). How should Equation 11 be implemented correctly? Is there a constraint I am missing that ties occupancy to specific channels? For now I have just disabled it.
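
To make the failure mode concrete, here is a minimal sketch with made-up numbers. It assumes that path occupancy is the sum of per-channel occupancies N_c, which is what the "path sum" wording in the log suggests. The channel names `fork_out` and `mux_out`, the fixed occupancy of 5 for the long path, and the helper `check` are all hypothetical. The point is only that the path equality by itself does not say which channel on the short path carries the tokens.

```python
import math


def check(short_path_occ, long_path_occ, extra_latency, ii=1):
    """Return (paths_balanced, per_channel_bound_ok) for a candidate assignment."""
    balanced = sum(short_path_occ.values()) == long_path_occ
    bound_ok = all(
        short_path_occ[c] >= math.ceil(extra_latency[c] / ii)
        for c in short_path_occ
    )
    return balanced, bound_ok


# Hypothetical LP1 result: all extra latency sits on the fork output.
extra_latency = {"fork_out": 5, "mux_out": 0}
long_path_occ = 5  # assumed occupancy of the path through the multiplier

intended = {"fork_out": 5, "mux_out": 0}  # occupancy where the latency was added
shifted = {"fork_out": 0, "mux_out": 5}   # same path sum, different channel

print(check(intended, long_path_occ, extra_latency))  # (True, True)
print(check(shifted, long_path_occ, extra_latency))   # (True, False)
```

Both assignments satisfy the equality; only the per-channel bound N_c >= L_c/II separates them.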

Debug Output:

=== FPGA24 Buffer Placement ===
Found 1 CFDFCs
Computed 17 SCCs in CFDFC with 30 nodes. Non-cyclic subgraph has 19 edges.
Found 5 cycles in CFDFC.
Found 9 joins in CFDFC.
Found 0 synchronizing cycle pairs.
Found 0 synchronizing cycle pairs
Enumerated 3 transition sequences
  Processing sequence 1/3 (found 0 unique paths, 0 duplicates skipped)
Found 46 reconvergent paths from 10 forks and 18 joins.
Found 7 reconvergent paths from 7 forks and 9 joins.
  Processing sequence 3/3 (found 33 unique paths, 20 duplicates skipped)
Found 18 reconvergent paths from 7 forks and 9 joins.
Found 44 unique reconvergent paths across 2 graphs (27 duplicates skipped)
=== Setting up LP1 (Latency Balancing) ===
[LP1] Adding latency variables...
[LP1]   Found 46 channels (patterns + CFDFCs)
[LP1]   Created 46 channel variables
[LP1]   Created 44 reconvergent path vars, 0 sync cycle vars
[LP1] Adding reconvergent path constraints (44 paths)...
[LP1]   Processing reconvergent path 1/44
[LP1]     -> 2 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 3 simple paths
[LP1] Reconvergent path 2: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=6, L_c vars=6
  Path 1: base=5.000000e+00, edges=5, L_c vars=5
  Path 2: base=0.000000e+00, edges=6, L_c vars=6
  [DETAIL] Channels on each path:
    Path 0 (base=5.000000e+00):
      fork2_outs_0_ins_trunci0 [HAS L_c]
      trunci0_outs_rhs_subi0 [HAS L_c]
      subi0_result_addrIn_load1 [HAS L_c]
      load1_dataOut_rhs_muli0 [HAS L_c]
      muli0_result_rhs_addi0 [HAS L_c]
      addi0_result_data_cond_br2 [HAS L_c]
    Path 1 (base=5.000000e+00):
      fork2_outs_1_ins_trunci1 [HAS L_c]
      trunci1_outs_addrIn_load0 [HAS L_c]
      load0_dataOut_lhs_muli0 [HAS L_c]
      muli0_result_rhs_addi0 [HAS L_c]
      addi0_result_data_cond_br2 [HAS L_c]
    Path 2 (base=0.000000e+00):
      fork2_outs_2_ins_extsi6 [HAS L_c]
      extsi6_outs_lhs_addi1 [HAS L_c]
      addi1_result_ins_fork4 [HAS L_c]
      fork4_outs_1_lhs_cmpi0 [HAS L_c]
      cmpi0_result_ins_fork5 [HAS L_c]
      fork5_outs_1_condition_cond_br2 [HAS L_c]
[LP1]     -> 4 simple paths
[LP1] Reconvergent path 3: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=7, L_c vars=7
  Path 1: base=5.000000e+00, edges=6, L_c vars=6
  Path 2: base=0.000000e+00, edges=7, L_c vars=7
  Path 3: base=0.000000e+00, edges=9, L_c vars=9
[LP1]     -> 2 simple paths
[LP1]     -> 4 simple paths
[LP1]     -> 6 simple paths
[LP1] Reconvergent path 6: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=0.000000e+00, edges=4, L_c vars=4
  Path 1: base=5.000000e+00, edges=10, L_c vars=10
  Path 2: base=5.000000e+00, edges=9, L_c vars=9
  Path 3: base=0.000000e+00, edges=10, L_c vars=10
  Path 4: base=0.000000e+00, edges=12, L_c vars=12
  Path 5: base=0.000000e+00, edges=5, L_c vars=5
[LP1]     -> 4 simple paths
[LP1]     -> 8 simple paths
[LP1]     -> 14 simple paths
[LP1] Reconvergent path 9: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=10, L_c vars=10
  Path 1: base=5.000000e+00, edges=9, L_c vars=9
  Path 2: base=0.000000e+00, edges=5, L_c vars=5
  Path 3: base=5.000000e+00, edges=11, L_c vars=11
  Path 4: base=5.000000e+00, edges=10, L_c vars=10
  Path 5: base=5.000000e+00, edges=15, L_c vars=15
  Path 6: base=5.000000e+00, edges=14, L_c vars=14
  Path 7: base=5.000000e+00, edges=16, L_c vars=16
  Path 8: base=5.000000e+00, edges=15, L_c vars=15
  Path 9: base=0.000000e+00, edges=11, L_c vars=11
  Path 10: base=5.000000e+00, edges=18, L_c vars=18
  Path 11: base=5.000000e+00, edges=17, L_c vars=17
  Path 12: base=0.000000e+00, edges=13, L_c vars=13
  Path 13: base=0.000000e+00, edges=6, L_c vars=6
[LP1]   Processing reconvergent path 11/44
[LP1]     -> 4 simple paths
[LP1]     -> 8 simple paths
[LP1]     -> 18 simple paths
[LP1] Reconvergent path 12: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=11, L_c vars=11
  Path 1: base=5.000000e+00, edges=10, L_c vars=10
  Path 2: base=0.000000e+00, edges=11, L_c vars=11
  Path 3: base=0.000000e+00, edges=6, L_c vars=6
  Path 4: base=5.000000e+00, edges=12, L_c vars=12
  Path 5: base=5.000000e+00, edges=11, L_c vars=11
  Path 6: base=5.000000e+00, edges=16, L_c vars=16
  Path 7: base=5.000000e+00, edges=15, L_c vars=15
  Path 8: base=0.000000e+00, edges=16, L_c vars=16
  Path 9: base=5.000000e+00, edges=17, L_c vars=17
  Path 10: base=5.000000e+00, edges=16, L_c vars=16
  Path 11: base=0.000000e+00, edges=17, L_c vars=17
  Path 12: base=0.000000e+00, edges=12, L_c vars=12
  Path 13: base=5.000000e+00, edges=19, L_c vars=19
  Path 14: base=5.000000e+00, edges=18, L_c vars=18
  Path 15: base=0.000000e+00, edges=19, L_c vars=19
  Path 16: base=0.000000e+00, edges=14, L_c vars=14
  Path 17: base=0.000000e+00, edges=7, L_c vars=7
[LP1]     -> 3 simple paths
[LP1] Reconvergent path 13: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=7, L_c vars=7
  Path 1: base=5.000000e+00, edges=6, L_c vars=6
  Path 2: base=0.000000e+00, edges=2, L_c vars=2
[LP1]     -> 4 simple paths
[LP1] Reconvergent path 14: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=8, L_c vars=8
  Path 1: base=5.000000e+00, edges=7, L_c vars=7
  Path 2: base=0.000000e+00, edges=8, L_c vars=8
  Path 3: base=0.000000e+00, edges=3, L_c vars=3
[LP1]     -> 5 simple paths
[LP1] Reconvergent path 15: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=9, L_c vars=9
  Path 1: base=5.000000e+00, edges=8, L_c vars=8
  Path 2: base=0.000000e+00, edges=9, L_c vars=9
  Path 3: base=0.000000e+00, edges=11, L_c vars=11
  Path 4: base=0.000000e+00, edges=4, L_c vars=4
[LP1]     -> 2 simple paths
[LP1]     -> 3 simple paths
[LP1]     -> 3 simple paths
[LP1]     -> 6 simple paths
[LP1]   Processing reconvergent path 21/44
[LP1]     -> 8 simple paths
[LP1] Reconvergent path 20: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=9, L_c vars=9
  Path 1: base=5.000000e+00, edges=8, L_c vars=8
  Path 2: base=5.000000e+00, edges=10, L_c vars=10
  Path 3: base=5.000000e+00, edges=9, L_c vars=9
  Path 4: base=0.000000e+00, edges=5, L_c vars=5
  Path 5: base=5.000000e+00, edges=12, L_c vars=12
  Path 6: base=5.000000e+00, edges=11, L_c vars=11
  Path 7: base=0.000000e+00, edges=7, L_c vars=7
[LP1]     -> 3 simple paths
[LP1]     -> 11 simple paths
[LP1] Reconvergent path 22: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=10, L_c vars=10
  Path 1: base=5.000000e+00, edges=9, L_c vars=9
  Path 2: base=0.000000e+00, edges=10, L_c vars=10
  Path 3: base=5.000000e+00, edges=11, L_c vars=11
  Path 4: base=5.000000e+00, edges=10, L_c vars=10
  Path 5: base=0.000000e+00, edges=11, L_c vars=11
  Path 6: base=0.000000e+00, edges=6, L_c vars=6
  Path 7: base=5.000000e+00, edges=13, L_c vars=13
  Path 8: base=5.000000e+00, edges=12, L_c vars=12
  Path 9: base=0.000000e+00, edges=13, L_c vars=13
  Path 10: base=0.000000e+00, edges=8, L_c vars=8
[LP1]     -> 4 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 4 simple paths
[LP1]     -> 6 simple paths
[LP1] Reconvergent path 28: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=8, L_c vars=8
  Path 1: base=5.000000e+00, edges=7, L_c vars=7
  Path 2: base=0.000000e+00, edges=3, L_c vars=3
  Path 3: base=5.000000e+00, edges=10, L_c vars=10
  Path 4: base=5.000000e+00, edges=9, L_c vars=9
  Path 5: base=0.000000e+00, edges=5, L_c vars=5
[LP1]     -> 2 simple paths
[LP1]   Processing reconvergent path 31/44
[LP1]     -> 4 simple paths
[LP1]     -> 8 simple paths
[LP1] Reconvergent path 31: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=9, L_c vars=9
  Path 1: base=5.000000e+00, edges=8, L_c vars=8
  Path 2: base=0.000000e+00, edges=9, L_c vars=9
  Path 3: base=0.000000e+00, edges=4, L_c vars=4
  Path 4: base=5.000000e+00, edges=11, L_c vars=11
  Path 5: base=5.000000e+00, edges=10, L_c vars=10
  Path 6: base=0.000000e+00, edges=11, L_c vars=11
  Path 7: base=0.000000e+00, edges=6, L_c vars=6
[LP1]     -> 3 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 2 simple paths
[LP1]     -> 4 simple paths
[LP1]     -> 6 simple paths
[LP1] Reconvergent path 37: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=11, L_c vars=11
  Path 1: base=5.000000e+00, edges=10, L_c vars=10
  Path 2: base=0.000000e+00, edges=6, L_c vars=6
  Path 3: base=5.000000e+00, edges=10, L_c vars=10
  Path 4: base=5.000000e+00, edges=9, L_c vars=9
  Path 5: base=0.000000e+00, edges=5, L_c vars=5
[LP1]     -> 2 simple paths
[LP1]     -> 4 simple paths
[LP1]   Processing reconvergent path 41/44
[LP1]     -> 8 simple paths
[LP1] Reconvergent path 40: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=12, L_c vars=12
  Path 1: base=5.000000e+00, edges=11, L_c vars=11
  Path 2: base=0.000000e+00, edges=12, L_c vars=12
  Path 3: base=0.000000e+00, edges=7, L_c vars=7
  Path 4: base=5.000000e+00, edges=11, L_c vars=11
  Path 5: base=5.000000e+00, edges=10, L_c vars=10
  Path 6: base=0.000000e+00, edges=11, L_c vars=11
  Path 7: base=0.000000e+00, edges=6, L_c vars=6
[LP1]     -> 3 simple paths
[LP1]     -> 3 simple paths
[LP1] Reconvergent path 42: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=8, L_c vars=8
  Path 1: base=5.000000e+00, edges=7, L_c vars=7
  Path 2: base=0.000000e+00, edges=3, L_c vars=3
[LP1]   Processing reconvergent path 44/44
[LP1]     -> 4 simple paths
[LP1] Reconvergent path 43: base latencies differ by 5.000000e+00 (min=0.000000e+00, max=5.000000e+00)
  Path 0: base=5.000000e+00, edges=9, L_c vars=9
  Path 1: base=5.000000e+00, edges=8, L_c vars=8
  Path 2: base=0.000000e+00, edges=9, L_c vars=9
  Path 3: base=0.000000e+00, edges=4, L_c vars=4
[LP1] Adding sync cycle constraints (0 pairs)...
[LP1] Adding stall propagation constraints...
[LP1] Adding cycle time constraints...
[LP1]   CFDFC 0: 5 cycles, II_CFC = 1.000000e+00 (max base latency = 0.000000e+00)
[LP1] Setting objective...
[LP1] Setup complete.
=== Optimizing LP1 ===
LP1 optimization complete.
Channel mux0_outs_ins_fork2: extra_latency=0, stalled=0.000000e+00
Channel fork4_outs_1_lhs_cmpi0: extra_latency=0, stalled=0.000000e+00
Channel br1_outs_ins_fork1: extra_latency=0, stalled=0.000000e+00
Channel fork0_outs_2_ins_br2: extra_latency=0, stalled=0.000000e+00
Channel fork5_outs_0_condition_cond_br1: extra_latency=0, stalled=0.000000e+00
Channel fork0_outs_0_ctrl_constant1: extra_latency=0, stalled=0.000000e+00
Channel muli0_result_rhs_addi0: extra_latency=0, stalled=0.000000e+00
Channel fork4_outs_0_ins_trunci3: extra_latency=0, stalled=0.000000e+00
Channel cond_br2_trueOut_ins_1_mux1: extra_latency=0, stalled=0.000000e+00
Channel fork2_outs_2_ins_extsi6: extra_latency=1, stalled=0.000000e+00
Channel control_merge0_outs_data_cond_br3: extra_latency=1, stalled=0.000000e+00
Channel constant3_outs_ins_extsi8: extra_latency=0, stalled=0.000000e+00
Channel fork3_outs_1_index_mux1: extra_latency=4, stalled=0.000000e+00
Channel addi0_result_data_cond_br2: extra_latency=0, stalled=0.000000e+00
Channel fork3_outs_0_index_mux0: extra_latency=0, stalled=0.000000e+00
Channel br2_outs_ins_0_control_merge0: extra_latency=0, stalled=0.000000e+00
Channel trunci1_outs_addrIn_load0: extra_latency=0, stalled=0.000000e+00
Channel load0_dataOut_lhs_muli0: extra_latency=0, stalled=0.000000e+00
Channel source1_outs_ctrl_constant3: extra_latency=0, stalled=0.000000e+00
Channel fork1_outs_1_ins_extsi5: extra_latency=4, stalled=0.000000e+00
Channel fork2_outs_0_ins_trunci0: extra_latency=0, stalled=0.000000e+00
Channel extsi8_outs_rhs_addi1: extra_latency=0, stalled=0.000000e+00
Channel constant7_outs_ins_trunci2: extra_latency=0, stalled=0.000000e+00
Channel cond_br3_trueOut_ins_1_control_merge0: extra_latency=0, stalled=0.000000e+00
Channel cond_br1_trueOut_ins_1_mux0: extra_latency=0, stalled=0.000000e+00
Channel source2_outs_ctrl_constant7: extra_latency=0, stalled=0.000000e+00
Channel source0_outs_ctrl_constant2: extra_latency=0, stalled=0.000000e+00
Channel addi1_result_ins_fork4: extra_latency=0, stalled=0.000000e+00
Channel trunci0_outs_rhs_subi0: extra_latency=0, stalled=0.000000e+00
Channel extsi6_outs_lhs_addi1: extra_latency=0, stalled=0.000000e+00
Channel extsi7_outs_rhs_cmpi0: extra_latency=0, stalled=0.000000e+00
Channel cmpi0_result_ins_fork5: extra_latency=0, stalled=0.000000e+00
Channel load1_dataOut_rhs_muli0: extra_latency=0, stalled=0.000000e+00
Channel mux1_outs_lhs_addi0: extra_latency=1, stalled=0.000000e+00
Channel constant1_outs_ins_br1: extra_latency=0, stalled=0.000000e+00
Channel constant2_outs_ins_extsi7: extra_latency=0, stalled=0.000000e+00
Channel extsi4_outs_ins_0_mux0: extra_latency=0, stalled=0.000000e+00
Channel trunci3_outs_data_cond_br1: extra_latency=0, stalled=0.000000e+00
Channel fork2_outs_1_ins_trunci1: extra_latency=0, stalled=0.000000e+00
Channel fork5_outs_2_condition_cond_br3: extra_latency=0, stalled=0.000000e+00
Channel fork5_outs_1_condition_cond_br2: extra_latency=4, stalled=0.000000e+00
Channel control_merge0_index_ins_fork3: extra_latency=0, stalled=0.000000e+00
Channel subi0_result_addrIn_load1: extra_latency=0, stalled=0.000000e+00
Channel trunci2_outs_lhs_subi0: extra_latency=0, stalled=0.000000e+00
Channel extsi5_outs_ins_0_mux1: extra_latency=0, stalled=0.000000e+00
Channel fork1_outs_0_ins_extsi4: extra_latency=0, stalled=0.000000e+00
LP1 computed extra latencies for 46 channels
=== Setting up LP2 (Occupancy Balancing) ===
=== Occupancy Balancing (LP2) ===
[LP2] Setting up Occupancy Balancing LP...
[LP2]   Found 37 CFDFC channels
[LP2]   Target II = 1.000000e+00
[LP2]   Created 37 occupancy variables
[LP2]   Added 37 N_c >= L_c/II constraints
[LP2]   Skipping path sum constraints (relying on N_c >= L_c/II)
[LP2]   Added 3 cycle capacity constraints
[LP2] Setup complete.
LP2 optimization complete.
[LP2] Extracting results...
  fork2_outs_2_ins_extsi6: L=1, N=1, occ=1.000000e+00 -> DV=1, FIFO=0, R=0
  control_merge0_outs_data_cond_br3: L=1, N=1, occ=1.000000e+00 -> DV=1, FIFO=0, R=1
  cond_br2_trueOut_ins_1_mux1: L=0, N=1, occ=1.000000e+00 -> DV=0, FIFO=1, R=0
  fork3_outs_1_index_mux1: L=4, N=4, occ=4.000000e+00 -> DV=4, FIFO=0, R=0
  cond_br1_trueOut_ins_1_mux0: L=0, N=1, occ=1.000000e+00 -> DV=0, FIFO=1, R=0
  mux1_outs_lhs_addi0: L=1, N=1, occ=1.000000e+00 -> DV=1, FIFO=0, R=1
  fork5_outs_1_condition_cond_br2: L=4, N=4, occ=4.000000e+00 -> DV=4, FIFO=0, R=0
  cond_br3_trueOut_ins_1_control_merge0: L=0, N=1, occ=1.000000e+00 -> DV=0, FIFO=1, R=0
  Adding DV+R for merge-like: mux0_outs_ins_fork2
  Adding DV+R for merge-like: mux1_outs_lhs_addi0
  Adding DV+R for merge-like: control_merge0_outs_data_cond_br3
  Adding DV+R for merge-like: control_merge0_index_ins_fork3
Final buffer placement:
  fork2_outs_2_ins_extsi6: DV=1, FIFO=0, R=0
  control_merge0_outs_data_cond_br3: DV=1, FIFO=0, R=1
  cond_br2_trueOut_ins_1_mux1: DV=0, FIFO=1, R=0
  fork3_outs_1_index_mux1: DV=4, FIFO=0, R=0
  cond_br1_trueOut_ins_1_mux0: DV=0, FIFO=1, R=0
  mux1_outs_lhs_addi0: DV=1, FIFO=0, R=1
  fork5_outs_1_condition_cond_br2: DV=4, FIFO=0, R=0
  cond_br3_trueOut_ins_1_control_merge0: DV=0, FIFO=1, R=0
  mux0_outs_ins_fork2: DV=1, FIFO=0, R=1
  control_merge0_index_ins_fork3: DV=1, FIFO=0, R=1
=== FPGA24 Buffer Placement Complete ===
Placed buffers on 10 channels
mlir-asm-printer: Verifying operation: builtin.module
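
The eight LP2 result lines above are consistent with a simple split of each channel's occupancy: DV slots cover the extra latency L, and any remaining capacity becomes FIFO slots. The sketch below encodes that reading. It is inferred from the log lines only, the function name `split_slots` is hypothetical, and the R column is left out because the log does not show how it is derived.

```python
def split_slots(extra_latency, occupancy):
    """Split N slots into (DV, FIFO) the way the LP2 log lines suggest."""
    dv = extra_latency                        # one DV slot per cycle of added latency
    fifo = max(0, occupancy - extra_latency)  # remaining capacity becomes FIFO slots
    return dv, fifo


# (L, N) -> (DV, FIFO) pairs copied from the log above.
logged = {
    (1, 1): (1, 0),  # fork2_outs_2_ins_extsi6
    (0, 1): (0, 1),  # cond_br2_trueOut_ins_1_mux1
    (4, 4): (4, 0),  # fork3_outs_1_index_mux1
}
for (l, n), expected in logged.items():
    assert split_slots(l, n) == expected
```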

Also, the code is more modular now: the logic has been taken out of the constructor, and a data flow graph describes a single thing, not an array of things.
@ziadomalik ziadomalik changed the title from "[Buffers][NOT COMPLETE]Feat/ziad/milp" to "[Buffers][NOT COMPLETE] Add latency and occupancy balancing constraints." Jan 17, 2026
@ziadomalik ziadomalik requested a review from Jiahui17 January 17, 2026 00:43