Hello,
Thank you for your great work on InfiniGen!
I have a question regarding the prefetch–compute overlap logic. According to Figure 8 ("Operation flow of the prefetching module of InfiniGen"), the attention computation of Layer (i − 1) is expected to run asynchronously and in parallel with the prefetching for Layer i.
However, when reading the implementation in `speedup/flexgen/infinigen/flex_opt.py`, specifically the `OptLM` class used during decoding, I found a potential bottleneck in `generation_loop_normal()`.
The loop contains the following sequence:
```python
for k in range(self.num_gpu_batches):
    self.load_cache(i, j, k, overlap=False)
    self.load_hidden(i, j, k)
    if (j in self.attn_layer[1:-1]) and (i > 0):
        self.sync()
    self.compute_layer(i, j, k)
    self.sync()
    self.store_hidden(i, j, k)
    self.store_cache(i, j, k, overlap=False)
    if j in self.attn_layer[1:-1] and (i > 0):
        self.prefetch_cache(i, j, k, overlap=True)
        self.prefetch_evt.record()
```
Because `torch.cuda.synchronize()` (called inside `self.sync()`) is a device-wide barrier, it blocks until work on all CUDA streams on the GPU has completed. As a result, the `sync()` after `compute_layer(i, j, k)` forces the prefetching (`prefetch_cache`) to wait until the attention compute finishes, preventing the intended overlap between compute and prefetch.
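For reference, here is a minimal standalone illustration of that barrier semantics (a toy example of my own, not code from the repository; the tensor and stream names are made up). Work is issued on two independent streams, and the device-wide `torch.cuda.synchronize()` stalls the host until both the compute and the copy have drained, so nothing issued afterwards can overlap with either of them:

```python
import torch

compute_stream = torch.cuda.Stream()
prefetch_stream = torch.cuda.Stream()

a = torch.randn(4096, 4096, device="cuda")
pinned_src = torch.randn(4096, 4096, pin_memory=True)   # pinned host buffer
dst = torch.empty(4096, 4096, device="cuda")

with torch.cuda.stream(compute_stream):
    b = a @ a                                  # stand-in for attention compute

with torch.cuda.stream(prefetch_stream):
    dst.copy_(pinned_src, non_blocking=True)   # stand-in for KV-cache prefetch

# Device-wide barrier: the host blocks until BOTH streams have drained,
# so any work issued after this point cannot overlap the compute above.
torch.cuda.synchronize()
```

By contrast, an event recorded on the compute stream and waited on only by the operations that actually consume its output would leave the prefetch stream free to keep running.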
This behavior appears inconsistent with the pipeline depicted in the paper, where prefetching for Layer i should overlap with attention computation of Layer (i − 1).
From what I can understand, the current implementation overlaps prefetching only with the load operations, but does not achieve the intended compute–prefetch overlap described in the paper.
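For concreteness, the kind of pipeline I had understood Figure 8 to describe looks roughly like the following (purely my own toy sketch with invented names and stand-in matmuls, not a claim about how the repository should be restructured): the prefetch for layer i is issued on its own stream, an event marks its completion, and only the attention of layer i waits on that event, so the copy overlaps the compute of layer i − 1:

```python
import torch

# Toy pipeline (my own sketch, not the repository's classes): the KV-cache
# "prefetch" for layer i runs on a side stream while the "attention compute"
# of layer i - 1 runs on the compute stream; only layer i's compute waits
# on the event that marks its prefetched cache as ready.
num_layers = 4
compute_stream = torch.cuda.Stream()
prefetch_stream = torch.cuda.Stream()

weight = torch.randn(2048, 2048, device="cuda")
cpu_kv = [torch.randn(2048, 2048, pin_memory=True) for _ in range(num_layers)]
gpu_kv = [torch.empty(2048, 2048, device="cuda") for _ in range(num_layers)]
kv_ready = [torch.cuda.Event() for _ in range(num_layers)]

x = torch.randn(2048, 2048, device="cuda")

for layer in range(num_layers):
    # Issue the prefetch for the NEXT layer before this layer's compute,
    # so the H2D copy can proceed while the matmuls below are running.
    if layer + 1 < num_layers:
        with torch.cuda.stream(prefetch_stream):
            gpu_kv[layer + 1].copy_(cpu_kv[layer + 1], non_blocking=True)
            kv_ready[layer + 1].record(prefetch_stream)

    with torch.cuda.stream(compute_stream):
        if layer > 0:
            # Wait only for THIS layer's prefetched cache, not for all streams.
            compute_stream.wait_event(kv_ready[layer])
            x = x @ gpu_kv[layer]     # stand-in for attention over the cache
        x = x @ weight                # stand-in for the rest of the layer

torch.cuda.synchronize()              # drain everything once, at the end
```

With this structure, the only host-blocking call is the final synchronize, and each layer's compute blocks only on the event for its own cache rather than on the whole device.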
Could you please clarify this discrepancy?
If I have misunderstood the intended execution model, I would greatly appreciate your correction.
Thank you very much for your time and for open-sourcing the implementation.