Any reason for using 4 SP and 32 TP? #74

Hi, authors, great work!
I have a small question about the parallelism setup. It seems ring attention can hide its communication time behind the local attention computation. So why use more tensor parallelism than sequence parallelism (e.g. 32 TP vs. 4 SP during inference) instead of the opposite, given that the communication cost of TP cannot be ignored or overlapped?
Hope you can answer my question. Many thanks~
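
To make the overlap premise concrete, here is a rough back-of-envelope sketch of when one ring attention step can hide its K/V transfer behind the local block computation. It is only an illustration under simplifying assumptions: a full block-vs-block attention step (prefill-like, not single-token decoding), a cost of about 4·c²·d FLOPs per step, one K block and one V block sent per step, and perfect overlap. The function names, the hardware numbers, and the hidden size are hypothetical and not taken from this repository.

```python
def ring_step_times(block_tokens, hidden_size, flops_per_s, link_bytes_per_s,
                    bytes_per_elem=2):
    """Return (compute_s, comm_s) for one ring attention step on one device.

    Assumption: a c-token query block attends to a c-token key/value block.
    """
    # QK^T and PV each cost roughly 2 * c * c * d FLOPs.
    compute_flops = 4 * block_tokens * block_tokens * hidden_size
    # Each step forwards one K block and one V block to the next device.
    comm_bytes = 2 * block_tokens * hidden_size * bytes_per_elem
    return compute_flops / flops_per_s, comm_bytes / link_bytes_per_s


def min_block_tokens_for_overlap(flops_per_s, link_bytes_per_s, bytes_per_elem=2):
    """Smallest block size c for which compute time >= transfer time.

    4*c^2*d / F >= 2*c*d*b / B   <=>   c >= b*F / (2*B)
    """
    return bytes_per_elem * flops_per_s / (2 * link_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical hardware: 100 TFLOP/s of compute, 100 GB/s per link, 16-bit values.
    F, B = 1e14, 1e11
    c_min = min_block_tokens_for_overlap(F, B)
    for c in (500, 1000, 2000):
        compute_s, comm_s = ring_step_times(c, 4096, F, B)
        print(c, compute_s >= comm_s, c_min)
```

Under this model the transfer is hidden only when each device's block is large enough relative to the compute-to-bandwidth ratio, so a higher SP degree (smaller blocks per device) makes overlap harder for a fixed sequence length.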
