pierDipi commented Nov 3, 2025

I'm getting the following error:

/tmp/nvshmem_src/src/host/team/team.cu:nvshmem_team_split_strided:63: NVSHMEM API called before NVSHMEM initialization has completed

Move nvshmem_barrier_all() so that it runs before nvshmem_team_split_strided() rather than after it. Team split is a collective operation: every PE in the parent team must call it, and all PEs must reach the call in a synchronized manner.

Without a barrier after nvshmemx_init_attr(), ranks may complete initialization at different times, so some PEs can attempt to split teams while others are still initializing NVSHMEM. This race can cause undefined behavior and incorrect team formation.

The barrier after team split was unnecessary per NVSHMEM documentation: teams are immediately usable after creation without intervening synchronization.
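
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: `split_team_after_barrier` is a hypothetical helper, it assumes nvshmemx_init_attr() has already returned on the calling PE, and the `#else` branch holds stand-in stubs (with arbitrary placeholder values) so the sketch compiles on a machine without NVSHMEM installed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

#if __has_include(<nvshmem.h>)
#include <nvshmem.h>
#else
// Stand-ins so the sketch builds without NVSHMEM. They only record the
// call order and are NOT the real library; the constant values are arbitrary.
typedef int32_t nvshmem_team_t;
struct nvshmem_team_config_t {};
static const nvshmem_team_t NVSHMEM_TEAM_WORLD = 0;
static const nvshmem_team_t NVSHMEM_TEAM_INVALID = -1;
static std::vector<std::string> call_log;

static void nvshmem_barrier_all() { call_log.push_back("barrier_all"); }

static int nvshmem_team_split_strided(nvshmem_team_t, int, int, int,
                                      const nvshmem_team_config_t*, long,
                                      nvshmem_team_t* new_team) {
    call_log.push_back("team_split_strided");
    *new_team = 1;  // pretend a valid team handle was created
    return 0;       // zero signals success
}
#endif

// Hypothetical helper: assumes nvshmemx_init_attr() has already been called.
nvshmem_team_t split_team_after_barrier(int pe_start, int pe_stride, int pe_size) {
    assert(pe_size > 0);
    nvshmem_team_t team = NVSHMEM_TEAM_INVALID;

    // Barrier first, so no PE enters the collective split ahead of the others.
    nvshmem_barrier_all();

    int rc = nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, pe_start, pe_stride,
                                        pe_size, nullptr, 0, &team);

    // No barrier after the split: this PR removes it.
    return rc == 0 ? team : NVSHMEM_TEAM_INVALID;
}
```

A caller would invoke `split_team_after_barrier(0, 1, n)` on every PE right after initialization and treat NVSHMEM_TEAM_INVALID as a failed split.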


Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>