[MISC] Add support of opt-in shared memory for tiled hessian to improve performance.#2629

Open
duburcqa wants to merge 1 commit into Genesis-Embodied-AI:main from duburcqa:optin_shared_memory

@duburcqa
Collaborator

@duburcqa duburcqa commented Mar 31, 2026

Related Issue

Resolves #2626

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@duburcqa duburcqa requested a review from YilingQiao as a code owner March 31, 2026 10:25
@duburcqa
Collaborator Author

duburcqa commented Mar 31, 2026

This snippet currently crashes on CUDA, which prevents this PR from passing.

import quadrants as qd

qd.init(arch=qd.cuda, debug=False, cfg_optimization=False)

@qd.kernel
def func_solve_init(
    nt_H: qd.types.ndarray,
):
    BLOCK_DIM = qd.static(64)
    MAX_DOFS = qd.static(111)  # 111 * 112 * 4 B = 49,728 B, slightly over 48 KiB (49,152 B); 110 would pass

    n_dofs = nt_H.shape[1]
    n_dofs_2 = n_dofs**2
    n_lower_tri = n_dofs * (n_dofs + 1) // 2

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)

        i_pair = tid
        while i_pair < n_lower_tri:
            i_d1 = qd.cast(qd.floor((qd.sqrt(qd.cast(8 * i_pair + 1, qd.f32)) - 1.0) / 2.0), qd.i32)
            if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:
                i_d1 = i_d1 + 1
            i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
            H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_pair = i_pair + BLOCK_DIM

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)

        i_flat = tid
        while i_flat < n_dofs_2:
            i_d1 = i_flat // n_dofs
            i_d2 = i_flat % n_dofs
            if i_d2 <= i_d1:
                H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_flat = i_flat + BLOCK_DIM


nt_H = qd.ndarray(dtype=qd.f32, shape=(1, 102, 102))

func_solve_init(nt_H)
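
For reviewers who want to sanity-check the index arithmetic in the first loop without a GPU, here is a minimal plain-Python sketch of the same flat-index to (row, column) decoding. The helper name `decode_lower_tri` is made up for illustration, and Python floats are 64-bit whereas the kernel casts to `qd.f32`, so this only checks the formula, not the kernel's rounding behavior.

```python
import math


def decode_lower_tri(i_pair: int) -> tuple[int, int]:
    """Map a flat lower-triangle index to (row, col), mirroring the snippet's arithmetic.

    Hypothetical helper for illustration only; not part of the PR.
    """
    # Closed-form row estimate: invert the triangular number row * (row + 1) / 2.
    i_d1 = int(math.floor((math.sqrt(8 * i_pair + 1) - 1.0) / 2.0))
    # Bump the row by one if the square root rounded the estimate down.
    if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:
        i_d1 += 1
    # Column is the offset from the first flat index of that row.
    i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
    return i_d1, i_d2


n_dofs = 102  # same size as nt_H in the reproducer
n_lower_tri = n_dofs * (n_dofs + 1) // 2

pairs = [decode_lower_tri(i) for i in range(n_lower_tri)]
expected = [(row, col) for row in range(n_dofs) for col in range(row + 1)]

# Every lower-triangle entry is visited exactly once, in row-major order.
assert pairs == expected
```

The second loop in the snippet reaches the same entries by scanning all `n_dofs**2` flat indices and skipping those with `i_d2 > i_d1`, so it does not need the square root.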

Fixed by Genesis-Embodied-AI/quadrants#442
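
For context on the threshold mentioned in the `MAX_DOFS` comment, here is a small sketch of the footprint arithmetic for the padded `(MAX_DOFS, MAX_DOFS + 1)` `f32` tile. It assumes this tile is the only shared allocation in the block and compares it against the 48 KiB default limit; the function names are hypothetical and nothing here calls into quadrants.

```python
BYTES_PER_F32 = 4
DEFAULT_SHARED_LIMIT = 48 * 1024  # 49,152 bytes available without opting in


def padded_tile_bytes(max_dofs: int) -> int:
    """Bytes used by a (max_dofs, max_dofs + 1) f32 tile (hypothetical helper)."""
    return max_dofs * (max_dofs + 1) * BYTES_PER_F32


def largest_fitting_dofs(limit: int = DEFAULT_SHARED_LIMIT) -> int:
    """Largest max_dofs whose padded tile still fits within `limit` bytes."""
    max_dofs = 0
    while padded_tile_bytes(max_dofs + 1) <= limit:
        max_dofs += 1
    return max_dofs


for max_dofs in (110, 111):
    size = padded_tile_bytes(max_dofs)
    status = "fits" if size <= DEFAULT_SHARED_LIMIT else "exceeds 48 KiB"
    print(f"MAX_DOFS={max_dofs}: {size} bytes ({status})")
# MAX_DOFS=110: 48840 bytes (fits)
# MAX_DOFS=111: 49728 bytes (exceeds 48 KiB)
```

Under that single-tile assumption, 110 is the largest size that fits, which matches the comment in the reproducer.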

Development

Successfully merging this pull request may close these issues.

[Feature]: Support tiled Cholesky for systems with >96 DOFs by opting in to extended GPU shared memory (>48KB)