We have a baseline implementation of the rmm-tree and the corresponding enclose/forward-search operations. They are currently very slow, so we need to investigate whether rmm-tree calculations within a single block can be accelerated with SIMD.
A promising direction is probably 4-bit/8-bit popcount combined with universal lookup tables.
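As a starting point, here is a minimal scalar sketch of the lookup-table idea for in-block forward search. It assumes a balanced-parentheses bit vector where bit 1 = '(' (+1) and bit 0 = ')' (-1), bits read LSB-first within each byte, and the usual definition fwdsearch(i, d) = smallest j > i with excess(j) - excess(i) = d. The names `fwd_search_block`, `EXCESS`, `MIN_PREF` and `MAX_PREF` are hypothetical, not taken from our baseline. Each 256-entry table is indexed by a whole byte, so a single lookup decides whether the target excess is reached inside that byte. Note that `EXCESS[b] == 2 * popcount(b) - 8`, which is where popcount enters. This does not implement the SIMD part; a vectorized version would likely split bytes into 4-bit nibbles so that 16-entry tables fit a byte-shuffle instruction, which is an untested assumption.

```python
# Hypothetical sketch: byte-wise universal lookup tables for forward search
# inside one rmm-tree block. Assumed convention: bit 1 = '(' (+1),
# bit 0 = ')' (-1), bits consumed LSB-first within each byte.

def _byte_tables():
    excess = [0] * 256    # total excess of the 8 bits (= 2*popcount - 8)
    min_pref = [0] * 256  # minimum prefix excess over prefix lengths 1..8
    max_pref = [0] * 256  # maximum prefix excess over prefix lengths 1..8
    for b in range(256):
        cur, lo, hi = 0, 8, -8
        for k in range(8):
            cur += 1 if (b >> k) & 1 else -1
            lo = min(lo, cur)
            hi = max(hi, cur)
        excess[b], min_pref[b], max_pref[b] = cur, lo, hi
    return excess, min_pref, max_pref


EXCESS, MIN_PREF, MAX_PREF = _byte_tables()


def bit(data, pos):
    """Bit at position pos of the block (LSB-first within each byte)."""
    return (data[pos >> 3] >> (pos & 7)) & 1


def fwd_search_block(data, i, d):
    """Smallest j > i with excess(j) - excess(i) == d, or None if the
    answer is not inside this block. data holds the bytes of one block."""
    n = len(data) * 8
    need = d  # relative excess still to accumulate
    j = i + 1
    # Scan single bits until j is byte-aligned.
    while j < n and (j & 7):
        need -= 1 if bit(data, j) else -1
        if need == 0:
            return j
        j += 1
    # Whole bytes: the excess walk moves in +-1 steps, so it visits every
    # value between its min and max prefix; one lookup tells us whether
    # the target is hit inside this byte.
    while j < n:
        b = data[j >> 3]
        if MIN_PREF[b] <= need <= MAX_PREF[b]:
            cur = 0
            for k in range(8):
                cur += 1 if (b >> k) & 1 else -1
                if cur == need:
                    return j + k
        need -= EXCESS[b]
        j += 8
    return None
```

For example, the byte `75` encodes "(()())()" (bits 1,1,0,1,0,0,1,0), and `fwd_search_block(bytes([75]), 0, -1)` finds the matching close of position 0 at position 5. The byte loop only falls back to a bit scan for the one byte known to contain the answer, which is the part a SIMD version would try to speed up further.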