Conversation
Transurgeon
left a comment
Really nice first iteration on cleaning this up with the matrix class.
One thing that could be added in this PR would be a constructor for right matmul (dense); what do you think?
The speedups from batched matrix-vector products can be added later.
cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, dm->x, n, j_dense, 1,
            0.0, C->x + i, 1);
}
Claude suggests that we could use batched matrix-vector products (via dgemm) with a batch size of 256.
This seems to enable the major speedups, but we can do that in another PR.
Have you actually tried it to see that it's significantly faster? It is certainly possible to do, but it would probably require us to collect blocks of J first, which would likely lead to much more complicated code.
Yeah I did, it makes a big difference. But let's revisit this in another PR.
Awesome! Let's do that in a new PR.
static CSC_Matrix *dense_block_left_mult_sparsity(const Matrix *A,
                                                  const CSC_Matrix *J, int p)
Claude suggests we could do this to avoid dynamic allocation with iVec:
The sparsity function uses iVec (a growable array) to accumulate row indices. Since a dense
A means every non-empty block contributes exactly m rows, we know the nnz per column
upfront. A two-pass approach (count, then fill) avoids dynamic reallocation:
// Pass 1: count nnz per column
for (j = 0; j < k; j++) {
col_nnz = 0;
for each block with entries: col_nnz += m;
Cp[j+1] = Cp[j] + col_nnz;
}
// Pass 2: fill row indices directly into C->i
Transurgeon
left a comment
I think we can merge this!
There seem to be some code formatting issues though.
Uncommented Windows code, remember to add it back later.