Optimization via Minimal Toeplitz #2
GrandTheftWalrus wants to merge 34 commits into RichieHakim:main from
Conversation
…n't work with a different test matrix, and it also currently still scales horribly because the time-complexity optimizations haven't been made (it's doing the matrix multiplication serially in Python on the CPU)
…t where it's doing all the multiplication serially rather than in parallel. I still need to fix how the output is inaccurate for certain input dimensions, but that shouldn't affect the speed
…eedup wasn't real
…t not good enough yet. I think it's still about the same as regular convolution)
… factor of N/K or something, bringing it to O(sKN^3)
…rtions from the original constructor to the new one
…ping the input and output
Nice work. You significantly improved the Toeplitz approach. I am happy to integrate these changes. I haven't fully dug in yet, but I did do some cursory checks on the accuracy of the outputs and the speed relative to the code we were talking about in the issue. The results look accurate; however, the speed appears to be slower than the code sketch I provided. Here is the benchmarking code:

```python
import time
import math

import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.sparse

# %load_ext autoreload
# %autoreload 2
import sparse_convolution as sc


def sparse_convolve(X, k):
    X_coo = X.tocoo()
    k_coo = scipy.sparse.coo_matrix(k)
    k_coo.row = k_coo.row - int(math.ceil(k.shape[0] / 2)) + 1
    k_coo.col = k_coo.col - int(math.ceil(k.shape[1] / 2)) + 1
    dtype_idx = np.int32 if np.iinfo(np.int32).max > max(X_coo.shape) else np.int64
    idx = np.empty(shape=(2, X_coo.nnz * k_coo.nnz), dtype=dtype_idx)
    idx[0] = (k_coo.row[None, :] + X_coo.row[:, None]).ravel()
    idx[1] = (k_coo.col[None, :] + X_coo.col[:, None]).ravel()
    data = (k_coo.data[None, :] * X_coo.data[:, None]).ravel()
    idx_valid = np.all((idx >= 0) & (idx < np.array(X.shape)[:, None]), axis=0)
    out_csr = scipy.sparse.coo_matrix((data[idx_valid], (idx[0][idx_valid], idx[1][idx_valid])), shape=X.shape)
    out_csr = out_csr.tocsr()
    out_csr.sum_duplicates()
    return out_csr


import concurrent.futures

def run_with_timeout(func, params, timeout):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Submit the function to the executor
        future = executor.submit(func, *params)
        try:
            # Wait for the function to complete within the timeout
            result = future.result(timeout=timeout)
            return result
        except concurrent.futures.TimeoutError:
            # If timeout occurs, terminate the process
            executor.shutdown(wait=False, cancel_futures=True)
            return "Function timed out and was terminated."


runtimes = {}
for shape_X in [
    (10, 10),
    (10000, 10),
    (10, 10000),
    (1000, 1000),
    (100, 100000),
    (100000, 100),
    (5000, 5000),
    (30000, 30000),
]:
    for density in [
        0.0001,
        0.001,
        0.01,
        0.1,
        1.0,
    ]:
        def get_X(shape_X, density):
            return scipy.sparse.random(m=shape_X[0], n=shape_X[1], density=density, format='csr')
        n_nz = int(np.prod(shape_X) * density)
        if n_nz > 1000000:
            continue
        X = run_with_timeout(get_X, (shape_X, density), 60)
        if isinstance(X, str):
            if X == "Function timed out and was terminated.":
                print(f"Timeout creating X with shape_X={shape_X}, density={density}")
                continue
        for shape_kernel in [
            (1, 1),
            (2, 2),
            (5, 5),
            (14, 14),
            (25, 25),
        ]:
            if n_nz >= 1000000:
                if shape_kernel[0] > 5:
                    continue
            k = np.random.randn(shape_kernel[0], shape_kernel[1])
            def run_comparison(X, k):
                tic = time.time()
                conv = sc.Toeplitz_convolution2d(
                    x_shape=shape_X,
                    k=k,
                    mode='same',
                    dtype=None,
                    # verbose=2,
                    verbose=False,
                )
                out = conv(
                    x=X,
                    batching=False,
                )
                toc = time.time() - tic
                out_rich = sparse_convolve(X, k)
                toc2 = (time.time() - tic) - toc
                return toc, toc2, toc / toc2
            # out = run_with_timeout(run_comparison, (X, k), 60)
            # if isinstance(out, str):
            #     if out == "Function timed out and was terminated.":
            #         print(f"Timeout running comparison with shape_X={shape_X}, shape_kernel={shape_kernel}, density={density}")
            #         continue
            # else:
            #     toc, toc2, ratio = out
            toc, toc2, ratio = run_comparison(X, k)
            runtimes[(shape_X, shape_kernel, density)] = (toc, toc2, ratio)
            print(f"shape_X={shape_X}, shape_kernel={shape_kernel}, density={density}, n_nz={n_nz}")
            print(f"conv walrus: {toc:.3f}s")
            print(f"conv broadcast: {toc2:.3f}s. Ratio: {ratio:.3f}x")

runtimes
```

Here are the outputs: As is, I'll happily award the basic bounty for this work, and give partial credit on both the time-complexity and testing-suite portions: $50. The dings are due to the slower speed relative to the code I provided, and the testing suite would need to be revised. Let me know if you want to make any revisions before I start making edits.
Hmm, indeed the broadcasting does seem to be faster. It probably can't be beat because it's very simple and efficient. I'm outside right now so I can't take a gander for myself, but I think the minimal Toeplitz method might at least be better suited for batching. If I'm not mistaken, the broadcasting method might run out of memory with significant batch sizes (and possibly take longer). If this is true (and you would prefer speed over batching ability), would it qualify for the full reward? If not, I'll take zee $50
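To make the memory concern concrete, here is a rough back-of-the-envelope sketch (the function name and byte sizes are my own assumptions, not part of either implementation): the broadcasting approach materializes one (row, col, data) triple per nonzero-of-X times nonzero-of-kernel pair before filtering, so its peak intermediate memory grows with nnz(X) * nnz(k).

```python
# Rough peak-memory estimate for the broadcasting approach's intermediates:
# one (row, col, data) triple per (nonzero of X) x (nonzero of k) pair.
# Assumes int64 indices and float64 data (8 bytes each).
def broadcast_mem_bytes(nnz_X, nnz_k, idx_bytes=8, data_bytes=8):
    pairs = nnz_X * nnz_k
    return pairs * (2 * idx_bytes + data_bytes)  # two index rows + one data row

# e.g. a batch of 75 flattened 1500x1500 frames at density 0.1, 5x5 kernel:
nnz_X = int(75 * 1500 * 1500 * 0.1)          # 16,875,000 nonzeros
print(broadcast_mem_bytes(nnz_X, 25) / 1e9)  # 10.125 (GB of intermediates)
```

This ignores the final COO-to-CSR conversion and duplicate summing, so the true peak is somewhat higher; the point is only that the intermediates scale linearly with batch size.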
You are right that the Toeplitz approach scales sub-O(batch_size). Unfortunately, a quick comparison shows that the old Toeplitz implementation scales better than the new one with increasing batch_size. I see a crossover in runtime with these parameters:

```python
shape_X = (75, 1500 * 1500)  ## (batch, height, width)
sparsity_X = 0.001
shape_kernel = (5, 5)

def fast_sparse_rand(m, n, density):
    rows = np.random.randint(0, m, int(m * n * density))
    cols = np.random.randint(0, n, int(m * n * density))
    vals = np.random.randn(int(m * n * density))
    return scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsr()

X = fast_sparse_rand(shape_X[0], shape_X[1], sparsity_X)
k = np.ones((shape_kernel[0], shape_kernel[1]))
```

Batches larger than 75 show the old implementation is faster. I don't mean to shortchange you. Let's look again at the conditions:

- Your implementation does not 'allow for parameter sharing' via 'tiling and stitching' or 'basic memory sharing'.
- Your implementation improves runtime over the code implemented in the repo under reasonable conditions, but it is slower than the code I provided for low / single batch sizes, and will likely be slower than the existing implementation for large batch sizes.

From what I have seen, I think you did a good job, but I have to make some edits. My accounting is roughly
@RichieHakim That sounds good to me 😎 I'll take the offer. I was mostly doing it for the challenge anyway rather than the reward.
@RichieHakim Do you know when you will update and close the Opire bounty?
...send me your Venmo or preferred payment method via email. I need to make some changes to this PR before merging it, and I haven't found the time to do that. Opire waits for the issue to be closed. The problem isn't really solved, so I left the issue open.
Hello, is this project still ongoing? I'd like to jump on, and contribute in the long term also.
Hi @RichieHakim, I'm the co-founder of Opire! I've been keeping a close eye on the bounty you put on Opire and the issue. I was thinking that you might be more interested in proposing this as a Challenge instead of a reward for solving the issue. Challenges are a new feature of Opire that we launched recently. Maybe it could be an "ongoing task" type Challenge and offer the incentive every time someone improves the code performance or something like that. Anyway, I love seeing the community collaborate as they do on this issue, congratulations, good job everyone!
Nicely said, sir. Thanks for this opportunity from both sides.
It basically just constructs the double Toeplitz matrix, but only the columns that will end up being involved in the matrix multiplication. It does this by seeing which rows of the reshaped input matrices are nonzero, and then iterating (in parallel) through the corresponding columns of the double Toeplitz to calculate the values only where the nonzero values should be (by doing some quick maffs). The sparser the input matrix, the less memory it uses and the faster she goes.
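The idea can be sketched in miniature (this is my own illustrative toy, not the PR's implementation — no parallelism, no reuse across batches, and 'full' mode only): each nonzero entry of the input selects exactly one column of the doubly blocked Toeplitz matrix, and that column, reshaped onto the output grid, is just a copy of the kernel stamped at the corresponding offset. So only those columns ever need to be formed.

```python
import numpy as np
from scipy.signal import convolve2d

def sparse_toeplitz_conv2d_full(x, k):
    """'full' 2D convolution built from only the double-Toeplitz columns
    selected by the nonzero entries of x. Column (i, j), reshaped to the
    output grid, is the kernel stamped at offset (i, j)."""
    H, W = x.shape
    Kh, Kw = k.shape
    out = np.zeros((H + Kh - 1, W + Kw - 1), dtype=np.result_type(x, k))
    for i, j in zip(*np.nonzero(x)):              # only nonzero input entries
        out[i:i + Kh, j:j + Kw] += x[i, j] * k    # one Toeplitz column's worth
    return out

x = np.zeros((6, 7)); x[1, 2] = 3.0; x[4, 5] = -1.5   # very sparse input
k = np.arange(6.0).reshape(2, 3)
assert np.allclose(sparse_toeplitz_conv2d_full(x, k), convolve2d(x, k, mode='full'))
```

The work here is proportional to nnz(x) * size(k) rather than to the full Toeplitz matrix, which is the same sparsity win the comment describes.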
Also, I think in the current version of the program, using `dtype=int` results in the output being all zeros. It's fixed in this version though.

P.S. Can I get the $100 reward?
Some comparison examples (abridged)