
Optimization via Minimal Toeplitz #2

Open
GrandTheftWalrus wants to merge 34 commits into RichieHakim:main from GrandTheftWalrus:main

Conversation


GrandTheftWalrus commented Nov 25, 2024

It basically just constructs the double Toeplitz matrix, but only the columns that will end up being involved in the matrix multiplication. It does this by seeing which rows of the reshaped input matrices are nonzero, and then iterating (in parallel) through the corresponding columns of the double Toeplitz to calculate the values only where the nonzero values should be (by doing some quick maffs). The sparser the input matrix, the less memory it uses and the faster she goes.

Also, I think in the current version of the program, using dtype=int results in the output being all zeros. It's fixed in this version though.

P.S. Can I get the $100 reward?
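For the skimmers, here is a hypothetical, simplified sketch of the idea (not the PR's actual code, which parallelizes the loop and handles dtypes and convolution modes): each nonzero of the flattened input selects exactly one column of the double Toeplitz matrix, and that column is just the kernel shifted to the matching output position, so all the zero columns can be skipped entirely.

```python
import numpy as np
import scipy.sparse
import scipy.signal

def minimal_toeplitz_conv(x_csr, k):
    """Sketch: 'full' 2D convolution of a sparse matrix with a dense kernel,
    touching only the double-Toeplitz columns that multiply nonzero entries
    of vec(x)."""
    H, W = x_csr.shape
    kh, kw = k.shape
    coo = x_csr.tocoo()
    out = np.zeros((H + kh - 1, W + kw - 1))
    # The (i*W + j)-th Toeplitz column scatters x[i, j] * k into
    # out[i:i+kh, j:j+kw]; every other column would multiply a zero.
    for i, j, v in zip(coo.row, coo.col, coo.data):
        out[i:i + kh, j:j + kw] += v * k
    return out

# Sanity check against SciPy's dense convolution:
X = scipy.sparse.random(32, 32, density=0.01, format='csr', random_state=0)
k = np.arange(9, dtype=float).reshape(3, 3)
print(np.allclose(minimal_toeplitz_conv(X, k),
                  scipy.signal.convolve2d(X.toarray(), k, mode='full')))  # → True
```

The work is proportional to nnz(x) * nnz(k) instead of the size of the full Toeplitz matrix, which is where the sparsity-dependent speedup comes from.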

Some comparison examples (abridged)

Testing shape: (512, 512, 32, 32), batch_size: 1, density: 0.01, mode: same, matrix type: dense (9/56)
Time taken (new):	           0.08s
Time taken (old):	           6.42s
Speedup vs. old:       77.43x
Conv2d time:        	           0.39s
Speedup vs. conv2d:     4.71x

Testing shape: (512, 512, 32, 32), batch_size: 1, density: 0.01, mode: same, matrix type: sparse (10/56)
Time taken (new):	           0.10s
Time taken (old):	           6.84s
Speedup vs. old:       67.05x
Conv2d time:        	           0.39s
Speedup vs. conv2d:     3.83x

Testing shape: (1000, 1000, 3, 3), batch_size: 1, density: 0.01, mode: same, matrix type: dense (11/56)
Time taken (new):	           0.01s
Time taken (old):	           0.30s
Speedup vs. old:       48.38x
Conv2d time:        	           0.02s
Speedup vs. conv2d:     3.47x

Testing shape: (1000, 1000, 3, 3), batch_size: 1, density: 0.01, mode: same, matrix type: sparse (12/56)
Time taken (new):	           0.01s
Time taken (old):	           0.32s
Speedup vs. old:       23.16x
Conv2d time:        	           0.02s
Speedup vs. conv2d:     1.60x

Testing shape: (4000, 4000, 3, 3), batch_size: 1, density: 0.01, mode: same, matrix type: dense (13/56)
Time taken (new):	           0.20s
Time taken (old):	           2.73s
Speedup vs. old:       13.58x
Conv2d time:        	           0.37s
Speedup vs. conv2d:     1.83x

Testing shape: (4000, 4000, 3, 3), batch_size: 1, density: 0.01, mode: same, matrix type: sparse (14/56)
Time taken (new):	           0.29s
Time taken (old):	           2.94s
Speedup vs. old:        9.98x
Conv2d time:        	           0.37s
Speedup vs. conv2d:     1.27x

Testing shape: (100, 100, 2, 2), batch_size: 1, density: 0.0001, mode: same, matrix type: dense (15/56)
Time taken (new):	           0.00s
Time taken (old):	           0.02s
Speedup vs. old:       72.60x
Conv2d time:        	           0.00s
Speedup vs. conv2d:     1.67x

Testing shape: (100, 100, 10, 10), batch_size: 1, density: 0.0001, mode: same, matrix type: dense (17/56)
Time taken (new):	           0.00s
Time taken (old):	           0.03s
Speedup vs. old:      131.19x
Conv2d time:        	           0.00s
Speedup vs. conv2d:     6.95x

Testing shape: (100, 100, 10, 10), batch_size: 1, density: 0.0001, mode: same, matrix type: sparse (18/56)
Time taken (new):	           0.00s
Time taken (old):	           0.03s
Speedup vs. old:       35.05x
Conv2d time:        	           0.00s
Speedup vs. conv2d:     1.76x

Testing shape: (50, 150, 5, 5), batch_size: 1, density: 0.0001, mode: same, matrix type: dense (19/56)
Time taken (new):	           0.00s
Time taken (old):	           0.01s
Speedup vs. old:       51.31x
Conv2d time:        	           0.00s
Speedup vs. conv2d:     1.53x

Testing shape: (512, 512, 32, 32), batch_size: 1, density: 0.0001, mode: same, matrix type: dense (23/56)
Time taken (new):	           0.00s
Time taken (old):	           6.48s
Speedup vs. old:     4340.45x
Conv2d time:        	           0.38s
Speedup vs. conv2d:   254.18x

Testing shape: (512, 512, 32, 32), batch_size: 1, density: 0.0001, mode: same, matrix type: sparse (24/56)
Time taken (new):	           0.00s
Time taken (old):	           6.61s
Speedup vs. old:     1727.86x
Conv2d time:        	           0.38s
Speedup vs. conv2d:    98.65x


RichieHakim commented Nov 25, 2024

Nice work. You significantly improved the Toeplitz approach. I am happy to integrate these changes.

I haven't fully dug in yet, but I did do some cursory checks on the accuracy of the outputs and the speed relative to the code we were talking about in the issue. It looks like the results are accurate; however, the speed appears to be slower than the code sketch I provided.

Here is the benchmarking code:

import time
import math

import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.sparse

# %load_ext autoreload
# %autoreload 2
import sparse_convolution as sc


def sparse_convolve(X, k):
    X_coo = X.tocoo()
    k_coo = scipy.sparse.coo_matrix(k)

    k_coo.row = k_coo.row - int(math.ceil(k.shape[0] / 2)) + 1
    k_coo.col = k_coo.col - int(math.ceil(k.shape[1] / 2)) + 1

    dtype_idx = np.int32 if np.iinfo(np.int32).max > max(X_coo.shape) else np.int64

    idx = np.empty(shape=(2, X_coo.nnz * k_coo.nnz), dtype=dtype_idx)
    idx[0] = (k_coo.row[None, :] + X_coo.row[:, None]).ravel()
    idx[1] = (k_coo.col[None, :] + X_coo.col[:, None]).ravel()
    data = (k_coo.data[None, :] * X_coo.data[:, None]).ravel()

    idx_valid = np.all((idx >= 0) & (idx < np.array(X.shape)[:, None]), axis=0)
    out_csr = scipy.sparse.coo_matrix((data[idx_valid], (idx[0][idx_valid], idx[1][idx_valid])), shape=X.shape)

    out_csr = out_csr.tocsr()
    out_csr.sum_duplicates()
    
    return out_csr


import concurrent.futures

def run_with_timeout(func, params, timeout):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Submit the function to the executor
        future = executor.submit(func, *params)
        try:
            # Wait for the function to complete within the timeout
            result = future.result(timeout=timeout)
            return result
        except concurrent.futures.TimeoutError:
            # If timeout occurs, terminate the process
            executor.shutdown(wait=False, cancel_futures=True)
            return "Function timed out and was terminated."


runtimes = {}
for shape_X in [
    (10, 10),
    (10000, 10),
    (10, 10000),
    (1000, 1000),
    (100, 100000),
    (100000, 100),
    (5000, 5000),
    (30000, 30000),
]:
    for density in [
        0.0001,
        0.001,
        0.01,
        0.1,
        1.0,
    ]:
        def get_X(shape_X, density):
            return scipy.sparse.random(m=shape_X[0], n=shape_X[1], density=density, format='csr')
        
        n_nz = int(np.prod(shape_X) * density)        
        if n_nz > 1000000:
            continue

        X = run_with_timeout(get_X, (shape_X, density), 60)
        if isinstance(X, str):
            if X == "Function timed out and was terminated.":
                print(f"Timeout creating X with shape_X={shape_X}, density={density}")
                continue

        for shape_kernel in [
            (1, 1),
            (2, 2),
            (5, 5),
            (14, 14),
            (25, 25),
        ]:
            if n_nz >= 1000000:
                if shape_kernel[0] > 5:
                    continue

            k = np.random.randn(shape_kernel[0], shape_kernel[1])

            def run_comparison(X, k):
                tic = time.time()
                conv = sc.Toeplitz_convolution2d(
                    x_shape=shape_X,
                    k=k,
                    mode='same',
                    dtype=None,
                    # verbose=2,
                    verbose=False,
                )
                out = conv(
                    x=X,
                    batching=False,
                )
                toc = time.time() - tic

                out_rich = sparse_convolve(X, k)
                toc2 = (time.time() - tic) - toc
                return toc, toc2, toc / toc2

            # out = run_with_timeout(run_comparison, (X, k), 60)
            # if isinstance(out, str):
            #     if out == "Function timed out and was terminated.":
            #         print(f"Timeout running comparison with shape_X={shape_X}, shape_kernel={shape_kernel}, density={density}")
            #         continue
            # else:
            #     toc, toc2, ratio = out

            toc, toc2, ratio = run_comparison(X, k)

            runtimes[(shape_X, shape_kernel, density)] = (toc, toc2, ratio)
            print(f"shape_X={shape_X}, shape_kernel={shape_kernel}, density={density}, n_nz={n_nz}")
            print(f"conv walrus:    {toc:.3f}s")
            print(f"conv broadcast: {toc2:.3f}s.  Ratio: {ratio:.3f}x")

runtimes

Here are the outputs:

shape_X=(10, 10), shape_kernel=(1, 1), density=0.0001, n_nz=0
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 4.756x
shape_X=(10, 10), shape_kernel=(2, 2), density=0.0001, n_nz=0
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.582x
shape_X=(10, 10), shape_kernel=(5, 5), density=0.0001, n_nz=0
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.708x
shape_X=(10, 10), shape_kernel=(14, 14), density=0.0001, n_nz=0
conv walrus:    0.000s
conv broadcast: 0.000s.  Ratio: 2.569x
shape_X=(10, 10), shape_kernel=(25, 25), density=0.0001, n_nz=0
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.270x
shape_X=(10, 10), shape_kernel=(1, 1), density=0.001, n_nz=0
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.889x
shape_X=(10, 10), shape_kernel=(2, 2), density=0.001, n_nz=0
conv walrus:    0.000s
conv broadcast: 0.000s.  Ratio: 2.466x
shape_X=(10, 10), shape_kernel=(5, 5), density=0.001, n_nz=0
conv walrus:    0.000s
conv broadcast: 0.000s.  Ratio: 2.367x
shape_X=(10, 10), shape_kernel=(14, 14), density=0.001, n_nz=0
conv walrus:    0.000s
conv broadcast: 0.000s.  Ratio: 2.555x
shape_X=(10, 10), shape_kernel=(25, 25), density=0.001, n_nz=0
conv walrus:    0.000s
conv broadcast: 0.000s.  Ratio: 2.445x
shape_X=(10, 10), shape_kernel=(1, 1), density=0.01, n_nz=1
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.923x
shape_X=(10, 10), shape_kernel=(2, 2), density=0.01, n_nz=1
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 4.143x
shape_X=(10, 10), shape_kernel=(5, 5), density=0.01, n_nz=1
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.150x
shape_X=(10, 10), shape_kernel=(14, 14), density=0.01, n_nz=1
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.044x
shape_X=(10, 10), shape_kernel=(25, 25), density=0.01, n_nz=1
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.099x
shape_X=(10, 10), shape_kernel=(1, 1), density=0.1, n_nz=10
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 4.593x
shape_X=(10, 10), shape_kernel=(2, 2), density=0.1, n_nz=10
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.942x
shape_X=(10, 10), shape_kernel=(5, 5), density=0.1, n_nz=10
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.275x
shape_X=(10, 10), shape_kernel=(14, 14), density=0.1, n_nz=10
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.595x
shape_X=(10, 10), shape_kernel=(25, 25), density=0.1, n_nz=10
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.168x
shape_X=(10, 10), shape_kernel=(1, 1), density=1.0, n_nz=100
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 4.325x
shape_X=(10, 10), shape_kernel=(2, 2), density=1.0, n_nz=100
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 3.729x
shape_X=(10, 10), shape_kernel=(5, 5), density=1.0, n_nz=100
conv walrus:    0.001s
conv broadcast: 0.000s.  Ratio: 2.526x
shape_X=(10, 10), shape_kernel=(14, 14), density=1.0, n_nz=100
conv walrus:    0.002s
conv broadcast: 0.001s.  Ratio: 2.462x
shape_X=(10, 10), shape_kernel=(25, 25), density=1.0, n_nz=100
conv walrus:    0.003s
conv broadcast: 0.001s.  Ratio: 2.931x
shape_X=(10000, 10), shape_kernel=(1, 1), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 8.742x
shape_X=(10000, 10), shape_kernel=(2, 2), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 6.634x
shape_X=(10000, 10), shape_kernel=(5, 5), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 7.619x
shape_X=(10000, 10), shape_kernel=(14, 14), density=0.0001, n_nz=10
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 8.904x
shape_X=(10000, 10), shape_kernel=(25, 25), density=0.0001, n_nz=10
conv walrus:    0.004s
conv broadcast: 0.000s.  Ratio: 13.071x
shape_X=(10000, 10), shape_kernel=(1, 1), density=0.001, n_nz=100
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 8.094x
shape_X=(10000, 10), shape_kernel=(2, 2), density=0.001, n_nz=100
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 8.281x
shape_X=(10000, 10), shape_kernel=(5, 5), density=0.001, n_nz=100
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 5.629x
shape_X=(10000, 10), shape_kernel=(14, 14), density=0.001, n_nz=100
conv walrus:    0.004s
conv broadcast: 0.001s.  Ratio: 6.322x
shape_X=(10000, 10), shape_kernel=(25, 25), density=0.001, n_nz=100
conv walrus:    0.007s
conv broadcast: 0.001s.  Ratio: 7.536x
shape_X=(10000, 10), shape_kernel=(1, 1), density=0.01, n_nz=1000
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 6.947x
shape_X=(10000, 10), shape_kernel=(2, 2), density=0.01, n_nz=1000
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 4.945x
shape_X=(10000, 10), shape_kernel=(5, 5), density=0.01, n_nz=1000
conv walrus:    0.003s
conv broadcast: 0.001s.  Ratio: 4.307x
shape_X=(10000, 10), shape_kernel=(14, 14), density=0.01, n_nz=1000
conv walrus:    0.016s
conv broadcast: 0.003s.  Ratio: 4.877x
shape_X=(10000, 10), shape_kernel=(25, 25), density=0.01, n_nz=1000
conv walrus:    0.042s
conv broadcast: 0.008s.  Ratio: 5.112x
shape_X=(10000, 10), shape_kernel=(1, 1), density=0.1, n_nz=10000
conv walrus:    0.004s
conv broadcast: 0.001s.  Ratio: 7.554x
shape_X=(10000, 10), shape_kernel=(2, 2), density=0.1, n_nz=10000
conv walrus:    0.005s
conv broadcast: 0.002s.  Ratio: 3.113x
shape_X=(10000, 10), shape_kernel=(5, 5), density=0.1, n_nz=10000
conv walrus:    0.016s
conv broadcast: 0.008s.  Ratio: 2.016x
shape_X=(10000, 10), shape_kernel=(14, 14), density=0.1, n_nz=10000
conv walrus:    0.105s
conv broadcast: 0.050s.  Ratio: 2.109x
shape_X=(10000, 10), shape_kernel=(25, 25), density=0.1, n_nz=10000
conv walrus:    0.339s
conv broadcast: 0.126s.  Ratio: 2.688x
shape_X=(10000, 10), shape_kernel=(1, 1), density=1.0, n_nz=100000
conv walrus:    0.010s
conv broadcast: 0.002s.  Ratio: 6.255x
shape_X=(10000, 10), shape_kernel=(2, 2), density=1.0, n_nz=100000
conv walrus:    0.024s
conv broadcast: 0.011s.  Ratio: 2.220x
shape_X=(10000, 10), shape_kernel=(5, 5), density=1.0, n_nz=100000
conv walrus:    0.120s
conv broadcast: 0.070s.  Ratio: 1.726x
shape_X=(10000, 10), shape_kernel=(14, 14), density=1.0, n_nz=100000
conv walrus:    1.011s
conv broadcast: 0.508s.  Ratio: 1.989x
shape_X=(10000, 10), shape_kernel=(25, 25), density=1.0, n_nz=100000
conv walrus:    3.263s
conv broadcast: 1.376s.  Ratio: 2.372x
shape_X=(10, 10000), shape_kernel=(1, 1), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 8.524x
shape_X=(10, 10000), shape_kernel=(2, 2), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 7.954x
shape_X=(10, 10000), shape_kernel=(5, 5), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 7.281x
shape_X=(10, 10000), shape_kernel=(14, 14), density=0.0001, n_nz=10
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 8.470x
shape_X=(10, 10000), shape_kernel=(25, 25), density=0.0001, n_nz=10
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 9.703x
shape_X=(10, 10000), shape_kernel=(1, 1), density=0.001, n_nz=100
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 10.026x
shape_X=(10, 10000), shape_kernel=(2, 2), density=0.001, n_nz=100
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 6.124x
shape_X=(10, 10000), shape_kernel=(5, 5), density=0.001, n_nz=100
conv walrus:    0.002s
conv broadcast: 0.000s.  Ratio: 6.619x
shape_X=(10, 10000), shape_kernel=(14, 14), density=0.001, n_nz=100
conv walrus:    0.003s
conv broadcast: 0.001s.  Ratio: 5.049x
shape_X=(10, 10000), shape_kernel=(25, 25), density=0.001, n_nz=100
conv walrus:    0.006s
conv broadcast: 0.001s.  Ratio: 4.505x
shape_X=(10, 10000), shape_kernel=(1, 1), density=0.01, n_nz=1000
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 9.642x
shape_X=(10, 10000), shape_kernel=(2, 2), density=0.01, n_nz=1000
conv walrus:    0.003s
conv broadcast: 0.000s.  Ratio: 6.238x
shape_X=(10, 10000), shape_kernel=(5, 5), density=0.01, n_nz=1000
conv walrus:    0.003s
conv broadcast: 0.001s.  Ratio: 3.237x
shape_X=(10, 10000), shape_kernel=(14, 14), density=0.01, n_nz=1000
conv walrus:    0.013s
conv broadcast: 0.006s.  Ratio: 2.138x
shape_X=(10, 10000), shape_kernel=(25, 25), density=0.01, n_nz=1000
conv walrus:    0.037s
conv broadcast: 0.014s.  Ratio: 2.744x
shape_X=(10, 10000), shape_kernel=(1, 1), density=0.1, n_nz=10000
conv walrus:    0.004s
conv broadcast: 0.000s.  Ratio: 8.730x
shape_X=(10, 10000), shape_kernel=(2, 2), density=0.1, n_nz=10000
conv walrus:    0.005s
conv broadcast: 0.002s.  Ratio: 2.027x
shape_X=(10, 10000), shape_kernel=(5, 5), density=0.1, n_nz=10000
conv walrus:    0.015s
conv broadcast: 0.010s.  Ratio: 1.460x
shape_X=(10, 10000), shape_kernel=(14, 14), density=0.1, n_nz=10000
conv walrus:    0.107s
conv broadcast: 0.077s.  Ratio: 1.393x
shape_X=(10, 10000), shape_kernel=(25, 25), density=0.1, n_nz=10000
conv walrus:    0.375s
conv broadcast: 0.177s.  Ratio: 2.118x
shape_X=(10, 10000), shape_kernel=(1, 1), density=1.0, n_nz=100000
conv walrus:    0.011s
conv broadcast: 0.002s.  Ratio: 5.791x
shape_X=(10, 10000), shape_kernel=(2, 2), density=1.0, n_nz=100000
conv walrus:    0.025s
conv broadcast: 0.032s.  Ratio: 0.776x
shape_X=(10, 10000), shape_kernel=(5, 5), density=1.0, n_nz=100000
conv walrus:    0.121s
conv broadcast: 0.121s.  Ratio: 0.999x
shape_X=(10, 10000), shape_kernel=(14, 14), density=1.0, n_nz=100000
conv walrus:    1.090s
conv broadcast: 0.890s.  Ratio: 1.225x
shape_X=(10, 10000), shape_kernel=(25, 25), density=1.0, n_nz=100000
conv walrus:    3.445s
conv broadcast: 2.139s.  Ratio: 1.611x
shape_X=(1000, 1000), shape_kernel=(1, 1), density=0.0001, n_nz=100
conv walrus:    0.011s
conv broadcast: 0.000s.  Ratio: 41.915x
shape_X=(1000, 1000), shape_kernel=(2, 2), density=0.0001, n_nz=100
conv walrus:    0.010s
conv broadcast: 0.000s.  Ratio: 31.169x
shape_X=(1000, 1000), shape_kernel=(5, 5), density=0.0001, n_nz=100
conv walrus:    0.010s
conv broadcast: 0.000s.  Ratio: 32.118x
shape_X=(1000, 1000), shape_kernel=(14, 14), density=0.0001, n_nz=100
conv walrus:    0.010s
conv broadcast: 0.001s.  Ratio: 15.328x
shape_X=(1000, 1000), shape_kernel=(25, 25), density=0.0001, n_nz=100
conv walrus:    0.013s
conv broadcast: 0.002s.  Ratio: 8.306x
shape_X=(1000, 1000), shape_kernel=(1, 1), density=0.001, n_nz=1000
conv walrus:    0.011s
conv broadcast: 0.000s.  Ratio: 34.471x
shape_X=(1000, 1000), shape_kernel=(2, 2), density=0.001, n_nz=1000
conv walrus:    0.010s
conv broadcast: 0.000s.  Ratio: 24.919x
shape_X=(1000, 1000), shape_kernel=(5, 5), density=0.001, n_nz=1000
conv walrus:    0.011s
conv broadcast: 0.001s.  Ratio: 11.530x
shape_X=(1000, 1000), shape_kernel=(14, 14), density=0.001, n_nz=1000
conv walrus:    0.022s
conv broadcast: 0.006s.  Ratio: 3.740x
shape_X=(1000, 1000), shape_kernel=(25, 25), density=0.001, n_nz=1000
conv walrus:    0.049s
conv broadcast: 0.021s.  Ratio: 2.328x
shape_X=(1000, 1000), shape_kernel=(1, 1), density=0.01, n_nz=10000
conv walrus:    0.013s
conv broadcast: 0.000s.  Ratio: 26.765x
shape_X=(1000, 1000), shape_kernel=(2, 2), density=0.01, n_nz=10000
conv walrus:    0.013s
conv broadcast: 0.002s.  Ratio: 8.030x
shape_X=(1000, 1000), shape_kernel=(5, 5), density=0.01, n_nz=10000
conv walrus:    0.026s
conv broadcast: 0.009s.  Ratio: 2.905x
shape_X=(1000, 1000), shape_kernel=(14, 14), density=0.01, n_nz=10000
conv walrus:    0.124s
conv broadcast: 0.081s.  Ratio: 1.532x
shape_X=(1000, 1000), shape_kernel=(25, 25), density=0.01, n_nz=10000
conv walrus:    0.378s
conv broadcast: 0.306s.  Ratio: 1.233x
shape_X=(1000, 1000), shape_kernel=(1, 1), density=0.1, n_nz=100000
conv walrus:    0.023s
conv broadcast: 0.002s.  Ratio: 12.371x
shape_X=(1000, 1000), shape_kernel=(2, 2), density=0.1, n_nz=100000
conv walrus:    0.044s
conv broadcast: 0.017s.  Ratio: 2.634x
shape_X=(1000, 1000), shape_kernel=(5, 5), density=0.1, n_nz=100000
conv walrus:    0.158s
conv broadcast: 0.106s.  Ratio: 1.487x
shape_X=(1000, 1000), shape_kernel=(14, 14), density=0.1, n_nz=100000
conv walrus:    1.118s
conv broadcast: 0.954s.  Ratio: 1.172x
shape_X=(1000, 1000), shape_kernel=(25, 25), density=0.1, n_nz=100000
conv walrus:    3.500s
conv broadcast: 3.218s.  Ratio: 1.088x
shape_X=(1000, 1000), shape_kernel=(1, 1), density=1.0, n_nz=1000000
conv walrus:    0.113s
conv broadcast: 0.020s.  Ratio: 5.732x
shape_X=(1000, 1000), shape_kernel=(2, 2), density=1.0, n_nz=1000000
conv walrus:    0.282s
conv broadcast: 0.260s.  Ratio: 1.084x
shape_X=(1000, 1000), shape_kernel=(5, 5), density=1.0, n_nz=1000000
conv walrus:    1.372s
conv broadcast: 1.200s.  Ratio: 1.143x
shape_X=(100, 100000), shape_kernel=(1, 1), density=0.0001, n_nz=1000
conv walrus:    0.122s
conv broadcast: 0.000s.  Ratio: 277.041x
shape_X=(100, 100000), shape_kernel=(2, 2), density=0.0001, n_nz=1000
conv walrus:    0.119s
conv broadcast: 0.000s.  Ratio: 238.922x
shape_X=(100, 100000), shape_kernel=(5, 5), density=0.0001, n_nz=1000
conv walrus:    0.124s
conv broadcast: 0.001s.  Ratio: 103.710x
shape_X=(100, 100000), shape_kernel=(14, 14), density=0.0001, n_nz=1000
conv walrus:    0.142s
conv broadcast: 0.007s.  Ratio: 21.328x
shape_X=(100, 100000), shape_kernel=(25, 25), density=0.0001, n_nz=1000
conv walrus:    0.183s
conv broadcast: 0.022s.  Ratio: 8.410x
shape_X=(100, 100000), shape_kernel=(1, 1), density=0.001, n_nz=10000
conv walrus:    0.133s
conv broadcast: 0.001s.  Ratio: 258.307x
shape_X=(100, 100000), shape_kernel=(2, 2), density=0.001, n_nz=10000
conv walrus:    0.131s
conv broadcast: 0.002s.  Ratio: 64.405x
shape_X=(100, 100000), shape_kernel=(5, 5), density=0.001, n_nz=10000
conv walrus:    0.154s
conv broadcast: 0.010s.  Ratio: 15.616x
shape_X=(100, 100000), shape_kernel=(14, 14), density=0.001, n_nz=10000
conv walrus:    0.311s
conv broadcast: 0.076s.  Ratio: 4.077x
shape_X=(100, 100000), shape_kernel=(25, 25), density=0.001, n_nz=10000
conv walrus:    0.677s
conv broadcast: 0.292s.  Ratio: 2.320x
shape_X=(100, 100000), shape_kernel=(1, 1), density=0.01, n_nz=100000
conv walrus:    0.168s
conv broadcast: 0.002s.  Ratio: 87.391x
shape_X=(100, 100000), shape_kernel=(2, 2), density=0.01, n_nz=100000
conv walrus:    0.201s
conv broadcast: 0.022s.  Ratio: 9.273x
shape_X=(100, 100000), shape_kernel=(5, 5), density=0.01, n_nz=100000
conv walrus:    0.387s
conv broadcast: 0.105s.  Ratio: 3.671x
shape_X=(100, 100000), shape_kernel=(14, 14), density=0.01, n_nz=100000
conv walrus:    1.709s
conv broadcast: 1.074s.  Ratio: 1.592x
shape_X=(100, 100000), shape_kernel=(25, 25), density=0.01, n_nz=100000
conv walrus:    4.505s
conv broadcast: 3.565s.  Ratio: 1.264x
shape_X=(100, 100000), shape_kernel=(1, 1), density=0.1, n_nz=1000000
conv walrus:    0.291s
conv broadcast: 0.019s.  Ratio: 15.127x
shape_X=(100, 100000), shape_kernel=(2, 2), density=0.1, n_nz=1000000
conv walrus:    0.560s
conv broadcast: 0.284s.  Ratio: 1.974x
shape_X=(100, 100000), shape_kernel=(5, 5), density=0.1, n_nz=1000000
conv walrus:    1.956s
conv broadcast: 1.457s.  Ratio: 1.342x
shape_X=(100000, 100), shape_kernel=(1, 1), density=0.0001, n_nz=1000
conv walrus:    0.120s
conv broadcast: 0.001s.  Ratio: 163.141x
shape_X=(100000, 100), shape_kernel=(2, 2), density=0.0001, n_nz=1000
conv walrus:    0.116s
conv broadcast: 0.001s.  Ratio: 116.033x
shape_X=(100000, 100), shape_kernel=(5, 5), density=0.0001, n_nz=1000
conv walrus:    0.116s
conv broadcast: 0.001s.  Ratio: 85.461x
shape_X=(100000, 100), shape_kernel=(14, 14), density=0.0001, n_nz=1000
conv walrus:    0.133s
conv broadcast: 0.004s.  Ratio: 34.155x
shape_X=(100000, 100), shape_kernel=(25, 25), density=0.0001, n_nz=1000
conv walrus:    0.166s
conv broadcast: 0.012s.  Ratio: 13.588x
shape_X=(100000, 100), shape_kernel=(1, 1), density=0.001, n_nz=10000
conv walrus:    0.130s
conv broadcast: 0.001s.  Ratio: 152.669x
shape_X=(100000, 100), shape_kernel=(2, 2), density=0.001, n_nz=10000
conv walrus:    0.129s
conv broadcast: 0.002s.  Ratio: 64.218x
shape_X=(100000, 100), shape_kernel=(5, 5), density=0.001, n_nz=10000
conv walrus:    0.144s
conv broadcast: 0.006s.  Ratio: 24.593x
shape_X=(100000, 100), shape_kernel=(14, 14), density=0.001, n_nz=10000
conv walrus:    0.268s
conv broadcast: 0.045s.  Ratio: 5.990x
shape_X=(100000, 100), shape_kernel=(25, 25), density=0.001, n_nz=10000
conv walrus:    0.579s
conv broadcast: 0.170s.  Ratio: 3.401x
shape_X=(100000, 100), shape_kernel=(1, 1), density=0.01, n_nz=100000
conv walrus:    0.172s
conv broadcast: 0.003s.  Ratio: 62.468x
shape_X=(100000, 100), shape_kernel=(2, 2), density=0.01, n_nz=100000
conv walrus:    0.200s
conv broadcast: 0.015s.  Ratio: 13.783x
shape_X=(100000, 100), shape_kernel=(5, 5), density=0.01, n_nz=100000
conv walrus:    0.350s
conv broadcast: 0.071s.  Ratio: 4.901x
shape_X=(100000, 100), shape_kernel=(14, 14), density=0.01, n_nz=100000
conv walrus:    1.463s
conv broadcast: 0.815s.  Ratio: 1.795x
shape_X=(100000, 100), shape_kernel=(25, 25), density=0.01, n_nz=100000
conv walrus:    3.682s
conv broadcast: 2.673s.  Ratio: 1.378x
shape_X=(100000, 100), shape_kernel=(1, 1), density=0.1, n_nz=1000000
conv walrus:    0.283s
conv broadcast: 0.021s.  Ratio: 13.396x
shape_X=(100000, 100), shape_kernel=(2, 2), density=0.1, n_nz=1000000
conv walrus:    0.534s
conv broadcast: 0.149s.  Ratio: 3.591x
shape_X=(100000, 100), shape_kernel=(5, 5), density=0.1, n_nz=1000000
conv walrus:    1.838s
conv broadcast: 1.139s.  Ratio: 1.614x
shape_X=(5000, 5000), shape_kernel=(1, 1), density=0.0001, n_nz=2500
conv walrus:    0.300s
conv broadcast: 0.001s.  Ratio: 584.760x
shape_X=(5000, 5000), shape_kernel=(2, 2), density=0.0001, n_nz=2500
conv walrus:    0.296s
conv broadcast: 0.001s.  Ratio: 392.064x
shape_X=(5000, 5000), shape_kernel=(5, 5), density=0.0001, n_nz=2500
conv walrus:    0.301s
conv broadcast: 0.002s.  Ratio: 164.463x
shape_X=(5000, 5000), shape_kernel=(14, 14), density=0.0001, n_nz=2500
conv walrus:    0.328s
conv broadcast: 0.013s.  Ratio: 25.233x
shape_X=(5000, 5000), shape_kernel=(25, 25), density=0.0001, n_nz=2500
conv walrus:    0.410s
conv broadcast: 0.046s.  Ratio: 9.011x
shape_X=(5000, 5000), shape_kernel=(1, 1), density=0.001, n_nz=25000
conv walrus:    0.327s
conv broadcast: 0.001s.  Ratio: 378.012x
shape_X=(5000, 5000), shape_kernel=(2, 2), density=0.001, n_nz=25000
conv walrus:    0.332s
conv broadcast: 0.003s.  Ratio: 97.535x
shape_X=(5000, 5000), shape_kernel=(5, 5), density=0.001, n_nz=25000
conv walrus:    0.378s
conv broadcast: 0.021s.  Ratio: 18.150x
shape_X=(5000, 5000), shape_kernel=(14, 14), density=0.001, n_nz=25000
conv walrus:    0.748s
conv broadcast: 0.189s.  Ratio: 3.963x
shape_X=(5000, 5000), shape_kernel=(25, 25), density=0.001, n_nz=25000
conv walrus:    1.609s
conv broadcast: 0.725s.  Ratio: 2.220x
shape_X=(5000, 5000), shape_kernel=(1, 1), density=0.01, n_nz=250000
conv walrus:    0.416s
conv broadcast: 0.005s.  Ratio: 82.450x
shape_X=(5000, 5000), shape_kernel=(2, 2), density=0.01, n_nz=250000
conv walrus:    0.486s
conv broadcast: 0.038s.  Ratio: 12.716x
shape_X=(5000, 5000), shape_kernel=(5, 5), density=0.01, n_nz=250000
conv walrus:    0.924s
conv broadcast: 0.261s.  Ratio: 3.545x
shape_X=(5000, 5000), shape_kernel=(14, 14), density=0.01, n_nz=250000
conv walrus:    3.807s
conv broadcast: 2.482s.  Ratio: 1.534x
shape_X=(5000, 5000), shape_kernel=(25, 25), density=0.01, n_nz=250000
conv walrus:    9.817s
conv broadcast: 8.290s.  Ratio: 1.184x
shape_X=(30000, 30000), shape_kernel=(1, 1), density=0.0001, n_nz=90000
conv walrus:    10.411s
conv broadcast: 0.002s.  Ratio: 4825.228x
shape_X=(30000, 30000), shape_kernel=(2, 2), density=0.0001, n_nz=90000
conv walrus:    10.558s
conv broadcast: 0.011s.  Ratio: 954.405x
shape_X=(30000, 30000), shape_kernel=(5, 5), density=0.0001, n_nz=90000
conv walrus:    10.672s
conv broadcast: 0.072s.  Ratio: 149.161x
shape_X=(30000, 30000), shape_kernel=(14, 14), density=0.0001, n_nz=90000
conv walrus:    12.045s
conv broadcast: 0.699s.  Ratio: 17.229x
shape_X=(30000, 30000), shape_kernel=(25, 25), density=0.0001, n_nz=90000
conv walrus:    15.582s
conv broadcast: 2.447s.  Ratio: 6.369x
shape_X=(30000, 30000), shape_kernel=(1, 1), density=0.001, n_nz=900000
conv walrus:    11.478s
conv broadcast: 0.018s.  Ratio: 650.634x
shape_X=(30000, 30000), shape_kernel=(2, 2), density=0.001, n_nz=900000
conv walrus:    11.851s
conv broadcast: 0.136s.  Ratio: 87.376x
shape_X=(30000, 30000), shape_kernel=(5, 5), density=0.001, n_nz=900000
conv walrus:    13.745s
conv broadcast: 0.982s.  Ratio: 14.004x
shape_X=(30000, 30000), shape_kernel=(14, 14), density=0.001, n_nz=900000
conv walrus:    27.559s
conv broadcast: 8.121s.  Ratio: 3.394x
shape_X=(30000, 30000), shape_kernel=(25, 25), density=0.001, n_nz=900000
conv walrus:    60.050s
conv broadcast: 28.091s.  Ratio: 2.138x

And here are some plots:
[five benchmark plot images attached in the original PR comment]

As is, I'll happily award the basic bounty for this work, and give partial credit on both the time-complexity and testing-suite portions: $50. The dings are due to the slower speed relative to the code I provided, and the testing suite would need to be revised.
Send me your venmo or preferred payment method via email.

Let me know if you want to make any revisions before I start making edits.
Well done, your approach was insightful and your code is solid.


GrandTheftWalrus commented Nov 25, 2024

Hmm, indeed the broadcasting does seem to be faster. It probably can't be beat because it's very simple and efficient. I'm outside right now so I can't take a gander for myself, but I think the minimal Toeplitz method might at least be better suited for batching. If I'm not mistaken, the broadcasting method might run out of memory with significant batch sizes (and possibly take longer). If this is true (and you would prefer speed over batching ability), would it qualify for the full reward? If not, I'll take zee $50
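A back-of-envelope sketch of that memory concern (hypothetical batch numbers, not measured from either implementation): the broadcast method above allocates two index rows and one data row, each of length nnz(X) * nnz(k), before filtering out-of-bounds entries, so its peak memory grows with the product of input and kernel nonzeros.

```python
def broadcast_mem_bytes(n_nz_x, n_nz_k, idx_bytes=8, data_bytes=8):
    """Rough peak-memory estimate for the broadcast approach: two index rows
    plus one data row, each of length nnz(X) * nnz(k), assuming int64
    indices and float64 data."""
    pairs = n_nz_x * n_nz_k
    return 2 * pairs * idx_bytes + pairs * data_bytes

# Hypothetical batch: 1000 stacked 1500x1500 images at 0.1% density,
# convolved with a fully dense 25x25 kernel.
n_nz_x = 1000 * int(1500 * 1500 * 0.001)  # 2,250,000 nonzeros
print(broadcast_mem_bytes(n_nz_x, 25 * 25) / 1e9)  # → 33.75 (GB)
```

Whether that actually dominates in practice depends on the index dtype and how the batching is laid out, but it suggests the intermediate arrays can outgrow RAM well before the inputs themselves do.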

@RichieHakim (Owner)

You are right that the Toeplitz approach scales sub-linearly with batch_size. Unfortunately, a quick comparison shows that the old Toeplitz implementation scales better than the new one with increasing batch_size. I see a crossover in runtime with these parameters:

shape_X = (75, 1500 * 1500)  ## (batch, height * width)
sparsity_X = 0.001
shape_kernel = (5, 5)

def fast_sparse_rand(m, n, density):
    rows = np.random.randint(0, m, int(m * n * density))
    cols = np.random.randint(0, n, int(m * n * density))
    vals = np.random.randn(int(m * n * density))
    return scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsr()

X = fast_sparse_rand(shape_X[0], shape_X[1], sparsity_X)

k = np.ones((shape_kernel[0], shape_kernel[1]))

Batches larger than 75 show the old implementation is faster.

I don't mean to shortchange you. Let's look again at the conditions:

  • $20 for any improvement allowing for parameter sharing in the double Toeplitz matrix: tiling and stitching, basic memory sharing within the sparse array.

Your implementation does not 'allow for parameter sharing' via 'tiling and stitching' or 'basic memory sharing'.

  • $50 for significant improvements in the time complexity for x_shape or k

Your implementation improves runtime over the code implemented in the repo under reasonable conditions, but it is slower than the code I provided for low / single batch sizes, and will likely be slower than the existing implementation for large batch sizes.

  • Additional $20-50 if the PR is comprehensive and I don't need to add anything to it (tests, documentation, etc.)

From what I have seen, I think you did a good job, but I have to make some edits.

My accounting is roughly max(0 × $20, 0.5 × $50) + (0.5 × $50) = $50. If you don't think I'm being fair, let me know. I think you did a great job.

@GrandTheftWalrus (Author)

@RichieHakim That sounds good to me 😎 I'll take the offer. I was mostly doing it for the challenge anyway rather than the reward.

@GrandTheftWalrus (Author)

@RichieHakim Do you know when you will update and close the Opire bounty?

@RichieHakim (Owner)

> Send me your venmo or preferred payment method via email.

...send me your venmo or preferred payment method via email. I need to make some changes to this PR before merging it and I haven't found the time to do that. Opire waits for the issue to be closed. The problem isn't really solved, so I left the issue open.

@OfforJohn

Hello, is this project still ongoing? I'd like to jump on, in the long term as well.


nabby27 commented Dec 14, 2024

Hi @RichieHakim, I'm the co-founder of Opire! I've been keeping a close eye on the bounty you put on Opire and the issue.

I was thinking that you might be more interested in proposing this as a Challenge instead of a reward for solving the issue. Challenges are a new feature of Opire that we launched recently. Maybe it could be an "ongoing task" type Challenge and offer the incentive every time someone improves the code performance or something like that.

Anyway, I love seeing the community collaborate as they do on this issue, congratulations, good job everyone!

@OfforJohn

Nicely said, sir. Thanks for this opportunity, from both sides.

4 participants