Description
All neural network layers in src/pygpukit/tts/kokoro/layers.py use CPU fallback implementations via numpy instead of native GPU kernels.
Affected Layers
| Layer | Issue |
| --- | --- |
| Conv1d | im2col + numpy matmul |
| ConvTranspose1d | Scatter-add via Python loop |
| BertSelfAttention | numpy attention (no FlashAttention) |
| ResBlock1d | Uses Conv1d (CPU) |
| ISTFTNet | Overlap-add via Python loop |
| leaky_relu | numpy where() |
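The ConvTranspose1d and ISTFTNet entries describe roughly the same access pattern: many small windows are accumulated into overlapping positions of one output buffer. The sketch below is not taken from layers.py; `overlap_add` and its arguments are hypothetical names used only to illustrate the pattern without a Python loop, assuming frames of equal length and a fixed hop.

```python
import numpy as np


def overlap_add(frames: np.ndarray, hop: int) -> np.ndarray:
    """Accumulate frames of shape [n_frames, frame_len] into one signal.

    Hypothetical helper for illustration; not part of pygpukit.
    """
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len, dtype=frames.dtype)
    # idx[t, n] is the output sample that element n of frame t lands on.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    # np.add.at is unbuffered, so repeated indices accumulate instead of overwriting.
    np.add.at(out, idx, frames)
    return out


signal = overlap_add(np.ones((3, 4), dtype=np.float32), hop=2)
```

On a GPU the same index map would typically become either an atomic add per element or a gather in which each output sample sums the frames that cover it; which one suits this codebase is an open design choice.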
Example: Conv1d Implementation
    def __call__(self, x: GPUArray) -> GPUArray:
        # Convert to numpy for im2col (can be optimized later)
        x_np = x.to_numpy()
        w_np = self.weight.to_numpy()
        # im2col: extract patches
        for i in range(self.kernel_size):
            for j in range(out_length):
                col[:, :, i, j] = x_np[:, :, j_strided + i_dilated]
        # Matmul
        out_np = np.einsum("bkl,ok->bol", col, w_reshaped)
        return from_numpy(out_np.astype(np.float32))
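For validating a future native kernel, a standalone NumPy reference for the same im2col + matmul computation may be useful. The following is a minimal sketch, not the code in layers.py: the function name `conv1d_im2col`, its `stride`, `padding`, and `dilation` parameters, and the assumed layouts (`x` as [batch, in_channels, length], `weight` as [out_channels, in_channels, kernel_size]) are assumptions, and bias and groups are left out.

```python
import numpy as np


def conv1d_im2col(x, weight, stride=1, padding=0, dilation=1):
    """Reference 1D convolution via im2col (hypothetical helper, no bias, no groups)."""
    batch, in_ch, length = x.shape
    out_ch, w_in_ch, kernel = weight.shape
    assert in_ch == w_in_ch, "channel mismatch"
    x_pad = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_length = (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
    # idx[i, j] is the padded input position read by kernel tap i for output j.
    idx = dilation * np.arange(kernel)[:, None] + stride * np.arange(out_length)[None, :]
    # Fancy indexing gathers all patches at once: [batch, in_ch, kernel, out_length].
    col = x_pad[:, :, idx].reshape(batch, in_ch * kernel, out_length)
    w_reshaped = weight.reshape(out_ch, in_ch * kernel)
    # One matmul per batch element: [out_ch, k] @ [k, out_length].
    out = np.einsum("bkl,ok->bol", col, w_reshaped)
    return out.astype(np.float32)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 16)).astype(np.float32)
w = rng.standard_normal((4, 3, 5)).astype(np.float32)
y = conv1d_im2col(x, w, stride=2, padding=2)
```

The index table replaces the nested Python loops with a single gather, which is also the shape a GPU im2col kernel would take: one thread per (tap, output position) pair.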
Impact
- Latency: Every layer incurs GPU->CPU->GPU transfer overhead
- Throughput: Python loops are orders of magnitude slower than CUDA
- Memory: Unnecessary copies between GPU and CPU memory
Note
This is a performance issue, not a functionality issue. The layers work correctly, but slowly.
However, see #179: the main TTS issue is that model._forward_simple() doesn't call these layers at all (it generates a sine wave instead).
Required Work
- Implement native CUDA conv1d kernel
- Implement native CUDA transpose conv1d kernel
- Reuse the existing SDPA kernel for attention (a causal variant already exists; a bidirectional, unmasked variant is still needed)
- Implement LeakyReLU kernel
- Implement ISTFT kernel (or use cuFFT)
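Two of the items above are simple enough to pin down with NumPy references that new kernels could be checked against. This is a sketch under assumptions: `sdpa_reference` and `leaky_relu_reference` are hypothetical names, the [heads, seq, head_dim] layout is assumed, the `negative_slope` default is a placeholder rather than the value the model uses, and padding masks are not handled.

```python
import numpy as np


def sdpa_reference(q, k, v, causal=False):
    """Scaled dot-product attention; q, k, v are [heads, seq, head_dim]."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum("hqd,hkd->hqk", q, k) * scale
    if causal:
        seq_q, seq_k = scores.shape[-2:]
        # True above the diagonal: each query may not attend to later keys.
        future = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,hkd->hqd", weights, v)


def leaky_relu_reference(x, negative_slope=0.01):
    """Elementwise LeakyReLU; the default slope here is only a placeholder."""
    return np.where(x >= 0, x, negative_slope * x)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 5, 8)) for _ in range(3))
bidirectional = sdpa_reference(q, k, v, causal=False)
causal = sdpa_reference(q, k, v, causal=True)
```

The only difference between the two attention modes is the mask, so a bidirectional path can share the rest of the causal kernel's logic.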
Priority
Low - First fix #179 (make TTS functional), then optimize performance.
Related