feat(tts): add Kokoro-82M TTS model implementation (#127) #165

Merged: m96-chan merged 3 commits into main from feature/issue-127-tts-kokoro on Dec 30, 2025

m96-chan commented Dec 30, 2025

Summary

  • Add pygpukit.tts module with Kokoro-82M StyleTTS2-based TTS model
  • Implement native LSTM CUDA kernel (Driver API compatible, no cudaMemcpy)
  • Add bidirectional LSTM support with forward/backward concatenation
  • Add SafeTensors/PTH weight loading with HuggingFace Hub support
  • Add G2P (grapheme-to-phoneme) tokenizer with misaki integration
  • Add WAV export/import utilities
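As an illustration of the WAV export/import item above, here is a minimal sketch using only the Python standard library. It is not the pygpukit.tts implementation: the helper names (`sine_wave`, `write_wav`, `read_wav`) are hypothetical, and 16-bit mono PCM is an assumption. Only the 24 kHz sample rate comes from this PR.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # output rate stated in this PR


def sine_wave(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Generate a test tone as a list of floats in [-1, 1]."""
    n = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def write_wav(target, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM (assumed format).

    `target` may be a file path or a binary file-like object.
    """
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # avoid integer overflow on loud samples
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(pcm))


def read_wav(source):
    """Read 16-bit mono PCM back into floats; returns (samples, sample_rate)."""
    with wave.open(source, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        rate = wf.getframerate()
    count = len(frames) // 2
    ints = struct.unpack("<%dh" % count, frames)
    return [v / 32767 for v in ints], rate


# Round trip through an in-memory buffer
buf = io.BytesIO()
write_wav(buf, sine_wave(440.0, 0.5))
buf.seek(0)
samples, rate = read_wav(buf)
```

Passing a path string instead of the `io.BytesIO` buffer writes the same data to disk.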

LSTM Kernel Implementation

Features

  • Kernel-based memory copies - No cudaMemcpy, Driver API compatible
  • Stream support - Uses internal::get_capture_stream() for CUDA Graph compatibility
  • Bidirectional support - Forward + backward passes with concatenation kernel

Kernels Added

  • lstm_cell_f32_kernel - Single LSTM cell computation (gates, cell state, hidden)
  • copy_f32_kernel - Simple DtoD copy (replaces cudaMemcpy)
  • lstm_copy_to_output_f32_kernel - Strided copy for sequence output
  • lstm_concat_bidirectional_f32_kernel - Concatenation for bidirectional output
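For readers who want a CPU reference to compare against, the computation that lstm_cell_f32_kernel is described as performing (gates, cell state, hidden) can be sketched in plain Python. This is a sketch under assumptions, not the kernel's code: the gate order i, f, g, o and the (4*H, I) / (4*H, H) weight layouts follow the PyTorch convention, which the PR does not confirm, and the helper names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_cell(x, h, c, W_ih, W_hh, b_ih, b_hh):
    """One LSTM time step for a single sample.

    Assumed layout (PyTorch convention, not confirmed by the PR):
    W_ih is (4*H, I), W_hh is (4*H, H), biases are (4*H,),
    with the gate blocks stacked in the order i, f, g, o.
    """
    H = len(h)
    from_input = matvec(W_ih, x)
    from_hidden = matvec(W_hh, h)
    pre = [from_input[k] + from_hidden[k] + b_ih[k] + b_hh[k] for k in range(4 * H)]

    i = [sigmoid(p) for p in pre[0:H]]            # input gate
    f = [sigmoid(p) for p in pre[H:2 * H]]        # forget gate
    g = [math.tanh(p) for p in pre[2 * H:3 * H]]  # candidate cell values
    o = [sigmoid(p) for p in pre[3 * H:4 * H]]    # output gate

    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(H)]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(H)]
    return h_new, c_new
```

Running this cell over each time step and comparing against the GPU output is one way to extend the non-zero output check in the test plan to a numerical one.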

Python API

import pygpukit as pk

# Unidirectional LSTM
output, h_n, c_n = pk.lstm_forward(x, W_ih, W_hh, b_ih, b_hh)

# Bidirectional LSTM
output, h_n, c_n = pk.lstm_bidirectional(
    x,
    W_ih_fwd, W_hh_fwd, b_ih_fwd, b_hh_fwd,
    W_ih_bwd, W_hh_bwd, b_ih_bwd, b_hh_bwd,
)
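The forward/backward concatenation behind lstm_bidirectional can be illustrated without any GPU code. The sketch below uses a generic `step(x_t, h_prev)` function in place of a real LSTM cell and assumes the common convention that the output at time t is the forward state at t followed by the backward state at t; whether the native concatenation kernel uses exactly this layout is not stated in the PR. All names here are hypothetical.

```python
def run_direction(xs, step, h0):
    """Apply step(x_t, h_prev) -> h_t over the sequence and collect every h_t."""
    outputs = []
    h = h0
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs


def bidirectional(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """Run both directions and concatenate the per-time-step hidden states."""
    fwd = run_direction(xs, step_fwd, h0_fwd)
    # The backward pass consumes the reversed sequence; its outputs are
    # reversed again so that index t lines up with the forward pass.
    bwd = run_direction(xs[::-1], step_bwd, h0_bwd)[::-1]
    # List concatenation doubles the feature dimension: hidden -> 2 * hidden.
    return [f + b for f, b in zip(fwd, bwd)]


def running_sum(x, h):
    """Toy recurrent step with a 1-element hidden state."""
    return [h[0] + x]


out = bidirectional([1, 2, 3], running_sum, running_sum, [0], [0])
print(out)  # [[1, 6], [3, 5], [6, 3]]
```

Each output row has twice the hidden size of a single direction, which is the shape property the bidirectional test would need to verify.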

Performance

RTX 5090, batch=8, seq_len=100, input=768, hidden=512:

  • 39.96 ms/forward
  • ~20,000 tokens/sec
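The throughput figure follows directly from the batch size, sequence length, and latency above. A small sketch of the arithmetic, assuming "tokens" means batch × seq_len time steps per forward call (the function name is hypothetical):

```python
def tokens_per_second(batch, seq_len, ms_per_forward):
    """Time steps processed per second for one forward call."""
    return batch * seq_len / (ms_per_forward / 1000.0)


rate = tokens_per_second(batch=8, seq_len=100, ms_per_forward=39.96)
print(round(rate))  # 20020
```

The same numbers give 39.96 ms / 100 steps, or roughly 0.4 ms per time step for a batch of 8.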

TTS Architecture

  • PLBERT: 12-layer BERT encoder (768 hidden, 12 heads)
  • StyleEncoder: 128-dim style conditioning
  • Decoder: 3-layer conv decoder (512 hidden)
  • ISTFTNet: 60x upsample vocoder (24kHz output)

Files Added/Modified

  • native/ops/nn/recurrent/lstm_kernels.cuh - LSTM CUDA kernels
  • native/ops/nn/recurrent/lstm.inl - LSTM dispatch
  • native/bindings/nn/recurrent.cpp - Python bindings
  • src/pygpukit/ops/nn.py - lstm_forward, lstm_bidirectional
  • src/pygpukit/tts/kokoro/ - Kokoro model implementation
  • examples/tts.py - LSTM tests and benchmarks

Test plan

  • Unidirectional LSTM test passes
  • Bidirectional LSTM test passes
  • Output shape verification
  • Non-zero output validation
  • Ruff lint passes
  • Mypy type check passes

Closes #127

🤖 Generated with Claude Code

m96-chan and others added 2 commits December 30, 2025 22:24
Add text-to-speech module with Kokoro-82M StyleTTS2-based model:

- KokoroConfig: Model configuration dataclass with PLBERT/ISTFTNet params
- KokoroTokenizer: G2P conversion with misaki integration
- Neural network layers: Conv1d, BERT, StyleEncoder, Decoder, ISTFTNet
- Model loader: SafeTensors/PTH weight loading with HuggingFace Hub support
- KokoroModel: High-level API for text-to-speech synthesis
- Audio utilities: WAV export/import, resampling, concatenation

Architecture: PLBERT (12L, 768H) -> StyleEncoder (128D) -> Decoder (3L, 512H) -> ISTFTNet (60x upsample)
Output: 24kHz audio

Note: Forward pass is placeholder (sine wave output). Full inference requires
matching actual Kokoro weight structure from HuggingFace hexgrad/Kokoro-82M.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Implement LSTM forward kernel with kernel-based memory copies
  (Driver API compatible, no cudaMemcpy)
- Add bidirectional LSTM support with concatenation kernel
- Add Python API: lstm_forward(), lstm_bidirectional()
- Update TTS layers to use native LSTM kernel
- Add examples/tts.py with LSTM tests and benchmarks

Performance (RTX 5090, batch=8, seq=100, hidden=512):
- 39.96 ms/forward, ~20k tokens/sec

Closes #127

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan force-pushed the feature/issue-127-tts-kokoro branch from f482514 to 53bb16c on December 30, 2025 14:04

- Fix misaki G2P to properly iterate over the generator of (grapheme, phoneme) tuples
- Handle Unicode IPA characters in console output (Windows cp932)
- Remove TODO placeholder, enable full TTS synthesis in examples/tts.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
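
The two fixes in the last commit, iterating over (grapheme, phoneme) tuples and printing IPA text on a cp932 console, can be sketched as follows. This does not call misaki: `fake_g2p` is a hypothetical stand-in that only mimics the tuple-yielding behavior described in the commit message, and `console_safe` is one possible way to avoid a UnicodeEncodeError, not necessarily the approach taken in examples/tts.py.

```python
def fake_g2p(text):
    """Hypothetical stand-in for a G2P generator yielding (grapheme, phoneme) tuples."""
    table = {"hello": "həlˈoʊ", "world": "wˈɜːld"}
    for word in text.lower().split():
        yield word, table.get(word, "")


def phonemes_from(pairs):
    """Consume the generator and join only the phoneme half of each tuple."""
    return " ".join(phoneme for _grapheme, phoneme in pairs if phoneme)


def console_safe(text, encoding="cp932"):
    """Escape characters the console encoding cannot represent.

    IPA symbols are not in cp932, so printing them directly can raise
    UnicodeEncodeError on such a console; backslashreplace turns them
    into escapes like \\u0259 instead.
    """
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


phonemes = phonemes_from(fake_g2p("Hello world"))
print(console_safe(phonemes))
```

Treating the generator's items as plain strings instead of unpacking the tuples is the kind of mistake the first bullet of the commit describes.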
m96-chan merged commit a009d6c into main on Dec 30, 2025
13 checks passed
Development

Successfully merging this pull request may close these issues.

feat(tts): Basic TTS model loading and inference
