Skip to content

humair-m/enc

Repository files navigation

Bitwav Encoder

A high-performance, GPU-optimized audio encoder package for Bitwav TTS. This package handles the conversion of raw audio waveforms into discrete tokens (via FSQ) and global style embeddings (SSL-based).

🚀 Features

  • Multiple Encoder Modes:
    • StandaloneEncoder: Base implementation for straightforward inference.
    • OptimizedStandaloneEncoder: Leverages torch.compile and mixed precision (FP16/BF16) for maximum throughput.
    • PipelinedEncoder: Optimized for overlapping chunk inference.
    • LongAudioEncoder: Handles extremely long audio (> 100s) using smart chunking and overlap strategies.
  • WavLM Integration: Built-in support for WavLM Base Plus as the SSL feature extractor.
  • Automatic Weights Management: Transparently handles local weights or automatic downloads from Hugging Face (humair025/enc).
  • Precision Verified: Negligible loss conversion (FP32 -> FP16) with 98.3%+ token parity.

📦 Installation

Option 1: Direct Pip Install (Recommended)

You can install the package directly from GitHub:

pip install git+https://github.com/humair-m/enc.git

Option 2: Clone and Install

git clone https://github.com/humair-m/enc.git
cd enc
pip install -e .

🛠 Usage

1. Audio Inputs

The encode method is flexible and accepts multiple input formats:

  • File Path: Just pass the string path to your .wav file.
  • Torch Tensor: A tensor of shape (channels, samples) or (samples,).
  • Numpy Array: A numpy array of the audio waveform.

2. Basic Example

from enc import create_standalone_encoder
import torchaudio

# 1. Initialize (Downloads weights automatically if needed)
encoder = create_standalone_encoder(device="cuda")

# 2. Encode from File Path
tokens, style = encoder.encode("path/to/audio.wav")

# 3. Encode from Tensor (e.g., already loaded)
waveform, sr = torchaudio.load("audio.wav")
tokens, style = encoder.encode(waveform, sr=sr)

High-Performance (Optimized)

from enc import create_optimized_encoder

encoder = create_optimized_encoder(
    device="cuda",
    compile_mode="max-autotune"
)

# First run will trigger torch.compile warmup
tokens, style_emb = encoder.encode(waveform, sr=24000)

Batch Processing (Pipelined)

For processing multiple files efficiently, use the PipelinedEncoder. It overlaps WavLM feature extraction with encoder processing using CUDA streams.

from enc import create_pipelined_encoder

encoder = create_pipelined_encoder(device="cuda")

# Supports list of file paths or tensors
files = ["test1.wav", "test2.wav", "test3.wav"]
results = encoder.encode_batch_pipelined(files, sr=24000, batch_size=4)

for tokens, style_emb in results:
    print(tokens.shape)

Long Audio Processing

from enc import create_long_audio_encoder

encoder = create_long_audio_encoder(device="cuda")
# Automatically handles chunking and overlap for long files
tokens, style_emb = encoder.encode("extremely_long_audio.wav")

📊 Precision & Performance

Detailed precision analysis is available in REPORT.md.

Metric Baseline (FP32) Optimized (FP16)
Model Size ~380MB ~190MB
Style Similarity 1.0 0.99999
Token Match 100% 98.35%

🔗 Related Repositories

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages