
linear spectrogram encoder/decoder + image pipeline#4

Open
sankdoy wants to merge 2 commits into Bebra777228:main from sankdoy:main


@sankdoy sankdoy commented Jan 15, 2026

Screenshot 2026-01-15 at 14 06 28

Switch to a linear STFT for encoding/decoding, auto-fit text height, add image normalization/inversion with a preview, add noise/flutter controls, fix UI scrolling, prevent text clipping, and improve spectrogram color rendering.

Summary by CodeRabbit

  • New Features
    • Added customizable flutter and noise effects for text-to-audio conversion
    • Introduced image-to-audio encoding capability
    • Added audio-to-spectrogram decoding with preview visualization
    • Enhanced spectrogram rendering with improved visual quality
    • Expanded UI with new parameter controls for text rendering customization


switch to linear STFT for encoding/decoding, auto-fit text height, add image normalisation/inversion + preview, add noise/flutter controls, fix UI scroll and prevent text clipping, improve spectrogram colour rendering

coderabbitai bot commented Jan 15, 2026

Walkthrough

The txt2spec.py module is significantly enhanced with font discovery, text measurement utilities, and image transformation helpers. The text-to-spectrogram-image pipeline gains configurable flutter and noise effects. Audio processing is extended with hop_length parameter and linear spectrogram generation. New end-to-end functions orchestrate image generation and audio synthesis, while the Gradio UI expands to expose rendering parameters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Font handling and text measurement (txt2spec.py) | Introduces font discovery via FONT_PATHS and find_font_path, replaces single DEFAULT_FONT_PATH. Adds load_font with fallback support, measure_text for text dimensions, and fit_font_size to adapt font size to target height. |
| Image transformation utilities (txt2spec.py) | Adds apply_flutter for flutter effects, add_text_noise for noise injection, spectrogram_to_rgb for colormap conversion, auto_invert_image for contrast handling, normalize_image for scaling, and prepare_image_spectrogram for dimension adjustment with minimum width enforcement. |
| Spectrogram rendering pipeline (txt2spec.py) | Reworks text_to_spectrogram_image with safe margins, per-character random flutter, and proper centering. Extends spectrogram_image_to_audio with hop_length parameter and dual return (audio, n_fft). Adds save_linear_spectrogram for PNG preview generation. |
| End-to-end encoding/decoding (txt2spec.py) | Introduces create_audio_with_spectrogram for orchestrated rendering with flutter/noise options, create_audio_from_image for image-to-audio encoding, and gradio interface functions (gradio_interface_fn, gradio_image_to_audio_fn, gradio_decode_fn) exposing new parameters. |
| Gradio UI expansion (txt2spec.py) | Updates UI to expose flutter_px, noise_amount, and noise_strength sliders in Text→Audio tab. Adapts Image→Audio and Audio→Spectrogram flows to new backend utilities with CSS/theme refinements. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Gradio UI
    participant Text Pipeline
    participant Image Proc
    participant Audio Synth
    
    User->>Gradio UI: Input text + flutter/noise params
    Gradio UI->>Text Pipeline: gradio_interface_fn()
    Text Pipeline->>Text Pipeline: text_to_spectrogram_image()
    Text Pipeline->>Text Pipeline: fit_font_size() → measure_text()
    Text Pipeline->>Image Proc: load_font(), apply_flutter(), add_text_noise()
    Image Proc->>Image Proc: normalize_image(), prepare_image_spectrogram()
    Image Proc->>Text Pipeline: spectrogram image
    Text Pipeline->>Audio Synth: spectrogram_image_to_audio()
    Audio Synth->>Audio Synth: librosa conversion
    Audio Synth->>Gradio UI: audio + spectrogram preview
    Gradio UI->>User: Display audio & visualization
sequenceDiagram
    actor User
    participant Gradio UI
    participant Image Pipeline
    participant Audio Synth
    
    User->>Gradio UI: Upload image
    Gradio UI->>Image Pipeline: gradio_image_to_audio_fn()
    Image Pipeline->>Image Pipeline: prepare_image_spectrogram()
    Image Pipeline->>Image Pipeline: normalize_image() → spectrogram prep
    Image Pipeline->>Audio Synth: spectrogram_image_to_audio()
    Audio Synth->>Audio Synth: Convert to audio
    Audio Synth->>Gradio UI: audio output
    Gradio UI->>User: Play audio

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Text flutters and dances with noise so fine,
Spectrograms bloom in colors divine,
From images born, new audio flows,
A rabbit's code where creativity grows! 🎵

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: introducing a linear spectrogram encoder/decoder and an image processing pipeline, which are the primary focuses of the changeset. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@txt2spec.py`:
- Around line 153-154: Replace the deprecated call cm.get_cmap(cmap_name) with
the supported API; using the existing import change it to
cm.colormaps.get_cmap(cmap_name) so the code becomes cmap =
cm.colormaps.get_cmap(cmap_name) before computing rgb from norm, leaving the
subsequent rgb = (cmap(norm)[:, :, :3] * 255).astype(np.uint8) unchanged.
🧹 Nitpick comments (8)
txt2spec.py (8)

11-27: Cross-platform font discovery is limited.

The font path list is heavily macOS-centric. Consider adding common Windows and additional Linux paths for broader compatibility:

 FONT_PATHS = [
     os.path.expanduser("~/Library/Fonts/Druk Wide.otf"),
     # ... existing macOS paths ...
     "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
+    "/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf",
+    "/usr/share/fonts/TTF/DejaVuSans-Bold.ttf",
+    # Windows paths
+    os.path.expandvars(r"%WINDIR%\Fonts\arial.ttf"),
+    os.path.expandvars(r"%WINDIR%\Fonts\arialbd.ttf"),
 ]
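
The PR's find_font_path is not shown in this review, so the following is only a minimal sketch of how a lookup over such a candidate list could behave. The function body and the candidate paths in the usage note are assumptions for illustration.

```python
import os
from typing import Iterable, Optional


def find_font_path(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists as a regular file, else None."""
    for path in candidates:
        # Variables that are unknown on the current platform (for example
        # %WINDIR% on Linux) are left untouched by expandvars, so those
        # entries fail the isfile check and are skipped.
        expanded = os.path.expanduser(os.path.expandvars(path))
        if os.path.isfile(expanded):
            return expanded
    return None
```

With a list such as `["~/Library/Fonts/Druk Wide.otf", "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"]`, the call returns whichever entry exists first, or None so that the caller can fall back to a default font.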

53-59: Broad exception catch is acceptable but could be narrowed.

The catch-all Exception handles font loading failures gracefully. For better diagnostics, consider catching OSError, which truetype() raises when a font file cannot be read (IOError is an alias of OSError in Python 3, so listing both is redundant).


183-183: Unnecessary int() call.

In Python 3, round() with a single argument already returns an int, making the outer int() call redundant.

-    new_width = max(1, int(round(width * scale)))
+    new_width = max(1, round(width * scale))
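
A quick check of the claim, using made-up values for width and scale. It also shows that round() rounds halves to the nearest even integer, which is worth knowing when the result is a pixel count.

```python
width, scale = 101, 0.5  # illustrative values, not taken from the PR

# With one argument, round() on a float returns an int.
new_width = max(1, round(width * scale))

print(type(round(width * scale)).__name__)  # int
print(new_width)                            # 50 (50.5 rounds to the even neighbour)

# With an explicit ndigits argument the result stays a float.
print(type(round(2.7, 0)).__name__)         # float
```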

231-238: Consider adding strict=True to zip().

Since text_widths is derived from text in fit_font_size, they should always have matching lengths. Adding strict=True (Python 3.10+) would catch any future bugs if this invariant breaks:

-    for char, char_width in zip(text, text_widths):
+    for char, char_width in zip(text, text_widths, strict=True):

245-250: Unused sr parameter is misleading.

The sr parameter is declared but never used in the function body. librosa.griffinlim doesn't require sample rate. This makes the API misleading since callers might expect sr to affect the reconstruction.

Consider either:

  1. Removing sr from the signature and documenting that sample rate is only needed when writing the audio file
  2. Adding a docstring clarifying that sr is unused but kept for API consistency
-def spectrogram_image_to_audio(image, sr=DEFAULT_SAMPLE_RATE, hop_length=DEFAULT_HOP_LENGTH):
+def spectrogram_image_to_audio(image, sr=DEFAULT_SAMPLE_RATE, hop_length=DEFAULT_HOP_LENGTH):  # noqa: ARG001
+    """Convert spectrogram image to audio. Note: sr is unused by Griffin-Lim but kept for API consistency."""

252-260: Unused sr parameter and potential temp file accumulation.

  1. The sr parameter is declared but unused (same issue as spectrogram_image_to_audio).

  2. Temporary files created with delete=False will accumulate over multiple runs. Consider implementing cleanup or documenting that callers are responsible for deletion.


286-292: Temp files accumulate across multiple pipeline runs.

create_audio_with_spectrogram, create_audio_from_image, and save_linear_spectrogram all create temp files with delete=False. In a long-running Gradio session, these files could fill up temp storage.

Consider either:

  • Implementing periodic cleanup
  • Using a session-scoped temp directory that gets cleaned on app restart
  • Documenting that users should periodically clear temp files
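
One way the session-scoped directory option could look, as a sketch only. SESSION_TMP_DIR, new_temp_path, and the txt2spec_ prefix are hypothetical names, not part of the PR.

```python
import atexit
import os
import shutil
import tempfile

# One directory per process; the prefix is an arbitrary choice.
SESSION_TMP_DIR = tempfile.mkdtemp(prefix="txt2spec_")

# Remove the whole directory when the interpreter exits normally.
atexit.register(shutil.rmtree, SESSION_TMP_DIR, ignore_errors=True)


def new_temp_path(suffix: str) -> str:
    """Reserve a unique file path inside the session directory."""
    fd, path = tempfile.mkstemp(suffix=suffix, dir=SESSION_TMP_DIR)
    os.close(fd)  # callers reopen the path with their own writer
    return path
```

Callers would then request `new_temp_path(".wav")` or `new_temp_path(".png")` instead of creating a NamedTemporaryFile with delete=False. Note that atexit handlers do not run if the process is killed, so this reduces accumulation rather than guaranteeing cleanup.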

360-361: Hidden sliders could be simplified.

These sliders are hidden (visible=False) but still passed through the interface. Consider removing them from the UI and passing the defaults directly in gradio_interface_fn if they're not meant to be user-configurable:

def gradio_interface_fn(text, max_font_size, margin, letter_spacing, flutter_px, noise_amount, noise_strength):
    return create_audio_with_spectrogram(
        text, DEFAULT_BASE_WIDTH, DEFAULT_SPEC_HEIGHT, max_font_size, ...
    )

This simplifies the UI component graph.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac8cd32 and cd151ac.

📒 Files selected for processing (1)
  • txt2spec.py
🧰 Additional context used
🪛 Ruff (0.14.11)
txt2spec.py

57-57: Do not catch blind exception: Exception

(BLE001)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Value being cast to int is already an integer

Remove unnecessary int call

(RUF046)


231-231: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


245-245: Unused function argument: sr

(ARG001)


252-252: Unused function argument: sr

(ARG001)


300-300: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
txt2spec.py (7)

61-74: LGTM!

The text measurement approach correctly handles custom letter spacing by measuring characters individually. The empty text fallback to a space is a reasonable edge-case handler.


76-101: LGTM!

The binary search approach for font size fitting is efficient and handles edge cases well (no font path, nothing fits).
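
For readers who have not opened the diff, the idea can be sketched as follows. This is not the PR's fit_font_size: measure_height is a hypothetical callable standing in for rendering the text at a given size, and it is assumed to be non-decreasing in the size.

```python
from typing import Callable, Optional


def fit_size(measure_height: Callable[[int], int], target_height: int,
             min_size: int = 1, max_size: int = 512) -> Optional[int]:
    """Largest size in [min_size, max_size] whose height fits, else None."""
    best = None
    lo, hi = min_size, max_size
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure_height(mid) <= target_height:
            best = mid      # mid fits, so try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too tall, so try something smaller
    return best


# Toy height model: 2 px per point plus 3 px of padding.
print(fit_size(lambda size: 2 * size + 3, 100))  # 48
print(fit_size(lambda size: 2 * size + 3, 2))    # None (nothing fits)
```

The search needs about log2(max_size) measurements instead of one per candidate size, which matters when each measurement renders text.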


103-124: LGTM!

The flutter implementation with smoothed random offsets creates a natural wavy effect. The bounds checking correctly handles edge cases where shifts exceed image height.
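
A pure-Python sketch of the general technique, assuming a per-column vertical shift with zero fill. The PR's apply_flutter may differ in detail. The names flutter_columns and smooth, the radius parameter, and the list-of-rows image format are all illustrative.

```python
import random
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values


def smooth(values: List[float], radius: int) -> List[float]:
    """Moving average with the window clipped at the sequence edges."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def flutter_columns(image: Image, flutter_px: int, radius: int = 4,
                    seed: int = 0) -> Image:
    """Shift each column vertically by a smoothed random offset."""
    if flutter_px <= 0 or not image:
        return [row[:] for row in image]
    height, width = len(image), len(image[0])
    rng = random.Random(seed)
    raw = [rng.uniform(-flutter_px, flutter_px) for _ in range(width)]
    shifts = [round(s) for s in smooth(raw, radius)]
    out = [[0] * width for _ in range(height)]
    for x, shift in enumerate(shifts):
        for y in range(height):
            src = y - shift
            if 0 <= src < height:  # rows shifted in from outside stay zero
                out[y][x] = image[src][x]
    return out
```

Smoothing the raw offsets keeps neighbouring columns close to each other, which is what gives a wavy line rather than a jagged one.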


126-137: LGTM!

The noise application correctly targets only text pixels (non-zero) and properly combines Gaussian noise with random dropout. The early return condition correctly requires both parameters to be disabled.
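
The behaviour described above can be sketched in plain Python. The parameter meanings are assumptions: dropout_prob is taken to be a per-pixel dropout probability and sigma a Gaussian standard deviation in pixel units. How the PR's noise_amount and noise_strength map onto these is not stated in the review.

```python
import random
from typing import List


def noisy_text_pixels(image: List[List[int]], dropout_prob: float,
                      sigma: float, seed: int = 0) -> List[List[int]]:
    """Apply Gaussian noise and random dropout to non-zero pixels only."""
    if dropout_prob <= 0 and sigma <= 0:
        return [row[:] for row in image]  # both effects disabled
    rng = random.Random(seed)
    out = []
    for row in image:
        new_row = []
        for value in row:
            if value == 0:
                new_row.append(0)  # background stays untouched
            elif rng.random() < dropout_prob:
                new_row.append(0)  # dropout removes the text pixel
            else:
                noisy = value + rng.gauss(0.0, sigma) if sigma > 0 else value
                new_row.append(min(255, max(0, round(noisy))))
        out.append(new_row)
    return out
```

The early return fires only when both parameters are disabled, so dropout alone or Gaussian noise alone still takes effect.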


157-172: LGTM!

The auto-inversion heuristic and percentile-based normalization are robust approaches for handling varied input images. The 1st-99th percentile clipping effectively handles outliers.
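
A stdlib-only sketch of both ideas on a flat list of pixel values. The percentile helper uses a simple index lookup without interpolation. The inversion rule (invert when the mean is above mid-gray) is an assumed heuristic, since the review does not say which one the PR uses.

```python
from typing import List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Index-based percentile of a sorted, non-empty list (no interpolation)."""
    index = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[index]


def normalize(values: List[float], low_pct: float = 1,
              high_pct: float = 99) -> List[float]:
    """Clip to the [low_pct, high_pct] percentile range and rescale to [0, 1]."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, low_pct), percentile(ordered, high_pct)
    if hi <= lo:
        return [0.0 for _ in values]  # flat image: nothing to stretch
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


def auto_invert(values: List[float], max_value: float = 255) -> List[float]:
    """Invert when the image is mostly bright (assumed heuristic)."""
    if sum(values) / len(values) > max_value / 2:
        return [max_value - v for v in values]
    return list(values)
```

Clipping before rescaling means a single extreme pixel cannot compress the rest of the image into a narrow band, which plain min/max scaling would do.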


313-341: LGTM!

Clean separation between Gradio interface wrappers and core logic. The thin wrapper pattern makes the core functions testable independently.


418-418: share=True creates a public URL.

Using share=True creates a publicly accessible URL through Gradio's sharing service. Ensure this is intentional, especially if deployed in production. For local-only use:

-txt2spec.launch(share=True)
+txt2spec.launch(share=False)  # or make configurable via environment variable
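
If the environment-variable route is preferred, a small helper along these lines could gate the flag. The variable name TXT2SPEC_SHARE and the helper share_enabled are hypothetical and not part of the PR.

```python
import os
from typing import Mapping, Optional


def share_enabled(env: Optional[Mapping[str, str]] = None,
                  var_name: str = "TXT2SPEC_SHARE") -> bool:
    """Return True only when the variable holds an explicit truthy value."""
    env = os.environ if env is None else env
    return env.get(var_name, "").strip().lower() in {"1", "true", "yes", "on"}
```

The launch call would then read `txt2spec.launch(share=share_enabled())`, so that sharing stays off unless someone opts in explicitly.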

