linear spectrogram encoder/decoder + image pipeline #4
sankdoy wants to merge 2 commits into Bebra777228:main
switch to linear STFT for encoding/decoding, auto-fit text height, add image normalisation/inversion + preview, add noise/flutter controls, fix UI scroll and prevent text clipping, improve spectrogram colour rendering
…-+-image-pipeline feat/encoder-ui-fixes
📝 Walkthrough
The `txt2spec.py` module is significantly enhanced with font discovery, text measurement utilities, and image transformation helpers. The text-to-spectrogram-image pipeline gains configurable flutter and noise effects. Audio processing is extended with a `hop_length` parameter and linear spectrogram generation. New end-to-end functions orchestrate image generation and audio synthesis, while the Gradio UI expands to expose rendering parameters.
Sequence Diagram(s)
sequenceDiagram
actor User
participant Gradio UI
participant Text Pipeline
participant Image Proc
participant Audio Synth
User->>Gradio UI: Input text + flutter/noise params
Gradio UI->>Text Pipeline: gradio_interface_fn()
Text Pipeline->>Text Pipeline: text_to_spectrogram_image()
Text Pipeline->>Text Pipeline: fit_font_size() → measure_text()
Text Pipeline->>Image Proc: load_font(), apply_flutter(), add_text_noise()
Image Proc->>Image Proc: normalize_image(), prepare_image_spectrogram()
Image Proc->>Text Pipeline: spectrogram image
Text Pipeline->>Audio Synth: spectrogram_image_to_audio()
Audio Synth->>Audio Synth: librosa conversion
Audio Synth->>Gradio UI: audio + spectrogram preview
Gradio UI->>User: Display audio & visualization
sequenceDiagram
actor User
participant Gradio UI
participant Image Pipeline
participant Audio Synth
User->>Gradio UI: Upload image
Gradio UI->>Image Pipeline: gradio_image_to_audio_fn()
Image Pipeline->>Image Pipeline: prepare_image_spectrogram()
Image Pipeline->>Image Pipeline: normalize_image() → spectrogram prep
Image Pipeline->>Audio Synth: spectrogram_image_to_audio()
Audio Synth->>Audio Synth: Convert to audio
Audio Synth->>Gradio UI: audio output
Gradio UI->>User: Play audio
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@txt2spec.py`:
- Around line 153-154: Replace the deprecated call cm.get_cmap(cmap_name) with
the supported API; using the existing import change it to
cm.colormaps.get_cmap(cmap_name) so the code becomes cmap =
cm.colormaps.get_cmap(cmap_name) before computing rgb from norm, leaving the
subsequent rgb = (cmap(norm)[:, :, :3] * 255).astype(np.uint8) unchanged.
🧹 Nitpick comments (8)
txt2spec.py (8)
11-27: Cross-platform font discovery is limited.
The font path list is heavily macOS-centric. Consider adding common Windows and additional Linux paths for broader compatibility:
 FONT_PATHS = [
     os.path.expanduser("~/Library/Fonts/Druk Wide.otf"),
     # ... existing macOS paths ...
     "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
+    "/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf",
+    "/usr/share/fonts/TTF/DejaVuSans-Bold.ttf",
+    # Windows paths
+    os.path.expandvars(r"%WINDIR%\Fonts\arial.ttf"),
+    os.path.expandvars(r"%WINDIR%\Fonts\arialbd.ttf"),
 ]
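As a rough illustration of how such a candidate list might be consumed, here is a minimal sketch that returns the first path present on disk. The names `CANDIDATE_FONTS` and `find_first_font` are hypothetical and are not taken from the PR; the actual lookup in `txt2spec.py` may work differently.

```python
import os
from typing import Iterable, Optional

# Hypothetical candidate list; the real FONT_PATHS in txt2spec.py may differ.
CANDIDATE_FONTS = [
    os.path.expanduser("~/Library/Fonts/Druk Wide.otf"),
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
    "/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf",
    # On non-Windows systems %WINDIR% is left unexpanded, so this entry
    # simply never matches there.
    os.path.expandvars(r"%WINDIR%\Fonts\arial.ttf"),
]


def find_first_font(paths: Iterable[str]) -> Optional[str]:
    """Return the first path that exists as a regular file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None
```

Returning `None` rather than raising lets the caller decide whether to fall back to a built-in default font.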
53-59: Broad exception catch is acceptable but could be narrowed.
The catch-all `Exception` handles font loading failures gracefully. For better diagnostics, consider catching more specific exceptions like `OSError` or `IOError`, which `truetype()` typically raises.
183-183: Unnecessary `int()` call.
In Python 3, `round()` with a single argument already returns an `int`, making the outer `int()` call redundant.
- new_width = max(1, int(round(width * scale)))
+ new_width = max(1, round(width * scale))
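A small sketch of the suggested form, wrapped in a hypothetical helper (`scaled_width` is not a name from the PR). It assumes `width` and `scale` are built-in `int`/`float` values; other numeric types may round differently.

```python
def scaled_width(width: int, scale: float) -> int:
    """Scale a pixel width, never returning less than 1."""
    # For built-in numbers, round() with one argument returns an int,
    # so no outer int() is needed.
    return max(1, round(width * scale))
```

Note that Python rounds halves to the nearest even integer, so `scaled_width(5, 0.5)` gives 2 rather than 3.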
231-238: Consider adding `strict=True` to `zip()`.
Since `text_widths` is derived from `text` in `fit_font_size`, they should always have matching lengths. Adding `strict=True` (Python 3.10+) would catch any future bugs if this invariant breaks:
- for char, char_width in zip(text, text_widths):
+ for char, char_width in zip(text, text_widths, strict=True):
245-250: Unused `sr` parameter is misleading.
The `sr` parameter is declared but never used in the function body. `librosa.griffinlim` doesn't require a sample rate. This makes the API misleading since callers might expect `sr` to affect the reconstruction.
Consider either:
- Removing `sr` from the signature and documenting that sample rate is only needed when writing the audio file
- Adding a docstring clarifying that `sr` is unused but kept for API consistency
-def spectrogram_image_to_audio(image, sr=DEFAULT_SAMPLE_RATE, hop_length=DEFAULT_HOP_LENGTH):
+def spectrogram_image_to_audio(image, sr=DEFAULT_SAMPLE_RATE, hop_length=DEFAULT_HOP_LENGTH):  # noqa: ARG001
+    """Convert spectrogram image to audio. Note: sr is unused by Griffin-Lim but kept for API consistency."""
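The point that the sample rate only matters at write time can be illustrated with the standard library alone. This is a sketch under assumptions: `encode_mono_wav` is a hypothetical helper, and the PR itself may write audio through a different library.

```python
import io
import struct
import wave


def encode_mono_wav(samples: list[float], sr: int) -> bytes:
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sr)  # the only place the sample rate is used
        wav.writeframes(frames)
    return buf.getvalue()
```

The same sample list written at 8000 Hz and at 16000 Hz contains identical audio data; only the header, and therefore the playback duration `len(samples) / sr`, changes.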
252-260: Unused `sr` parameter and potential temp file accumulation.
The `sr` parameter is declared but unused (same issue as `spectrogram_image_to_audio`).
Temporary files created with `delete=False` will accumulate over multiple runs. Consider implementing cleanup or documenting that callers are responsible for deletion.
286-292: Temp files accumulate across multiple pipeline runs.
Both `create_audio_with_spectrogram` and `create_audio_from_image` (and `save_linear_spectrogram`) create temp files with `delete=False`. In a long-running Gradio session, this could fill up temp storage.
Consider either:
- Implementing periodic cleanup
- Using a session-scoped temp directory that gets cleaned on app restart
- Documenting that users should periodically clear temp files
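One way the session-scoped option could look is sketched below. `SESSION_DIR` and `new_output_path` are hypothetical names, and cleanup here happens at normal interpreter exit rather than on restart; a crash or `kill -9` would still leave the directory behind.

```python
import atexit
import os
import shutil
import tempfile

# One directory per process, removed when the interpreter exits normally.
SESSION_DIR = tempfile.mkdtemp(prefix="txt2spec_")
atexit.register(shutil.rmtree, SESSION_DIR, ignore_errors=True)


def new_output_path(suffix: str) -> str:
    """Reserve a unique, empty file inside the session directory."""
    fd, path = tempfile.mkstemp(suffix=suffix, dir=SESSION_DIR)
    os.close(fd)  # callers reopen the path themselves when writing
    return path
```

Keeping every output under one directory means a single `rmtree` call covers audio and image files alike, without tracking individual paths.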
360-361: Hidden sliders could be simplified.
These sliders are hidden (`visible=False`) but still passed through the interface. Consider removing them from the UI and passing the defaults directly in `gradio_interface_fn` if they're not meant to be user-configurable:
def gradio_interface_fn(text, max_font_size, margin, letter_spacing, flutter_px, noise_amount, noise_strength):
    return create_audio_with_spectrogram(
        text, DEFAULT_BASE_WIDTH, DEFAULT_SPEC_HEIGHT, max_font_size, ...
    )
This simplifies the UI component graph.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
txt2spec.py
🧰 Additional context used
🪛 Ruff (0.14.11)
txt2spec.py
57-57: Do not catch blind exception: Exception
(BLE001)
180-180: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Value being cast to int is already an integer
Remove unnecessary int call
(RUF046)
231-231: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
245-245: Unused function argument: sr
(ARG001)
252-252: Unused function argument: sr
(ARG001)
300-300: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
txt2spec.py (7)
61-74: LGTM! The text measurement approach correctly handles custom letter spacing by measuring characters individually. The empty text fallback to a space is a reasonable edge-case handler.
76-101: LGTM! The binary search approach for font size fitting is efficient and handles edge cases well (no font path, nothing fits).
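For readers unfamiliar with the pattern, a binary search over font sizes might look roughly like this. It is a sketch, not the PR's `fit_font_size`: the `measure` callable, the size bounds, and the choice to fall back to the smallest size when nothing fits are all assumptions.

```python
from typing import Callable


def fit_size(measure: Callable[[int], int], limit: int, lo: int = 1, hi: int = 512) -> int:
    """Largest size in [lo, hi] with measure(size) <= limit, or lo if none fits.

    Assumes measure() is non-decreasing in size, as text extents normally are.
    """
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure(mid) <= limit:
            best = mid      # mid fits; try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too big; shrink the range
    return best
```

With a range of 1 to 512 this needs at most ten measurements, compared with up to 512 for a linear scan.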
103-124: LGTM! The flutter implementation with smoothed random offsets creates a natural wavy effect. The bounds checking correctly handles edge cases where shifts exceed image height.
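The idea of smoothed per-column shifts can be sketched in pure Python on a row-major list of lists. This is an illustration only: `flutter_columns`, the moving-average smoothing, and the `radius` parameter are assumptions, and the PR's `apply_flutter` most likely operates on NumPy arrays with its own smoothing.

```python
import random


def smooth(values, radius):
    """Moving average with the window clipped at both ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def flutter_columns(image, amplitude_px, radius=4, seed=None):
    """Shift each column vertically by a smoothed random offset."""
    height, width = len(image), len(image[0])
    if amplitude_px <= 0:
        return [row[:] for row in image]
    rng = random.Random(seed)
    raw = [rng.uniform(-amplitude_px, amplitude_px) for _ in range(width)]
    offsets = [round(v) for v in smooth(raw, radius)]
    out = [[0] * width for _ in range(height)]
    for x, shift in enumerate(offsets):
        for y in range(height):
            src = y - shift
            if 0 <= src < height:  # rows shifted past the edge stay blank
                out[y][x] = image[src][x]
    return out
```

Smoothing the raw offsets is what turns independent per-column jitter into a slow wave, because neighbouring columns end up with similar shifts.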
126-137: LGTM! The noise application correctly targets only text pixels (non-zero) and properly combines Gaussian noise with random dropout. The early return condition correctly requires both parameters to be disabled.
157-172: LGTM! The auto-inversion heuristic and percentile-based normalization are robust approaches for handling varied input images. The 1st-99th percentile clipping effectively handles outliers.
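A pure-Python stand-in for this kind of normalization is sketched below on a flat list of grayscale values in [0, 1]. The inversion rule (invert when the mean exceeds 0.5) and the nearest-rank percentile are assumptions made for illustration; the review does not state the exact heuristic used by `normalize_image`.

```python
def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def normalize_pixels(pixels, invert_threshold=0.5):
    """Stretch values to [0, 1] using 1st-99th percentile clipping.

    Mostly-bright inputs are inverted first (assumed heuristic) so that
    the foreground ends up as the high-energy part of the spectrogram.
    """
    if sum(pixels) / len(pixels) > invert_threshold:
        pixels = [1.0 - p for p in pixels]
    ordered = sorted(pixels)
    lo, hi = percentile(ordered, 1), percentile(ordered, 99)
    if hi <= lo:
        return [0.0 for _ in pixels]  # flat image: nothing to stretch
    return [min(1.0, max(0.0, (p - lo) / (hi - lo))) for p in pixels]
```

Clipping to the 1st and 99th percentiles means a handful of extreme pixels cannot compress the rest of the image into a narrow range.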
313-341: LGTM! Clean separation between Gradio interface wrappers and core logic. The thin wrapper pattern makes the core functions testable independently.
418-418: `share=True` creates a public URL.
Using `share=True` creates a publicly accessible URL through Gradio's sharing service. Ensure this is intentional, especially if deployed in production. For local-only use:
-txt2spec.launch(share=True)
+txt2spec.launch(share=False)  # or make configurable via environment variable
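The environment-variable option mentioned in the suggestion could be sketched as follows. The variable name `TXT2SPEC_SHARE` and the helper `share_enabled` are hypothetical; nothing in the PR defines them.

```python
import os


def share_enabled(env=os.environ, var="TXT2SPEC_SHARE"):
    """Return True only when the (hypothetical) variable holds a truthy value."""
    return env.get(var, "").strip().lower() in {"1", "true", "yes", "on"}


# Usage, assuming `txt2spec` is the Gradio app object from the PR:
# txt2spec.launch(share=share_enabled())
```

Defaulting to `False` when the variable is unset keeps local runs private unless sharing is requested explicitly.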