Refactor: Consolidate duplicate functions in Python codebase #13

@skywinder

Summary

Code review identified several duplicate function implementations across the Python codebase that should be consolidated into shared modules.

True Duplicates (should be consolidated)

1. wav_to_array - Near-exact duplicate

  • Locations: whisper_server/server.py:49 and chunking.py:74
  • Description: Nearly identical code for converting WAV audio to a NumPy array
  • Recommendation: Extract to lib/audio.py

2. read_codec - Near-exact duplicate

  • Locations: whisper_server/server.py:67 and chunking.py:119
  • Description: Nearly identical code for reading an audio file's codec via ffmpeg
  • Recommendation: Extract to lib/audio.py
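
As a rough illustration of what a shared lib/audio.py helper could look like, here is a minimal sketch of wav_to_array. The signature, the float32 normalization, the mono downmix, and the 16-bit-only restriction are all assumptions made for the example; the real implementations in server.py and chunking.py may behave differently and should be diffed before consolidating.

```python
import io
import wave
from typing import Tuple

import numpy as np


def wav_to_array(wav_bytes: bytes) -> Tuple[np.ndarray, int]:
    """Decode 16-bit PCM WAV bytes into (float32 samples, sample_rate).

    Assumed behavior for this sketch only: samples are scaled to
    [-1.0, 1.0) and multi-channel audio is averaged down to mono.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Interpret the raw bytes as little-endian int16 and normalize.
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Frames are interleaved, so reshape to (n_frames, channels).
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples, sample_rate
```

Both call sites would then replace their local copy with `from lib.audio import wav_to_array`, assuming lib is importable from each entry point.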

3. combine_chunks_to_wav - Similar implementation

  • Locations: diarization_worker.py:196 and play.py:103
  • Description: Both combine opus chunks into WAV with gap/silence handling
  • Recommendation: Extract common logic to shared module
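
The common logic might reduce to something like the sketch below. It is hypothetical: it assumes the opus chunks have already been decoded to 16-bit mono PCM, that each chunk carries a start offset in seconds, and that gaps are filled with zero-valued samples. The function name combine_pcm_chunks_to_wav and its parameters are invented for illustration and do not come from either existing implementation.

```python
import io
import wave
from typing import Iterable, Tuple

SAMPLE_WIDTH = 2  # bytes per sample; 16-bit PCM is an assumption


def combine_pcm_chunks_to_wav(
    chunks: Iterable[Tuple[float, bytes]],
    sample_rate: int = 16000,
) -> bytes:
    """Join (start_seconds, pcm_bytes) chunks into one mono WAV.

    Gaps between chunks are padded with silence. Overlapping chunks
    are simply appended without trimming in this sketch.
    """
    out = bytearray()
    for start_seconds, pcm in sorted(chunks, key=lambda c: c[0]):
        # Byte offset at which this chunk is expected to begin.
        expected = int(round(start_seconds * sample_rate)) * SAMPLE_WIDTH
        gap = expected - len(out)
        if gap > 0:
            out.extend(b"\x00" * gap)  # zero bytes are silence in signed PCM
        out.extend(pcm)

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(out))
    return buf.getvalue()
```

Keeping the opus decoding outside this function would let diarization_worker.py and play.py share the gap handling even if they decode chunks differently.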

4. mongo_cursor - Duplicate in playground

  • Locations: lib/worker.py:52 (shared utility) and playground.local.py:16
  • Recommendation: playground.local.py should drop its local copy and import mongo_cursor from lib/worker

Similar but different (may be intentional)

5. format_eta / _format_eta

  • stt.py:167 returns "02:15:30" format
  • diarization_worker.py:127 returns "0:15:30" (timedelta string)
  • Note: Different output formats - may be intentional
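
If the two helpers were ever merged, one option is a single function with a flag that selects the format. The sketch below is only an illustration of the two output styles described above; the name format_eta and the zero_pad parameter are assumptions, not the existing signatures.

```python
from datetime import timedelta


def format_eta(seconds: float, zero_pad: bool = True) -> str:
    """Format a duration in seconds as an ETA string.

    zero_pad=True  gives the "02:15:30" style reported for stt.py.
    zero_pad=False gives str(timedelta), the "0:15:30" style reported
    for diarization_worker.py. Note that str(timedelta) switches to
    "1 day, 0:00:00" once the duration reaches 24 hours.
    """
    total = int(seconds)
    if not zero_pad:
        return str(timedelta(seconds=total))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_eta(8130))                 # 02:15:30
print(format_eta(930, zero_pad=False))  # 0:15:30
```

Whether to merge depends on whether anything parses these strings downstream; if so, the flag keeps both formats available.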

6. get_worker_id / _get_worker_id

  • lib/worker.py:47 returns "hostname_pid"
  • processors/vad.py:35 returns "py-vad:hostname:pid" (different prefix)
  • Note: Different formats to distinguish worker types - likely intentional
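
Should these ever be unified while keeping both formats, a shared helper with an optional prefix is one possibility. This is a sketch under assumptions: the prefix parameter is hypothetical, and the use of socket.gethostname() and os.getpid() is a guess at how the existing functions obtain their values.

```python
import os
import socket
from typing import Optional


def get_worker_id(prefix: Optional[str] = None) -> str:
    """Build a worker id from the hostname and process id.

    With no prefix:   "hostname_pid"        (lib/worker.py style)
    With a prefix:    "prefix:hostname:pid" (processors/vad.py style)
    """
    host = socket.gethostname()
    pid = os.getpid()
    if prefix is None:
        return f"{host}_{pid}"
    return f"{prefix}:{host}:{pid}"
```

Under this sketch, processors/vad.py would call `get_worker_id("py-vad")` and existing callers of lib/worker would be unchanged.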

Proposed Solution

  1. Create lib/audio.py with shared audio functions (wav_to_array, read_codec, potentially combine_chunks_to_wav)
  2. Update whisper_server/server.py and chunking.py to import from lib/audio.py
  3. Fix playground.local.py to import mongo_cursor from lib/worker
  4. Consider consolidating the format_eta functions if a single, consistent output format is acceptable to all callers

Benefits

  • Reduced code duplication
  • Single source of truth for audio processing utilities
  • Easier maintenance and bug fixes
