A voice-to-text transcription tool that automatically types your spoken words using OpenAI Whisper. This simple Python application is packaged using Nix for maximum ease of use and reproducible builds.
nix run github:emailnjv/whisper-input
Once you run the command:
- Wait for a notification telling you to start speaking
- Start speaking clearly into your microphone
- Stop speaking when finished
- A notification will inform you that transcription is complete
- Your spoken text will be automatically typed in the currently focused text field
The application follows a simple pipeline architecture:
Audio Input → Recording → Silence Detection → Transcription → Text Output
- Audio Recording Module (record_speech)
  - Captures microphone input using PyAudio
  - Implements real-time silence detection
  - Saves audio to a temporary WAV file
- Transcription Engine (transcribe_speech)
  - Uses OpenAI Whisper for speech-to-text conversion
  - Loads the "base" model for balanced accuracy and performance
  - Processes recorded audio files
- Text Input Module (type_text)
  - Simulates keyboard input using pynput
  - Types transcribed text directly into active applications
- User Interface (Notifications & Audio Feedback)
  - Desktop notifications for status updates
  - Optional audio beeps for workflow feedback
  - Visual icons for different states
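As a rough illustration of the Text Input Module, typing through pynput might look like the sketch below. The `keyboard` parameter is an assumption added here so the function can be exercised without a display server; the project's actual `type_text` may be written differently.

```python
def type_text(text, keyboard=None):
    """Type `text` into whichever window currently has keyboard focus."""
    if keyboard is None:
        # Imported lazily: pynput needs a running display server.
        from pynput.keyboard import Controller
        keyboard = Controller()
    text = text.strip()
    if text:
        # Controller.type() sends the string as simulated key presses.
        keyboard.type(text)
    return text
```

Passing a stand-in object with a `type` method makes it possible to check the behavior without sending real keystrokes.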
graph TD
A[Start Application] --> B[Play Start Beep]
B --> C[Show 'Start Speaking' Notification]
C --> D[Begin Audio Recording]
D --> E[Monitor Audio Levels]
E --> F{Silence Detected?}
F -->|No| E
F -->|Yes| G[Stop Recording]
G --> H[Save Audio to Temp File]
H --> I[Show 'Processing' Notification]
I --> J[Load Whisper Model]
J --> K[Transcribe Audio]
K --> L[Type Transcribed Text]
L --> M[Play End Beep]
M --> N[Show 'Complete' Notification]
N --> O[Clean Up & Exit]
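The workflow above can be sketched as a single function that wires the stages together. This is a minimal sketch, not the project's code: `run_once` and its parameters (`record`, `transcribe`, `type_out`, `notify`, `beep`) are hypothetical names, and each stage is passed in as a callable so the control flow is visible on its own.

```python
import os

def run_once(record, transcribe, type_out, notify, beep=None):
    """Run one record -> transcribe -> type cycle and return the text."""
    if beep:
        beep()                       # optional start beep
    notify("Start speaking")
    wav_path = record()              # blocks until silence is detected
    try:
        notify("Processing")
        text = transcribe(wav_path)
        type_out(text)
    finally:
        # Remove the temporary recording even if transcription fails.
        if os.path.exists(wav_path):
            os.remove(wav_path)
    if beep:
        beep()                       # optional end beep
    notify("Complete")
    return text
```

Keeping the stages as plain callables means each one can be replaced or tested in isolation.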
The application integrates with OpenAI Whisper through the openai-whisper Python package:
- Model Loading: whisper.load_model("base")
  - Uses the "base" model (roughly 74M parameters) for balanced performance
  - Loaded once per session for efficiency
  - Other available models: tiny, small, medium, large
- Transcription: model.transcribe(file_path)
  - Processes WAV audio files
  - Returns a structured result containing the transcribed text
  - Handles various audio qualities and languages
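The two calls above might be combined as in the following sketch. `get_model` and `transcribe_file` are hypothetical helper names, and caching with `lru_cache` is just one assumed way to get "loaded once per session"; only `whisper.load_model` and `model.transcribe` come from the openai-whisper package.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_model(name="base"):
    """Load a Whisper model once and reuse it for later calls."""
    import whisper  # imported lazily; provided by openai-whisper
    return whisper.load_model(name)

def transcribe_file(file_path, model=None):
    """Return the transcribed text for an audio file."""
    if model is None:
        model = get_model()
    result = model.transcribe(file_path)  # dict that includes a "text" key
    return result["text"].strip()
```

A call such as `transcribe_file("recording.wav")` would download the model on first use if it is not already cached locally.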
- PyAudio: Cross-platform audio I/O library
  - Format: 16-bit PCM
  - Sample Rate: 44,100 Hz
  - Channels: Mono (1 channel)
  - Buffer Size: 1024 frames
- Silence Detection: Custom implementation using RMS (Root Mean Square)
  - Threshold: 500 (configurable)
  - Duration: 5-10 seconds (configurable)
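To make the RMS approach concrete, here is a minimal sketch of silence detection over 16-bit mono PCM chunks. The function names `rms` and `should_stop` are hypothetical, and the rule of resetting the counter whenever a loud chunk arrives is an assumption; the project's own implementation may differ in detail.

```python
import array
import math

RATE = 44100              # samples per second
SILENCE_THRESHOLD = 500   # RMS level below which a chunk counts as silent

def rms(chunk):
    """Root mean square of a chunk of 16-bit signed PCM samples."""
    samples = array.array("h", chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def should_stop(chunks, silence_duration=5.0, threshold=SILENCE_THRESHOLD):
    """Return True once consecutive silence reaches silence_duration seconds."""
    silent_seconds = 0.0
    for chunk in chunks:
        if rms(chunk) < threshold:
            # Each 16-bit mono sample occupies 2 bytes.
            silent_seconds += (len(chunk) // 2) / RATE
            if silent_seconds >= silence_duration:
                return True
        else:
            silent_seconds = 0.0  # speech resumed, so start counting again
    return False
```

In a live recorder the `chunks` iterable would be fed by successive reads of 1024 frames from the input stream.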
# Ubuntu/Debian
sudo apt-get install portaudio19-dev python3-dev
# Fedora/RHEL
sudo dnf install portaudio-devel python3-devel
# Arch Linux
sudo pacman -S portaudio
# macOS (Homebrew)
brew install portaudio
PyAudio wheels are available on PyPI for Windows.
The project includes a complete Nix flake for reproducible builds:
# Development shell
nix develop
# Direct execution
nix run github:emailnjv/whisper-input
# Build locally
nix build
If not using Nix, install Python dependencies:
pip install openai-whisper pyaudio pynput plyer termcolor beepy
Note: PyAudio installation may require system-level audio development libraries.
whisper-input [OPTIONS]
Options:
--silence_duration INT Duration of silence before stopping recording (seconds) [default: 5]
--beep Enable audio beep feedback at start and end
--help Show help message and exit
# Use default 5-second silence duration
whisper-input
# Custom silence duration with beep feedback
whisper-input --silence_duration 3 --beep
# Longer silence duration for slower speech
whisper-input --silence_duration 10
The application automatically detects:
- Microphone input device (uses system default)
- Temporary directory for audio files
- Desktop notification system
- Icon files location
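For reference, the command-line options documented above could be declared with `argparse` roughly as follows. This is a sketch based only on the option list in this README; `build_parser` is a hypothetical name and the real script may define its arguments differently.

```python
import argparse

def build_parser():
    """Build a parser for the options listed in this README."""
    parser = argparse.ArgumentParser(prog="whisper-input")
    parser.add_argument(
        "--silence_duration", type=int, default=5,
        help="Duration of silence before stopping recording (seconds)",
    )
    parser.add_argument(
        "--beep", action="store_true",
        help="Enable audio beep feedback at start and end",
    )
    return parser
```

For example, `build_parser().parse_args(["--silence_duration", "3", "--beep"])` yields `silence_duration=3` and `beep=True`.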
- tiny: Fastest, least accurate (~32x the speed of large)
- base: Balanced performance (default, ~16x the speed of large)
- small: Better accuracy (~6x the speed of large)
- medium: High accuracy (~2x the speed of large)
- large: Best accuracy (1x, the reference speed)
- Base model: ~1 GB RAM during transcription
- Audio buffer: Minimal (~1 MB for typical recordings)
- Temporary files: Cleaned up automatically
- Recording: Real-time with < 100ms latency
- Transcription: Depends on model and audio length
- Base model: ~2-5 seconds for 10-second audio
- Text input: Near-instantaneous
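The buffer and memory figures can be sanity-checked from the audio settings given earlier (16-bit mono at 44,100 Hz with 1024-frame buffers). The helper names below are hypothetical and the calculation covers raw PCM only, ignoring WAV headers and Python overhead.

```python
RATE = 44100        # samples per second
SAMPLE_WIDTH = 2    # bytes per sample (16-bit PCM)
CHANNELS = 1        # mono
CHUNK = 1024        # frames per buffer

def chunk_ms():
    """Duration of one buffer in milliseconds."""
    return CHUNK / RATE * 1000

def recording_bytes(seconds):
    """Raw PCM size of a recording of the given length."""
    return int(seconds * RATE * SAMPLE_WIDTH * CHANNELS)

print(f"one buffer spans about {chunk_ms():.1f} ms")
print(f"10 s of audio is about {recording_bytes(10) / 1e6:.2f} MB")
```

One buffer covers roughly 23 ms, which fits within the sub-100 ms recording latency stated above, and a 10-second recording comes to a little under 1 MB.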
- No microphone input
  - Check system audio permissions
  - Verify microphone is not muted
  - Test with other applications
- PyAudio installation fails
  - Install system audio development libraries
  - Use Nix environment for automatic dependency management
- Transcription accuracy issues
  - Speak clearly and at moderate pace
  - Reduce background noise
  - Consider using a larger Whisper model
  - Adjust silence duration for your speaking pattern
- Text not typing in application
  - Ensure target application is focused
  - Check if application accepts programmatic input
  - Some secure applications may block automated input
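When diagnosing a missing microphone, it can help to list the input devices PyAudio can see. The sketch below is a standalone helper and not part of the application; `list_input_devices` is a hypothetical name, and the optional `pa` argument exists only so the logic can be tried without audio hardware.

```python
def list_input_devices(pa=None):
    """Return (index, name) for every device that can record audio."""
    owns_pa = pa is None
    if owns_pa:
        import pyaudio  # imported lazily; requires PortAudio
        pa = pyaudio.PyAudio()
    try:
        devices = []
        for i in range(pa.get_device_count()):
            info = pa.get_device_info_by_index(i)
            # Devices with no input channels are output-only.
            if info.get("maxInputChannels", 0) > 0:
                devices.append((i, info["name"]))
        return devices
    finally:
        if owns_pa:
            pa.terminate()
```

An empty result suggests the problem lies with system permissions or drivers instead of the application itself.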
For debugging, the application provides colored terminal output:
- Yellow: Configuration warnings
- Green: Usage examples
- Default: Normal operation status
- Local Processing: All transcription happens locally using Whisper
- No Network Calls: No data sent to external services
- Temporary Files: Audio files are automatically cleaned up
- Permissions: Requires microphone and keyboard input access
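One way to guarantee that temporary audio never outlives a run is to wrap the WAV file in a context manager. This is a sketch under assumptions, using only the standard library: `temp_wav` is a hypothetical helper, and the application may handle cleanup differently.

```python
import os
import tempfile
import wave
from contextlib import contextmanager

@contextmanager
def temp_wav(frames, rate=44100, channels=1, sample_width=2):
    """Write PCM frames to a temporary WAV file and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # wave reopens the file by path
    try:
        with wave.open(path, "wb") as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)
            wf.setframerate(rate)
            wf.writeframes(frames)
        yield path
    finally:
        # Runs on normal exit and on exceptions alike.
        if os.path.exists(path):
            os.remove(path)
```

Used as `with temp_wav(frames) as path: ...`, the file exists only for the duration of the block.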
The project uses Nix flakes for reproducible development environments:
# Enter development shell
nix develop
# Run locally
python src/whisper-input.py --help
# Test changes
nix build
This project follows the same license as its dependencies. Please check individual package licenses for compliance requirements.