Skip to content

Arthur-Ficial/ohr

Repository files navigation

ohr

Version 0.1.3 Swift 6.3+ macOS 26+ No Xcode Required License: MIT 100% On-Device

On-device speech-to-text on your Mac. Transcribe audio files, generate subtitles, stream from the microphone — all locally, no cloud.

No API keys. No network. No subscriptions. The speech recognition is already on your computer — ohr lets you use it.

What is this

Every Mac with Apple Silicon has a built-in speech recognizer — Apple's on-device SpeechAnalyzer, shipped as part of the Speech framework (macOS 26+). ohr wraps it in a CLI and an OpenAI-compatible HTTP server — so you can actually use it. All inference runs on-device, no network calls.

  • UNIX toolohr meeting.m4a — file in, text out. Pipe-friendly, multiple output formats, proper exit codes
  • Subtitle generatorohr -o srt lecture.wav > lecture.srt — SRT and VTT with precise timestamps
  • Live transcriptionohr --listen — real-time microphone input
  • OpenAI-compatible serverohr --serve — drop-in replacement for POST /v1/audio/transcriptions
  • Zero cost — no API keys, no cloud, no subscriptions, 30 languages supported

Requirements & Install

  • Apple Silicon Mac, macOS 26 Tahoe or newer
  • Building from source requires Command Line Tools with macOS 26 SDK (ships Swift 6.3). No Xcode required.

Homebrew (recommended):

brew tap Arthur-Ficial/tap
brew install Arthur-Ficial/tap/ohr

Build from source:

git clone https://github.com/Arthur-Ficial/ohr.git
cd ohr
make install

Quick Start

Transcribe a file

ohr meeting.m4a

Generate subtitles

# SRT format
ohr -o srt lecture.wav > lecture.srt

# WebVTT format
ohr --vtt interview.m4a > interview.vtt

JSON output with segments

ohr -o json recording.m4a | jq .
{
  "model": "apple-speechanalyzer",
  "text": "Hello, this is a test of the speech to text system.",
  "segments": [
    { "id": 0, "start": 0.0, "end": 1.86, "text": "Hello, this is a test" },
    { "id": 1, "start": 1.86, "end": 4.44, "text": "of the speech to text system." }
  ],
  "duration": 4.44,
  "language": "en",
  "metadata": { "on_device": true, "version": "0.1.3" }
}

Timestamps in plain text

ohr --timestamps meeting.m4a
[00:00:00,000] Hello, this is a test
[00:00:01,860] of the speech to text system.

Pipe from stdin

cat recording.wav | ohr

Pipe to apfel for summarization

ohr meeting.m4a | apfel "summarize this meeting"

Live microphone transcription

ohr --listen                    # plain text with timestamps
ohr --listen --json             # JSONL stream
ohr --listen --srt              # SRT as you speak

Select language

ohr -l de-DE meeting.m4a       # German
ohr -l fr-FR interview.m4a     # French
ohr -l ja-JP recording.m4a     # Japanese

OpenAI-compatible server

# Start server
ohr --serve

# In another terminal:
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@meeting.m4a \
  -F model=apple-speechanalyzer
{"text": "Hello, this is a test of the speech to text system."}

All five response formats:

# JSON (default)
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.m4a -F response_format=json

# Verbose JSON (with segments, timestamps, duration)
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.m4a -F response_format=verbose_json

# Plain text
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.m4a -F response_format=text

# SRT subtitles
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.m4a -F response_format=srt

# WebVTT subtitles
curl -X POST http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.m4a -F response_format=vtt

Demos

See demo/ for real-world shell scripts powered by ohr.

subtitle — generate subtitles:

demo/subtitle lecture.m4a --save           # saves lecture.srt next to file

audio-grep — search inside audio files:

demo/audio-grep "budget" meetings/*.m4a    # find mentions with timestamps
demo/audio-grep -c "deadline" *.m4a        # count matches per file

minutes — meeting to minutes (ohr + apfel):

demo/minutes standup.m4a -o markdown > standup.md

batch-transcribe — transcribe a whole folder:

demo/batch-transcribe ~/recordings/ -o srt

whisper-compat — drop-in Whisper CLI replacement:

demo/whisper-compat audio.m4a --output_format srt --language en

Also in demo/:

OpenAI API Compatibility

Base URL: http://localhost:11434/v1

Feature Status Notes
POST /v1/audio/transcriptions Supported All 5 response formats
GET /v1/models Supported Returns apple-speechanalyzer
GET /health Supported Model availability, formats, languages
GET /v1/logs Debug only Available with --debug
GET /v1/logs/stats Debug only Available with --debug
response_format Supported json, verbose_json, text, srt, vtt
language Supported BCP-47 language code
model Accepted Ignored (only one model)
prompt Accepted Ignored (SpeechAnalyzer doesn't support prompting)
temperature Accepted Validated (0.0–1.0)
CORS Supported Enable with --cors
Token auth Supported --token <secret> or --token-auto
POST /v1/chat/completions 501 Use apfel
POST /v1/embeddings 501 Use kern

Supported Formats

Format Extensions
Apple M4A .m4a
WAV .wav, .wave
MP3 .mp3
MPEG-4 .mp4
Core Audio .caf
AIFF .aiff, .aif
FLAC .flac

Supported Languages

30 languages including English, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Chinese (Simplified, Traditional, Cantonese).

ohr --model-info    # full list

Performance

Tested on Apple M2 with synthetic speech. Real-world performance may vary.

Audio Length Transcribe Time Speed
5 seconds 300ms 8x real-time
30 seconds 600ms 39x real-time
1 minute 1.5s 46x real-time
3 minutes 2.5s 58x real-time
10 minutes 4.7s 57x real-time

10 minutes of audio transcribes in under 5 seconds. No upper limit found.

Limitations

Constraint Detail
Platform macOS 26+, Apple Silicon only
Model One model (apple-speechanalyzer), not configurable
Accuracy ~90-95% on clear synthetic speech. Lower on real speech with noise, accents, or multiple speakers
Numbers Spoken numbers sometimes confused with ordinals ("five second" → "52nd")
Languages 30 languages supported, but accuracy tested only for English. Other languages need the -l flag
No diarization Cannot distinguish between different speakers
Audio formats m4a, wav, mp3, mp4, caf, aiff, flac. No OGG, OPUS, or WebM
apfel integration When piping to apfel, transcripts longer than ~3000 words may exceed apfel's 4096-token context window
Testing gap All testing used synthetic say command speech, not real human recordings

See docs/testing.md for the full QA report with methodology and detailed results.

CLI Reference

ohr <file>                     Transcribe audio file
ohr -o srt <file>              Generate SRT subtitles
ohr -o vtt <file>              Generate VTT subtitles
ohr -o json <file>             JSON output with segments
ohr --listen                   Live microphone transcription
ohr --serve                    Start OpenAI-compatible server
cat audio.wav | ohr            Transcribe from stdin

Output options:

Flag Description
-o, --output <fmt> Output format: plain (default), json, srt, vtt
--json Shorthand for -o json
--srt Shorthand for -o srt
--vtt Shorthand for -o vtt
--timestamps Show timestamps in plain text output
-l, --language <code> Language code (e.g. en-US, de-DE)
-q, --quiet Suppress headers and chrome
--no-color Disable ANSI colors

Server options (--serve):

Flag Description
--port <n> Server port (default: 11434)
--host <addr> Bind address (default: 127.0.0.1)
--cors Enable CORS headers for browser clients
--allowed-origins <origins> Add comma-separated allowed origins
--no-origin-check Disable origin checking
--token <secret> Require Bearer token authentication
--token-auto Generate and print a random Bearer token
--public-health Keep /health unauthenticated on non-loopback
--footgun Disable all protections
--max-concurrent <n> Max concurrent requests (default: 5)
--debug Verbose logging and enable /v1/logs endpoints

Info options:

Flag Description
-v, --version Print version
-h, --help Show help
--release Show detailed build info
--model-info Show model capabilities and languages

Exit Codes

Code Meaning
0 Success
1 Runtime error
2 Usage error (bad flags)
3 Unsupported audio format
4 File not found
5 Transcription failed
6 Rate limited

Environment Variables

Variable Description
OHR_PORT Server port (default: 11434)
OHR_HOST Server bind address (default: 127.0.0.1)
OHR_TOKEN Bearer token for server authentication
OHR_LANGUAGE Default language code
NO_COLOR Disable colors (no-color.org)

Architecture

CLI (file/stdin/mic) ──┐
                       ├──→ Speech.SpeechAnalyzer (file transcription)
                       ├──→ Speech.SpeechTranscriber (live microphone)
HTTP Server (/v1/*) ───┘    (100% on-device, zero network)

Built with Swift 6.3 strict concurrency. Single Package.swift, three targets:

  • OhrCore — pure logic library (no Speech framework dependency, unit-testable)
  • ohr — executable (CLI + server)
  • ohr-tests — 109 unit tests

No Xcode required. Builds and tests with Command Line Tools only.

Build & Test

# Build + install (auto-bumps patch version each time)
make install                             # build release + install to /usr/local/bin
make build                               # build release only (no install)

# Version management (zero manual editing)
make version                             # print current version
make release-minor                       # bump minor: 0.1.x -> 0.2.0
make release-major                       # bump major: 0.x.y -> 1.0.0

# Debug build (no version bump, uses swift directly)
swift build                              # quick debug build

# Unit tests
swift run ohr-tests                      # 109 pure Swift unit tests (no XCTest needed)

# Integration tests (requires server running)
ohr --serve --token test --debug &       # start server
OHR_TEST_TOKEN=test python3 -m pytest Tests/integration/ -v  # 42 integration tests

Every make build/make install automatically:

  • Bumps the patch version (.version file is the single source of truth)
  • Generates build metadata (commit, date, Swift version) viewable via ohr --release

Part of the apfel ecosystem

Tool What Apple Framework Repo
apfel LLM (text generation) FoundationModels golden example
ohr Speech-to-text SpeechAnalyzer you are here
kern Text embeddings NLContextualEmbedding sister project
auge Vision / OCR Vision sister project

Meta-repo: apfel-ecosystem

License

MIT

About

On-device speech-to-text CLI and OpenAI-compatible server - Apple SpeechAnalyzer from the command line

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors