# ocr2srt

A script to parse ffprobe output for converting bitmap subtitles via OCR to SRT.
The goal of this script is to enable conversion of bitmap subtitles, such as those found in rips of physical media, to SRT subtitles, using only ffmpeg and the script itself. Bitmap-to-text subtitle conversion is already possible with other tools, such as VobSub2SRT or SubtitleEdit, which may offer more features, but these also introduce more dependencies and more build-time complexity. This script aims to be terminal-based and almost entirely standalone, making it easy to use inside e.g. Docker containers, with no need to install any additional libraries or packages: there is nothing to compile, no new dependencies to install, and it utilises standard Linux CLI features for IO. The only requirement is a version of ffmpeg compiled with Tesseract OCR support.
## Requirements

As stated in the introduction, the script itself has no other requirements. However, you must use a version of ffmpeg compiled with libtesseract enabled. Your distribution may or may not provide this; at the time of writing, the ffmpeg binary shipped in the Ubuntu repos does not include libtesseract support. If you need to compile ffmpeg yourself, please see the compilation guide on the ffmpeg wiki. To enable support for libtesseract, you will need to add `--enable-libtesseract` in the configure step.
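To check whether your ffmpeg binary was built with libtesseract, you can look for the flag in the configuration dump that `ffmpeg -buildconf` prints. A minimal sketch of such a check (the function names here are illustrative, not part of the script):

```python
import subprocess

def buildconf_has_libtesseract(buildconf: str) -> bool:
    """Check an ffmpeg build-configuration dump for the libtesseract flag."""
    return "--enable-libtesseract" in buildconf.split()

def ffmpeg_has_libtesseract() -> bool:
    """Ask the local ffmpeg binary for its build configuration.

    ffmpeg prints informational output on stderr, so check both streams.
    """
    p = subprocess.run(["ffmpeg", "-hide_banner", "-buildconf"],
                       capture_output=True, text=True)
    return buildconf_has_libtesseract(p.stdout + p.stderr)
```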
## Basic Usage

The script is written to allow ffprobe to do the heavy lifting of extracting text. To utilise Tesseract OCR through ffprobe, first download the Tesseract data, e.g.
```
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
```

We can then run the OCR with ffprobe as:
```
ffprobe -v error -show_entries 'frame=best_effort_timestamp_time:frame_tags=lavfi.ocr.text' -of compact=p=0:e=c:nk=1 -f lavfi -i "movie=/path/to/video,ocr=datapath=."
```

Here, we assume that the media file at /path/to/video has subtitles baked into it (see Generating a subtitle video stream for how to handle video files with a separate dvdsub/vobsub stream).

ffprobe will output a line for each frame of the input, containing the timestamp in seconds of the frame, followed by a pipe character (`|`), followed by any text identified by the OCR in C-escaped format (i.e. line breaks are represented as `\n`, `\` characters are represented as `\\`, etc., with the pipe character additionally represented as `\|` to distinguish it from the separator).

We can pipe this into ocr2srt, and it will parse and convert this information to SRT, printing it on stdout. If you want to save this to an SRT file on disk, simply redirect it to a file. The complete resultant command looks like this:
```
ffprobe -v error -show_entries 'frame=best_effort_timestamp_time:frame_tags=lavfi.ocr.text' -of compact=p=0:e=c:nk=1 -f lavfi -i "movie=/path/to/video,ocr=datapath=." | ocr2srt > subs.srt
```

## Generating a subtitle video stream

The basic usage syntax above assumes that the subtitles are already baked into your video stream. If your video file instead includes a separate subtitle stream, we need to convert this into a video stream for the OCR to operate on. To make it as easy as possible for the OCR to read the subtitles, we can bake them onto a blank background instead of overlaying them on the original video; this will usually produce better results, and is much more efficient. To do this, you can use the following command:
```
ffmpeg -i /path/to/video -filter_complex "color=color=black[c]; [c][0:s:0]scale2ref=oh*mdar:h=in_h,overlay=shortest=1[v]" -map "[v]" -an -vcodec libx264 subs.mkv
```

This will read the first subtitle stream from the video (`[0:s:0]` in the complex filter; you can adjust this to choose a different subtitle stream if multiple are available) and overlay it on a completely black background, sized to the native resolution of the bitmap subtitles. Here we output to an H.264 stream via libx264 for simplicity, but you can realistically use any output codec. We can then use the output file, subs.mkv, as our input to ffprobe (see Basic Usage).
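To find out which value of N to use in `[0:s:N]`, it helps to list the file's subtitle streams first; ffprobe can print them as CSV. A sketch of such a listing (the helper names are illustrative); note that `[0:s:N]` counts subtitle streams in order, so the Nth line printed corresponds to `[0:s:N]`:

```python
import subprocess

def parse_stream_list(csv_text: str) -> list[tuple[int, str]]:
    """Parse 'index,codec_name' CSV lines from ffprobe into (index, codec) pairs."""
    streams = []
    for line in csv_text.strip().splitlines():
        idx, codec = line.split(",")[:2]
        streams.append((int(idx), codec))
    return streams

def list_subtitle_streams(path: str) -> list[tuple[int, str]]:
    """List the subtitle streams (container index, codec) of a media file."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "s",
         "-show_entries", "stream=index,codec_name", "-of", "csv=p=0", path],
        capture_output=True, text=True).stdout
    return parse_stream_list(out)
```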
Alternatively, to avoid creating an intermediate temporary file, we can output this stream to stdout and pipe it directly into ffprobe. The resultant command looks like:
```
ffmpeg -i /path/to/video -filter_complex "color=color=black[c]; [c][0:s:0]scale2ref=oh*mdar:h=in_h,overlay=shortest=1[v]" -map "[v]" -an -vcodec libx264 -f matroska - | ffprobe -v error -show_entries 'frame=best_effort_timestamp_time:frame_tags=lavfi.ocr.text' -of compact=p=0:e=c:nk=1 -f lavfi -i "movie=/dev/stdin:f=matroska,ocr=datapath=."
```

Note that the filepath given to ffprobe becomes /dev/stdin, and we additionally specify the stream format. This can then once again be piped into ocr2srt, allowing all-in-one subtitle conversion with a single command:
```
ffmpeg -i /path/to/video -filter_complex "color=color=black[c]; [c][0:s:0]scale2ref=oh*mdar:h=in_h,overlay=shortest=1[v]" -map "[v]" -an -vcodec libx264 -f matroska - | ffprobe -v error -show_entries 'frame=best_effort_timestamp_time:frame_tags=lavfi.ocr.text' -of compact=p=0:e=c:nk=1 -f lavfi -i "movie=/dev/stdin:f=matroska,ocr=datapath=." | ocr2srt > subs.srt
```

## Parsing notes

The script currently takes some liberties with how it parses and pre-processes the text from the OCR:
- It is assumed that subtitles are never going to contain a pipe character (hence why this was used as the delimiter). If a pipe character is encountered in the OCR text, it is assumed to be a misinterpreted sans-serif capital I, and is automatically replaced.
- The OCR output will often include newline characters when the text should generally be interpreted as a single line. For this reason, newline characters are replaced with spaces.
- When multiple characters speak in a single subtitle, it is common to use a "bullet point" style subtitle. In this case, each character's line is prefixed with a `-` character, and the two bullet points should be displayed on separate lines. If the first character in the OCR text is `-`, then the parser will attempt to preserve the bullet points; any newline character immediately followed by a `-` character will be preserved, and all other newline characters replaced with spaces.
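Taken together with the line format described under Basic Usage, this processing — splitting each ffprobe line at the first pipe, undoing the C-style escapes, applying the rules above, and grouping identical consecutive frames into cues — can be sketched as follows. This is an illustration of the described behaviour, not the script's actual code, and all function names are invented:

```python
import re

# Simplified C-unescape: common escape characters expand, anything else
# (e.g. \| or \\) maps to the escaped character itself.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}

def unescape(text: str) -> str:
    """Undo ffprobe's C-style escaping."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), text)

def preprocess(text: str) -> str:
    """Apply the clean-up rules described above."""
    # A pipe in OCR text is assumed to be a misread sans-serif capital I.
    text = text.replace("|", "I")
    if text.startswith("-"):
        # Bullet-point dialogue: keep newlines that begin a new "-" line,
        # replace all other newlines with spaces.
        parts = text.replace("\n-", "\x00-").split("\x00")
        return "\n".join(p.replace("\n", " ") for p in parts)
    return text.replace("\n", " ")

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def lines_to_srt(lines) -> str:
    """Convert 'timestamp|escaped text' lines into an SRT document,
    merging consecutive frames with identical text into one cue."""
    cues, current = [], None          # current = [start, end, text]
    for line in lines:
        if not line.strip():
            continue
        ts_str, _, escaped = line.rstrip("\n").partition("|")
        ts = float(ts_str)
        text = preprocess(unescape(escaped)).strip()
        if current and text == current[2]:
            current[1] = ts           # same text: extend the current cue
            continue
        if current:
            cues.append(current)      # close the finished cue
        current = [ts, ts, text] if text else None
    if current:
        cues.append(current)
    return "".join(
        f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}\n\n"
        for i, (a, b, text) in enumerate(cues, 1))
```

In this sketch a cue ends at the timestamp of the last frame on which its text was seen; a frame with empty OCR text simply closes the current cue.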