TranscribeAI is an open source project that leverages Large Language Models (LLMs) to transcribe document images with exceptional accuracy. The project originated from a need to improve OCR transcription for the Minnesota Digital Library and was initially implemented with GPT-4o mini from our organization's instance, with the goal of enhancing accessibility for screen reader users and improving the searchability of content that traditional OCR often fails to capture accurately. The current version uses a free alternative, Gemini 2.0 Flash, offering a robust, cost-effective solution.
TranscribeAI can handle various document types—including typed text, handwritten text, tables, and mixed layouts—and it significantly outperforms traditional OCR solutions (e.g., Tesseract) in accuracy. One key factor behind its improved performance is the integration of contextual and metadata information about the document. By providing additional context, the model better interprets ambiguous or degraded text, distinguishes between similar characters, and generates more coherent, complete transcriptions.
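To illustrate how context feeds into transcription, here is a rough sketch of composing a context-enriched prompt. The name `compose_prompt` matches the helper mentioned later in `transcribe_document_main.py`, but the wording and structure below are illustrative assumptions, not the project's actual prompt:

```python
# Illustrative sketch only: the prompt text here is an assumption,
# not the prompt TranscribeAI actually sends to Gemini.

def compose_prompt(document_context: str = "") -> str:
    """Build a transcription prompt, optionally enriched with context."""
    base = (
        "Transcribe all text in this image exactly as written, "
        "preserving line breaks, tables, and layout."
    )
    if document_context:
        # Context helps the model resolve ambiguous or degraded characters.
        base += (
            "\n\nBackground about this document:\n" + document_context
        )
    return base

prompt = compose_prompt("A handwritten letter discussing family news.")
```

The key idea is that the same base instruction is reused for every image, while per-document context is appended only when it is available.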
- python-dotenv – For managing environment variables.
- TQDM – Provides progress bars in the terminal.
- Pillow – A fork of the Python Imaging Library (PIL) for image processing.
- google-generativeai – Used for accessing the Gemini API.
- whisper – For speech recognition.
- opencv-python – For computer vision and image processing tasks.
```
TranscribeAI/
├── GeminiImageTranscription/
│   ├── flash_process_local_dir.py        # Gemini transcription script
│   ├── OcrDocumentContext/               # Context files for documents
│   │   ├── ALL_DOCUMENT_CONTEXT.txt      # Global context for all images
│   │   ├── handwritten1_context.txt      # Context for a specific image (handwritten1.jpeg)
│   │   └── handwritten2_context.txt      # Context for a specific image (handwritten2.jpeg)
│   └── pending_image_paths_gemini.txt    # Pending file for transcription
├── OptimizedImagesForOCR/                # Output folder for processed images
├── PreprocessPostprocess/
│   ├── preprocess_to_jpeg.py             # Preprocessing script
│   └── optimized_image_path_for_ocr.txt  # Tracking file for preprocessing
├── transcribe_document_main.py           # Orchestration script
└── requirements.txt                      # Project dependencies
```
Directory Descriptions:
- GeminiImageTranscription: Contains the transcription script and supporting files. The `OcrDocumentContext` folder holds context files that inform the transcription prompt.
- OptimizedImagesForOCR: Stores images that have been preprocessed and are ready for transcription.
- PreprocessPostprocess: Contains the preprocessing script and its tracking file, which manages the list of images to process.
- transcribe_document_main.py: Orchestrates the full pipeline from preprocessing to transcription.
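As a hypothetical sketch of what a preprocessing step like `preprocess_to_jpeg.py` might do with Pillow (the mode conversion and quality setting below are assumptions, not the project's actual parameters):

```python
# Hypothetical sketch of JPEG preprocessing with Pillow; the quality
# value and output naming are illustrative assumptions.
from pathlib import Path

from PIL import Image

def preprocess_to_jpeg(src: str, out_dir: str = "OptimizedImagesForOCR") -> str:
    """Re-encode a source image as an RGB JPEG in the output folder."""
    img = Image.open(src)
    if img.mode != "RGB":  # JPEG cannot store alpha or palette modes
        img = img.convert("RGB")
    out_path = Path(out_dir) / (Path(src).stem + ".jpeg")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    img.save(out_path, "JPEG", quality=90)
    return str(out_path)
```

Converting everything to a single format up front keeps the transcription stage simple: the Gemini script only ever sees RGB JPEGs in `OptimizedImagesForOCR`.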
Follow these steps to set up and run TranscribeAI on your local machine.
Clone the repository:

```
git clone https://github.com/Minitex/TranscribeAI.git
cd TranscribeAI
```

Create and activate a Python virtual environment:

```
python -m venv myenv
```

- On macOS/Linux:

  ```
  source myenv/bin/activate
  ```

- On Windows:

  ```
  myenv\Scripts\activate
  ```

Install the required packages:

```
pip install -r requirements.txt
```
- Follow the instructions at the Gemini API Key Docs to obtain your Gemini API key.
- Create a local environment file by copying the example:

  ```
  cp .env.example .env
  ```

- Open the newly created `.env` file and replace `your-api-key` with your actual Google API key:

  ```
  GOOGLE_API_KEY=your_actual_api_key_here
  ```
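At runtime, the key in `.env` is typically loaded via python-dotenv (already in `requirements.txt`). A minimal sketch, with the validation helper `read_api_key` being an illustrative assumption rather than the project's code:

```python
import os

# python-dotenv normally populates os.environ from the .env file; the
# import is guarded so this sketch also runs where the package is absent.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def read_api_key(env=os.environ) -> str:
    """Return the configured key, rejecting the unedited placeholder."""
    key = env.get("GOOGLE_API_KEY", "")
    if not key or key == "your_actual_api_key_here":
        raise RuntimeError("GOOGLE_API_KEY is not set; edit your .env file")
    return key
```

Failing fast on a missing or placeholder key gives a clearer error than letting the Gemini client reject the request later.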
- Source Images:
  Place your source images (handwritten or typed documents) in a folder (e.g., `/path/to/input_images`).
- Tracking File (Preprocessing):
  The orchestration script automatically scans your input folder and populates the tracking file (`PreprocessPostprocess/optimized_image_path_for_ocr.txt`) with the full paths of the images. Running the orchestration script with the `--new` flag repopulates the tracking file.
- Context Files (Optional):
  - To apply a global context to all images, place a file named `ALL_DOCUMENT_CONTEXT.txt` in the `GeminiImageTranscription/OcrDocumentContext` folder.
  - To provide individual context for specific images, create files named `<image_basename>_context.txt` (e.g., `handwritten1_context.txt`, `handwritten2_context.txt`) in the same folder.
  - When processing, the script logs which context is applied for each image.
Run the orchestration script to execute the entire pipeline:
```
python3 transcribe_document_main.py /path/to/input_images --new
```

Example Walkthrough:
- Source Images:
  You have a folder `/Users/Minitex/Downloads/TestImages` containing images that you want to transcribe.
- Run the Pipeline:
  Execute the following command:

  ```
  python3 transcribe_document_main.py /Users/Minitex/Downloads/TestImages --new
  ```

  - The script scans `/Users/Minitex/Downloads/TestImages` and populates the tracking file `PreprocessPostprocess/optimized_image_path_for_ocr.txt` with the full paths of these images.
  - If a global context file exists at `GeminiImageTranscription/OcrDocumentContext/ALL_DOCUMENT_CONTEXT.txt`, its content is applied to all images.
  - If individual context files exist (e.g., `handwritten1_context.txt`), they are appended to the global context for the corresponding image.
  - Preprocessing converts the images and saves the outputs in the `OptimizedImagesForOCR` folder.
  - Processed image paths are then gathered into the pending file for transcription, `GeminiImageTranscription/pending_image_paths_gemini.txt`.
  - Finally, the Gemini transcription script is run, and upon successful transcription, all tracking files are removed.
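The tracking-file step above can be sketched roughly as follows; the function name, extension list, and scanning logic are illustrative assumptions, not the orchestration script's actual code:

```python
# Illustrative sketch of populating the preprocessing tracking file;
# the extension filter and helper name are assumptions.
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def populate_tracking_file(input_dir: str, tracking_file: str) -> int:
    """Write the full path of every image in input_dir, one per line."""
    paths = sorted(
        str(p.resolve())
        for p in Path(input_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    )
    Path(tracking_file).write_text("\n".join(paths) + "\n" if paths else "")
    return len(paths)
```

Keeping the pending work as a plain text file of paths makes the pipeline easy to inspect and to resume after a partial failure.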
- Transcriptions:
  Processed transcriptions are saved as `.txt` files in the `OptimizedImagesForOCR` folder.
- Logs:
  The terminal logs (via `tqdm.write`) provide details on which context (global, individual, or both) was applied to each image, along with any errors or retry information.
- Transcription Prompt:
  You can modify the prompt in the `compose_prompt` function within the orchestration script (`transcribe_document_main.py`) to suit your needs.
- Retry Settings:
  Adjust the `MAX_RETRIES` and `RETRY_DELAY` constants in the scripts to customize the retry behavior.
- Context Files:
  - Global Context:
    To apply the same context to every document, create a file named `ALL_DOCUMENT_CONTEXT.txt` in the `GeminiImageTranscription/OcrDocumentContext` folder. For example, if your documents share a common background, such as "a handwritten letter discussing family news and travel," you can include that description in this file. The content of this file will be applied to all images during transcription.
  - Individual Context:
    For documents that require unique context, create separate files for each image. Each file should be named using the image's basename with `_context.txt` appended. For example:
    - If you have an image named `handwritten1.jpeg`, create a file named `handwritten1_context.txt` with context specific to that document (e.g., "a personal letter written by Nellie McCluer to Mrs. Osborne discussing Easter celebrations").
    - If you have an image named `invoice2022.png`, create a file named `invoice2022_context.txt` with context relevant to that invoice (e.g., "an invoice detailing services rendered in Q1 2022").

  When processing, TranscribeAI will first check for a global context file. If it exists, its content is applied to all images. Then, for each image, the script checks for an individual context file that matches the image's basename. If one is found, its content is appended to the global context for that image, providing tailored transcription instructions for images that need additional context.
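That lookup order (global context first, then any per-image context appended) can be sketched as follows; the folder layout matches the repository, but the helper itself is an illustrative assumption:

```python
# Sketch of the context lookup order: global context first, then any
# per-image context appended. The helper name is illustrative.
from pathlib import Path

CONTEXT_DIR = Path("GeminiImageTranscription/OcrDocumentContext")
GLOBAL_CONTEXT = "ALL_DOCUMENT_CONTEXT.txt"

def gather_context(image_path: str, context_dir: Path = CONTEXT_DIR) -> str:
    """Combine global and per-image context for one image, if present."""
    parts = []
    global_file = context_dir / GLOBAL_CONTEXT
    if global_file.exists():
        parts.append(global_file.read_text().strip())
    stem = Path(image_path).stem  # "handwritten1" for handwritten1.jpeg
    individual = context_dir / f"{stem}_context.txt"
    if individual.exists():
        parts.append(individual.read_text().strip())
    return "\n\n".join(parts)
```

Because the per-image file is keyed purely on the image's basename, adding context for a new document never requires touching the scripts, only dropping a text file into the context folder.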
Contributions are welcome! Please open issues or submit pull requests with improvements or fixes.