_______ _ _ _____ _ _____
|__ __| (_) | / ____| | |_ _|
| |_ __ __ _ _ __ ___ ___ _ __ _| |__ ___ _ __ ______| | | | | |
| | '__/ _` | '_ \/ __|/ __| '__| | '_ \ / _ \ '__|______| | | | | |
| | | | (_| | | | \__ \ (__| | | | |_) | __/ | | |____| |____ _| |_
|_|_| \__,_|_| |_|___/\___|_| |_|_.__/ \___|_| \_____|______|_____|
A command-line interface tool for transcribing herbarium label details from images using AWS Bedrock AI models.
Transcriber CLI is designed to process and transcribe text from herbarium specimen images:
- First Shot: Processes full images to extract label information
- Second Shot: Processes each image a second time, passing the first-shot results together with the image for another pass
- Python 3.x
- AWS account with Bedrock access
- Properly configured AWS credentials (for example, via the AWS CLI)
- Clone this repository:
  git clone https://github.com/rherbst123/Transcriber-CLI-V2
  cd Transcriber_CLI
- Create a virtual environment:
  python3 -m venv "Whatever you want to call the venv"
- Install required dependencies:
  pip install -r requirements.txt
Run the main script:
python TranscribeCLI.py
The tool will:
- Ask you to name the run
- Ask how many runs to do
- Ask you to choose a model for the first shot
- Let the run complete
- Ask you to choose a model for the second shot (if chosen)
- Output a completed .csv file to your desktop with a cost analysis and raw .json files
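The two-shot flow described above can be sketched roughly as follows. This is only an illustration, not the tool's actual code: `transcribe_two_shot` and the `ModelCall` signature are hypothetical names, and the model calls are passed in as plain functions so the sketch runs without AWS access.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: a model call takes a prompt and image bytes and
# returns the transcription text. The real tool calls AWS Bedrock instead.
ModelCall = Callable[[str, bytes], str]


def transcribe_two_shot(
    image: bytes,
    prompt: str,
    first_model: ModelCall,
    second_model: Optional[ModelCall] = None,
) -> Dict[str, Optional[str]]:
    """Run the first shot, then an optional second shot that also sees the first result."""
    first = first_model(prompt, image)
    result: Dict[str, Optional[str]] = {"first_shot": first, "second_shot": None}

    if second_model is not None:
        # Second shot: same image, plus the first-shot output appended to the prompt.
        second_prompt = f"{prompt}\n\nPrevious transcription:\n{first}"
        result["second_shot"] = second_model(second_prompt, image)

    return result
```

Passing the models in as callables keeps the sketch independent of any particular Bedrock client; the second shot is simply skipped when no second model is given.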
The tool includes comprehensive validation capabilities that can be configured in the settings menu:
- Validates scientific names against the Global Names Validator on Tropicos
- Flags potential issues with taxonomic nomenclature
- Provides suggested corrections when available
- Searches the Bryophyte Portal for existing records by catalog number
- Identifies potential duplicate specimens in the database
- Adds detailed duplicate information to CSV output
- Searches for existing entries based on collector, collection date, and collection ID
- Helps identify specimens that may already be in the database under different catalog numbers
- Useful for detecting field collection duplicates
All validation features can be enabled/disabled in the configuration menu and run automatically after transcription.
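As a rough illustration of the field-duplicate idea (matching on collector, collection date, and collection ID), here is a minimal local sketch. The record keys (`collector`, `date`, `collection_id`, `catalog_number`) and the function names are assumptions made for this example; the tool itself searches the portal rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def find_field_duplicates(records: List[dict]) -> Dict[Tuple[str, str, str], List[str]]:
    """Group catalog numbers that share collector, date, and collection ID."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for rec in records:
        # Hypothetical field names; adjust to the actual CSV columns.
        key = (
            normalize(rec["collector"]),
            normalize(rec["date"]),
            normalize(rec["collection_id"]),
        )
        groups[key].append(rec["catalog_number"])
    # Keep only keys that appear under more than one catalog number.
    return {key: cats for key, cats in groups.items() if len(cats) > 1}
```

Specimens that share all three fields but carry different catalog numbers end up in the same group, which is the situation the field-duplicate check is meant to surface.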
- Scientific Name Validation (done with the Global Names Validator on Tropicos)
- Search in Portal for Duplicate Catalog numbers
- Search in Portal for Entries on "Collector, record Number and Date"
- Automatic Segmentation for all images done before transcription
- Validation using GBIF, iDigBio and Symbiota
- Enhanced genus/species validation
The tool supports multiple AWS Bedrock models:
- Claude 3 Sonnet
- Claude 3.7 Sonnet
- Claude 4 Sonnet
- Claude 4.5 Sonnet
- Claude 4 Opus
- Claude 4.1 Opus
- Llama 3.2 90B
- Llama 4 17B
- Amazon Nova Lite, Pro, and Premier
- Mistral Pixtral Large
Transcription results are saved in:
- Desktop/Finished_Transcriptions_"User Entered Run Name"
- Contents include:
  - Segmented images
  - First and second shot raw data
  - First and second shot .csv files
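For scripts that need to pick up the results, the output folder can be located along these lines. This is a sketch that assumes the Desktop folder sits directly under the user's home directory; `output_dir_for_run` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path
from typing import Optional


def output_dir_for_run(run_name: str, desktop: Optional[Path] = None) -> Path:
    """Build the expected results folder path for a given run name."""
    # Assumption: the Desktop lives at ~/Desktop unless a path is passed in.
    base = desktop if desktop is not None else Path.home() / "Desktop"
    return base / f"Finished_Transcriptions_{run_name}"


if __name__ == "__main__":
    print(output_dir_for_run("MyRun"))
```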
The tool uses specialized prompts for herbarium label transcription, located in the Prompts/ directory. The default prompt (Prompt_1.5.3.txt) is designed to extract detailed information from herbarium labels following specific formatting rules.
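Reading a prompt file from that directory might look like the sketch below. The `load_prompt` function is a hypothetical helper written for illustration; only the Prompts/ directory and the Prompt_1.5.3.txt file name come from this README.

```python
from pathlib import Path

DEFAULT_PROMPT = "Prompt_1.5.3.txt"


def load_prompt(prompts_dir: str = "Prompts", name: str = DEFAULT_PROMPT) -> str:
    """Return the text of a prompt file, failing clearly if it is missing."""
    path = Path(prompts_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Prompt file not found: {path}")
    return path.read_text(encoding="utf-8")
```

Pointing `name` at a different file in the same directory is enough to try out an alternative prompt.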
- Modify prompts in the Prompts/ directory to adjust transcription behavior
- Add or remove models in the AVAILABLE_MODELS list in each transcriber module (this list is updated frequently, so you usually don't need to)
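To give a feel for editing the model list, here is a hedged sketch. The entry structure (`name`, `model_id`) and the helper functions are assumptions, and the model IDs are placeholders; check the actual AVAILABLE_MODELS list in the transcriber modules and the Bedrock console for the real values.

```python
from typing import Dict, List

# Hypothetical structure; the real AVAILABLE_MODELS list may be shaped differently.
# The model_id values are placeholders, not real Bedrock identifiers.
AVAILABLE_MODELS: List[Dict[str, str]] = [
    {"name": "Claude 3.7 Sonnet", "model_id": "<bedrock-model-id-for-claude-3.7-sonnet>"},
    {"name": "Amazon Nova Pro", "model_id": "<bedrock-model-id-for-nova-pro>"},
]


def add_model(models: List[Dict[str, str]], name: str, model_id: str) -> List[Dict[str, str]]:
    """Return a new list with the model appended; reject duplicate names."""
    if any(m["name"] == name for m in models):
        raise ValueError(f"Model already listed: {name}")
    return models + [{"name": name, "model_id": model_id}]


def remove_model(models: List[Dict[str, str]], name: str) -> List[Dict[str, str]]:
    """Return a new list without the named model."""
    return [m for m in models if m["name"] != name]
```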
Created by Riley Herbst for the Field Museum, with many thanks to the following: Matt Von Konrat, Jeff Gwilliam, Dan Stille