An intelligent Streamlit application that generates enhanced metadata for ContentDM digital collections using AI technologies including image captioning, OCR, and Named Entity Recognition with linked data.
- Integrated ContentDM Browser: 80/20 split layout with embedded ContentDM website
- Automatic Item Detection: Monitors URL changes to detect item detail pages (`/id/` pattern)
- AI-Powered Processing: Generates descriptions, extracts text, and identifies entities
- Linked Data Integration: Connects entities to Wikidata and DBpedia URIs
- Data Export System: Creates CSV files and JSON data packages following Frictionless Data standards
- Image Captioning: Uses BLIP model for automatic image description
- OCR Text Extraction: Tesseract-based text extraction from images
- Named Entity Recognition: spaCy NER with Wikidata/DBpedia linking
- Dublin Core Enhancement: Generates enhanced DC metadata fields
- Individual CSV files per processed item (named with record ID + timestamp)
- Collection-based folder organization
- JSON data package standard compliance
- ZIP export for collections with metadata and documentation
- Batch processing capabilities with progress tracking
- Python 3.9+
- Docker (optional but recommended)
- Tesseract OCR
- Git
- Clone the repository

  ```bash
  git clone https://github.com/your-username/contentdm-ai-generator.git
  cd contentdm-ai-generator
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Download spaCy model

  ```bash
  python -m spacy download en_core_web_sm
  ```
- Configure application

  ```bash
  cp config.example.yaml config.yaml
  # Edit config.yaml with your settings
  ```

- Run the application

  ```bash
  streamlit run app.py
  ```
- Clone and configure

  ```bash
  git clone https://github.com/your-username/contentdm-ai-generator.git
  cd contentdm-ai-generator
  cp config.example.yaml config.yaml
  ```

- Build and run with Docker Compose

  ```bash
  docker-compose up -d
  ```

- Access the application
  - Open http://localhost:8501 in your browser
For production deployment with nginx reverse proxy:

```bash
docker-compose --profile production up -d
```

Example `config.yaml`:

```yaml
contentdm:
  base_url: "https://vu.contentdm.oclc.org"
  api_endpoint: "/digital/bl/dmwebservices/index.php"
  default_collection: "vko"
  timeout: 30

ai_models:
  image_captioning:
    model_name: "Salesforce/blip-image-captioning-base"
    device: "auto"  # auto, cpu, cuda
  ner:
    model: "en_core_web_sm"
    enable_wikidata: true
    enable_dbpedia: true
    confidence_threshold: 0.7

export:
  output_dir: "outputs"
  csv_encoding: "utf-8"
  zip_compression: true
```

Key environment variables for Docker deployment:

- `CONTENTDM_BASE_URL`: Override ContentDM base URL
- `AI_DEVICE`: Force CPU/GPU usage (`cpu`, `cuda`, `auto`)
- `OUTPUT_DIR`: Custom output directory path
- `LOG_LEVEL`: Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
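The override behavior can be sketched as follows. This is illustrative only: the defaults and key names mirror the variables listed above, but the application's actual configuration loader may differ.

```python
import os

# Defaults mirroring config.yaml; the override logic here is a minimal
# sketch of how the documented environment variables could take
# precedence over file-based settings.
DEFAULTS = {
    "contentdm_base_url": "https://vu.contentdm.oclc.org",
    "ai_device": "auto",
    "output_dir": "outputs",
    "log_level": "INFO",
}

ENV_MAP = {
    "CONTENTDM_BASE_URL": "contentdm_base_url",
    "AI_DEVICE": "ai_device",
    "OUTPUT_DIR": "output_dir",
    "LOG_LEVEL": "log_level",
}

def resolve_config(environ=os.environ):
    """Return configuration with environment variables overriding defaults."""
    config = dict(DEFAULTS)
    for env_name, key in ENV_MAP.items():
        if env_name in environ:
            config[key] = environ[env_name]
    return config
```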
- Start the application and navigate to the ContentDM browser
- Browse to an item detail page (URL containing `/id/`)
- Load AI models using the "Load AI Models" button
- Generate metadata with "Auto Generate Additional Metadata"
- Review results in the expandable sections
- Save or export the enhanced metadata
- Manual URL Entry: Enter ContentDM URLs directly
- Quick Navigation: Use browse, search, and home buttons
- History: Access recently visited items
The application generates:
- Object Description: AI-generated image caption
- Text Transcription: OCR-extracted text content
- Named Entities: People, places, organizations with confidence scores
- Linked Data URIs: Wikidata and DBpedia links
- Enhanced Dublin Core: Improved DC metadata fields
- Single Item Export
  - CSV file with combined metadata
  - JSON data package with schema
  - README documentation
- Collection Export
  - Individual CSV files per item
  - Combined collection CSV
  - ZIP package with metadata
For processing entire collections:
- Navigate to any item in the target collection
- Click "Process Collection" to start batch processing
- Monitor progress in the processing log
- Use "Export All" to create collection package
- `dmGetItemInfo`: Fetch item metadata
- `dmGetImageInfo`: Get image technical metadata
- `dmGetFile`: Download image files
- `dmQuery`: Search and browse collections
- `dmGetCollectionInfo`: Collection metadata
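A `dmGetItemInfo` call can be sketched like this, assuming the `q=function/alias/pointer/format` query style of the ContentDM dmwebservices API and the base URL from the example configuration:

```python
import json
from urllib.request import urlopen

BASE_URL = "https://vu.contentdm.oclc.org"           # from config.yaml
API_ENDPOINT = "/digital/bl/dmwebservices/index.php"

def item_info_url(alias: str, pointer: int, fmt: str = "json") -> str:
    """Build a dmGetItemInfo request URL in dmwebservices style."""
    return f"{BASE_URL}{API_ENDPOINT}?q=dmGetItemInfo/{alias}/{pointer}/{fmt}"

def fetch_item_info(alias: str, pointer: int) -> dict:
    """Fetch and decode item metadata (requires network access)."""
    with urlopen(item_info_url(alias, pointer), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```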
The application detects ContentDM item pages from URLs like:

- `https://site.contentdm.oclc.org/digital/collection/alias/id/123`
- `https://site.contentdm.oclc.org/alias/id/123`
- Query parameter formats with `collection=` and `id=`
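The detection logic for the URL patterns above can be sketched as follows; the function name and return shape are illustrative, not the application's actual API:

```python
import re
from urllib.parse import parse_qs, urlparse

ITEM_PATH = re.compile(r"/id/(\d+)")

def detect_item(url: str):
    """Return (collection, item_id) for an item-page URL, else None."""
    parsed = urlparse(url)
    m = ITEM_PATH.search(parsed.path)
    if m:
        # Path styles: .../digital/collection/<alias>/id/<n> or /<alias>/id/<n>
        parts = [p for p in parsed.path.split("/") if p]
        idx = parts.index("id")
        alias = parts[idx - 1] if idx > 0 else None
        return alias, m.group(1)
    # Query parameter style: ?collection=<alias>&id=<n>
    qs = parse_qs(parsed.query)
    if "collection" in qs and "id" in qs:
        return qs["collection"][0], qs["id"][0]
    return None
```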
- Model: Salesforce BLIP (Bootstrapping Language-Image Pre-training)
- Purpose: Generate descriptive captions for images
- Output: Natural language descriptions of visual content
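Captioning with BLIP might look like the sketch below, using the Hugging Face `transformers` BLIP classes and the model name from the configuration; the helper names are illustrative, and the heavy imports are deferred so the pure device-resolution logic stands alone:

```python
def resolve_device(preference: str, cuda_available: bool) -> str:
    """Map the config's device setting ('auto', 'cpu', 'cuda') to a device."""
    if preference == "auto":
        return "cuda" if cuda_available else "cpu"
    return preference

def caption_image(image_path: str, device_preference: str = "auto") -> str:
    """Generate a caption with BLIP (requires torch, transformers, Pillow)."""
    import torch
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    device = resolve_device(device_preference, torch.cuda.is_available())
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to(device)
    inputs = processor(Image.open(image_path).convert("RGB"), return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```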
- Engine: Tesseract OCR
- Languages: Configurable (default: English)
- Preprocessing: Denoising, contrast enhancement, binarization
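The preprocessing pipeline could be sketched like this with Pillow and `pytesseract`; the exact filters and threshold are illustrative assumptions, and `binarize` shows the thresholding step on raw grayscale values:

```python
def binarize(pixels, threshold=128):
    """Global threshold: map grayscale values to 0 (black) or 255 (white)."""
    return [255 if p >= threshold else 0 for p in pixels]

def extract_text(image_path: str, lang: str = "eng") -> str:
    """OCR sketch (requires Pillow and pytesseract; Tesseract must be installed)."""
    from PIL import Image, ImageFilter, ImageOps
    import pytesseract

    img = Image.open(image_path).convert("L")          # grayscale
    img = ImageOps.autocontrast(img)                   # contrast enhancement
    img = img.filter(ImageFilter.MedianFilter(3))      # light denoising
    img = img.point(lambda p: 255 if p >= 128 else 0)  # binarization
    return pytesseract.image_to_string(img, lang=lang)
```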
- Model: spaCy English model (`en_core_web_sm`)
- Entities: PERSON, ORG, GPE, LOC, DATE, etc.
- Linking: Automatic linking to Wikidata and DBpedia
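Entity extraction and confidence filtering might be sketched as below; `extract_entities` uses the standard spaCy API with the configured model, while `filter_entities` shows how the `confidence_threshold` setting could be applied to linked entities (the dict shape is an assumption):

```python
def filter_entities(entities, threshold=0.7):
    """Keep only linked entities whose confidence meets the threshold."""
    return [e for e in entities if e.get("confidence", 0.0) >= threshold]

def extract_entities(text: str):
    """spaCy NER sketch (requires spacy and the en_core_web_sm model)."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    return [{"text": ent.text, "label": ent.label_} for ent in nlp(text).ents]
```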
- GPU Support: Automatic CUDA detection for image processing
- Caching: Model and result caching for improved performance
- Batch Processing: Efficient processing of multiple items
Each processed item generates a CSV with these field categories:
- `original_*`: Fields from ContentDM API
- `ai_*`: AI-generated content and analysis
- `dc_*`: Enhanced Dublin Core fields
- `processing_*`: Metadata about the processing
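Assembling one CSV row from the three metadata sources could look like this; the function name and field choices are illustrative of the prefixing scheme, not the application's actual export code:

```python
from datetime import datetime, timezone

def build_row(original: dict, ai: dict, dc: dict) -> dict:
    """Flatten the three metadata sources into one prefixed CSV row."""
    row = {f"original_{k}": v for k, v in original.items()}
    row.update({f"ai_{k}": v for k, v in ai.items()})
    row.update({f"dc_{k}": v for k, v in dc.items()})
    # Processing metadata: when this row was generated
    row["processing_timestamp"] = datetime.now(timezone.utc).isoformat()
    return row
```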
Follows the Frictionless Data standard:
- `datapackage.json`: Package descriptor with schema
- Table schema definitions for CSV files
- Resource metadata and relationships
- Contributor and source information
- Wikidata URIs: `http://www.wikidata.org/entity/Q*`
- DBpedia URIs: `http://dbpedia.org/resource/*`
- SPARQL Queries: Automatic entity resolution
- Confidence Scoring: Quality metrics for linked entities
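An entity-resolution query against the public Wikidata SPARQL endpoint could be sketched as follows; the exact-label matching strategy and function names are illustrative assumptions, not the application's actual resolver:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def label_query(label: str, lang: str = "en", limit: int = 5) -> str:
    """Build a SPARQL query matching entities by exact label."""
    safe = label.replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item rdfs:label "{safe}"@{lang} . '
        f"}} LIMIT {limit}"
    )

def resolve_entity(label: str):
    """Query the Wikidata SPARQL endpoint (requires network access)."""
    url = WIKIDATA_SPARQL + "?" + urlencode(
        {"query": label_query(label), "format": "json"}
    )
    req = Request(url, headers={"User-Agent": "contentdm-ai-generator-example"})
    with urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return [b["item"]["value"] for b in data["results"]["bindings"]]
```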
- AI Models Won't Load

  ```bash
  # Check available memory and disk space
  df -h
  free -h
  # For CUDA issues:
  nvidia-smi
  ```

- ContentDM API Timeouts

  ```yaml
  # Increase timeout in config.yaml
  contentdm:
    timeout: 60
    max_retries: 5
  ```

- OCR Not Working

  ```bash
  # Install/reinstall Tesseract
  sudo apt-get install tesseract-ocr tesseract-ocr-eng
  ```

- Permission Errors

  ```bash
  # Fix output directory permissions
  sudo chown -R $USER:$USER outputs/
  chmod -R 755 outputs/
  ```
- Memory Usage: Use CPU-only mode for limited RAM: `device: "cpu"`
- Processing Speed: Reduce batch size: `batch_size: 5`
- Storage: Enable cleanup of old files by setting retention policies
Check application logs for detailed error information:

```bash
tail -f contentdm_ai.log
```

For Docker deployments:

```bash
docker-compose logs -f contentdm-ai
```

- All processing occurs locally or in your controlled environment
- No data sent to external AI services
- ContentDM API calls only to configured endpoints
- Configure firewall rules for production deployment
- Use HTTPS in production (nginx configuration provided)
- Implement authentication if needed (external auth proxy)
- iframe restrictions apply to ContentDM embedding
- CORS and XSRF protections configurable
- Input validation for all user-provided URLs
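Validation of user-provided URLs could be sketched like this, assuming an allow-list built from the configured `base_url`; the host set and function name are illustrative:

```python
from urllib.parse import urlparse

# Hosts derived from contentdm.base_url in config.yaml (illustrative)
ALLOWED_HOSTS = {"vu.contentdm.oclc.org"}

def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs pointing at a configured ContentDM host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS
```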
- Fork the repository and clone your fork
- Create development branch: `git checkout -b feature/your-feature`
- Install dev dependencies: `pip install -r requirements-dev.txt`
- Make changes and test thoroughly
- Submit pull request with detailed description
- Follow PEP 8 Python style guidelines
- Add type hints for new functions
- Include docstrings for public methods
- Write tests for new functionality
```bash
# Run tests
python -m pytest tests/

# Run with coverage
python -m pytest --cov=src tests/
```

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this software in research or academic work, please cite:
```bibtex
@software{contentdm_ai_generator,
  title  = {ContentDM AI Metadata Generator},
  author = {Vanderfeesten, Maurice},
  year   = {2024},
  url    = {https://github.com/your-username/contentdm-ai-generator},
  doi    = {10.5281/zenodo.XXXXXX}
}
```

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your-email@domain.com
For institutional deployments, training, or custom development:
- Contact: Maurice Vanderfeesten (ORCID: 0000-0001-6397-4759)
- Commercial support and consulting available
ContentDM AI Metadata Generator - Enhancing digital cultural heritage with artificial intelligence.