ubvu/contentdm_ai_metadata_generator

ContentDM AI Metadata Generator

An intelligent Streamlit application that generates enhanced metadata for ContentDM digital collections using AI technologies including image captioning, OCR, and Named Entity Recognition with linked data.

🚀 Features

Core Functionality

  • Integrated ContentDM Browser: 80/20 split layout with embedded ContentDM website
  • Automatic Item Detection: Monitors URL changes to detect item detail pages (/id/ pattern)
  • AI-Powered Processing: Generates descriptions, extracts text, and identifies entities
  • Linked Data Integration: Connects entities to Wikidata and DBpedia URIs
  • Data Export System: Creates CSV files and JSON data packages following Frictionless Data standards

AI Processing Pipeline

  1. Image Captioning: Uses BLIP model for automatic image description
  2. OCR Text Extraction: Tesseract-based text extraction from images
  3. Named Entity Recognition: spaCy NER with Wikidata/DBpedia linking
  4. Dublin Core Enhancement: Generates enhanced DC metadata fields
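How the four stages chain together can be sketched as follows. This is a minimal illustration, not the application's actual code: the `ItemMetadata` class, the `run_pipeline` function, the stage signatures, and the Dublin Core mapping are all hypothetical, and running NER over the caption plus the OCR text is an assumption. The stages are passed in as plain callables so the flow can be tried with stubs before BLIP, Tesseract, or spaCy are loaded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signatures; the real app would wire BLIP, Tesseract and spaCy in here.
Captioner = Callable[[bytes], str]
OcrEngine = Callable[[bytes], str]
EntityFinder = Callable[[str], List[Tuple[str, str]]]  # (text, label) pairs


@dataclass
class ItemMetadata:
    caption: str = ""
    ocr_text: str = ""
    entities: List[Tuple[str, str]] = field(default_factory=list)
    dublin_core: Dict[str, str] = field(default_factory=dict)


def run_pipeline(image: bytes, caption: Captioner, ocr: OcrEngine,
                 find_entities: EntityFinder) -> ItemMetadata:
    """Run the four stages in order, feeding earlier results into later ones."""
    result = ItemMetadata()
    result.caption = caption(image)                 # 1. image captioning
    result.ocr_text = ocr(image)                    # 2. OCR text extraction
    combined = f"{result.caption}\n{result.ocr_text}"
    result.entities = find_entities(combined)       # 3. NER (assumed input: caption + OCR text)
    # 4. Dublin Core enhancement (illustrative mapping only)
    subjects = sorted({text for text, _label in result.entities})
    result.dublin_core = {
        "dc_description": result.caption,
        "dc_subject": "; ".join(subjects),
    }
    return result


if __name__ == "__main__":
    # Stub stages stand in for the real models.
    meta = run_pipeline(
        b"fake-image-bytes",
        caption=lambda img: "a group portrait in front of a building",
        ocr=lambda img: "Vrije Universiteit Amsterdam 1930",
        find_entities=lambda text: [("Amsterdam", "GPE"), ("Vrije Universiteit", "ORG")],
    )
    print(meta.dublin_core)
```

Keeping the stages injectable also makes it straightforward to unit-test the orchestration without downloading any model weights.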

Data Management

  • Individual CSV files per processed item (named with record ID + timestamp)
  • Collection-based folder organization
  • JSON data package standard compliance
  • ZIP export for collections with metadata and documentation
  • Batch processing capabilities with progress tracking

📦 Installation

Prerequisites

  • Python 3.9+
  • Docker (optional but recommended)
  • Tesseract OCR
  • Git

Local Installation

  1. Clone the repository

    git clone https://github.com/ubvu/contentdm_ai_metadata_generator.git
    cd contentdm_ai_metadata_generator
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download spaCy model

    python -m spacy download en_core_web_sm
  5. Configure application

    cp config.example.yaml config.yaml
    # Edit config.yaml with your settings
  6. Run the application

    streamlit run app.py

Docker Installation

  1. Clone and configure

    git clone https://github.com/ubvu/contentdm_ai_metadata_generator.git
    cd contentdm_ai_metadata_generator
    cp config.example.yaml config.yaml
  2. Build and run with Docker Compose

    docker-compose up -d
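  3. Access the application in your browser. Streamlit serves on port 8501 by default (http://localhost:8501), unless your Compose file maps a different port.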

Production Deployment

For production deployment with nginx reverse proxy:

docker-compose --profile production up -d

⚙️ Configuration

Basic Configuration (config.yaml)

contentdm:
  base_url: "https://vu.contentdm.oclc.org"
  api_endpoint: "/digital/bl/dmwebservices/index.php"
  default_collection: "vko"
  timeout: 30

ai_models:
  image_captioning:
    model_name: "Salesforce/blip-image-captioning-base"
    device: "auto"  # auto, cpu, cuda
  
  ner:
    model: "en_core_web_sm"
    enable_wikidata: true
    enable_dbpedia: true
    confidence_threshold: 0.7

export:
  output_dir: "outputs"
  csv_encoding: "utf-8"
  zip_compression: true

Environment Variables

Key environment variables for Docker deployment:

  • CONTENTDM_BASE_URL: Override ContentDM base URL
  • AI_DEVICE: Force CPU/GPU usage (cpu, cuda, auto)
  • OUTPUT_DIR: Custom output directory path
  • LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR)
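One way the environment variables could take precedence over config.yaml is sketched below. The `ENV_OVERRIDES` mapping and the `apply_env_overrides` helper are hypothetical, and which config key each variable overrides is an assumption inferred from the names above. The config is shown as an already-parsed dictionary so the sketch has no YAML dependency.

```python
import copy
import os
from typing import Any, Dict, Mapping

# Assumed mapping from environment variable to the config.yaml key path it overrides.
ENV_OVERRIDES = {
    "CONTENTDM_BASE_URL": ("contentdm", "base_url"),
    "AI_DEVICE": ("ai_models", "image_captioning", "device"),
    "OUTPUT_DIR": ("export", "output_dir"),
}


def apply_env_overrides(config: Mapping[str, Any],
                        env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return a copy of config with any set environment variables applied."""
    merged = copy.deepcopy(dict(config))
    for var, path in ENV_OVERRIDES.items():
        value = env.get(var)
        if not value:
            continue  # unset or empty: keep the value from the file
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return merged


if __name__ == "__main__":
    file_config = {
        "contentdm": {"base_url": "https://vu.contentdm.oclc.org"},
        "ai_models": {"image_captioning": {"device": "auto"}},
        "export": {"output_dir": "outputs"},
    }
    print(apply_env_overrides(file_config, {"AI_DEVICE": "cpu"}))
```

Returning a copy rather than mutating the loaded config keeps the file values available for comparison or logging.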

📖 Usage Guide

Basic Workflow

  1. Start the application and navigate to the ContentDM browser
  2. Browse to an item detail page (URL containing /id/)
  3. Load AI models using the "Load AI Models" button
  4. Generate metadata with "Auto Generate Additional Metadata"
  5. Review results in the expandable sections
  6. Save or export the enhanced metadata

Navigation Options

  • Manual URL Entry: Enter ContentDM URLs directly
  • Quick Navigation: Use browse, search, and home buttons
  • History: Access recently visited items

Processing Results

The application generates:

  • Object Description: AI-generated image caption
  • Text Transcription: OCR-extracted text content
  • Named Entities: People, places, organizations with confidence scores
  • Linked Data URIs: Wikidata and DBpedia links
  • Enhanced Dublin Core: Improved DC metadata fields

Data Export Options

  1. Single Item Export

    • CSV file with combined metadata
    • JSON data package with schema
    • README documentation
  2. Collection Export

    • Individual CSV files per item
    • Combined collection CSV
    • ZIP package with metadata

Batch Processing

For processing entire collections:

  1. Navigate to any item in the target collection
  2. Click "Process Collection" to start batch processing
  3. Monitor progress in the processing log
  4. Use "Export All" to create collection package

🔧 API Integration

ContentDM API Endpoints Used

  • dmGetItemInfo: Fetch item metadata
  • dmGetImageInfo: Get image technical metadata
  • dmGetFile: Download image files
  • dmQuery: Search and browse collections
  • dmGetCollectionInfo: Collection metadata
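As a rough illustration of how such calls are addressed, the sketch below composes request URLs for two of the functions listed above. It assumes the usual dmwebservices query-string layout (`?q=<function>/<alias>/<pointer>/<format>`) and reuses the `base_url` and `api_endpoint` values from the sample config.yaml; the helper names are hypothetical, and the layout should be checked against your own ContentDM server before relying on it.

```python
from urllib.parse import quote

BASE_URL = "https://vu.contentdm.oclc.org"            # base_url from config.yaml
API_ENDPOINT = "/digital/bl/dmwebservices/index.php"   # api_endpoint from config.yaml


def build_api_url(function: str, *args: object, fmt: str = "json") -> str:
    """Compose a dmwebservices URL of the form ?q=<function>/<arg>/.../<format>."""
    parts = [function, *(quote(str(arg), safe="") for arg in args), fmt]
    return f"{BASE_URL}{API_ENDPOINT}?q=" + "/".join(parts)


def item_info_url(alias: str, pointer: int) -> str:
    """URL for the descriptive metadata of one item."""
    return build_api_url("dmGetItemInfo", alias, pointer)


def image_info_url(alias: str, pointer: int) -> str:
    """URL for the technical image metadata of one item."""
    return build_api_url("dmGetImageInfo", alias, pointer)


if __name__ == "__main__":
    print(item_info_url("vko", 123))
    print(image_info_url("vko", 123))
```

The URLs are only built here, not fetched; any HTTP client can request them, ideally with the `timeout` value from the configuration.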

Supported URL Patterns

The application detects ContentDM item pages from URLs like:

  • https://site.contentdm.oclc.org/digital/collection/alias/id/123
  • https://site.contentdm.oclc.org/alias/id/123
  • Query parameter formats with collection= and id=
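The three URL shapes above could be recognized along these lines. This is a sketch only: `ItemRef` and `parse_item_url` are hypothetical names, the regular expression is one possible reading of the /id/ pattern, and item IDs are assumed to be numeric.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import parse_qs, urlparse


class ItemRef(NamedTuple):
    collection: str
    item_id: str


# Matches ".../collection/<alias>/id/<n>" as well as the shorter ".../<alias>/id/<n>".
_PATH_RE = re.compile(r"/(?P<alias>[^/]+)/id/(?P<id>\d+)(?:/|$)")


def parse_item_url(url: str) -> Optional[ItemRef]:
    """Return the collection alias and item ID, or None if url is not an item page."""
    parsed = urlparse(url)
    match = _PATH_RE.search(parsed.path)
    if match:
        return ItemRef(match["alias"], match["id"])
    # Fall back to the query-parameter form: ?collection=<alias>&id=<n>
    query = parse_qs(parsed.query)
    if "collection" in query and "id" in query:
        return ItemRef(query["collection"][0], query["id"][0])
    return None


if __name__ == "__main__":
    for candidate in (
        "https://site.contentdm.oclc.org/digital/collection/alias/id/123",
        "https://site.contentdm.oclc.org/alias/id/123",
        "https://site.contentdm.oclc.org/digital/search",
    ):
        print(candidate, "->", parse_item_url(candidate))
```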

🤖 AI Models

Image Captioning

  • Model: Salesforce BLIP (Bootstrapping Language-Image Pre-training)
  • Purpose: Generate descriptive captions for images
  • Output: Natural language descriptions of visual content

OCR Processing

  • Engine: Tesseract OCR
  • Languages: Configurable (default: English)
  • Preprocessing: Denoising, contrast enhancement, binarization

Named Entity Recognition

  • Model: spaCy English model (en_core_web_sm)
  • Entities: PERSON, ORG, GPE, LOC, DATE, etc.
  • Linking: Automatic linking to Wikidata and DBpedia

Performance Optimization

  • GPU Support: Automatic CUDA detection for image processing
  • Caching: Model and result caching for improved performance
  • Batch Processing: Efficient processing of multiple items

📊 Data Standards

CSV Output Format

Each processed item generates a CSV with these field categories:

  • original_*: Fields from ContentDM API
  • ai_*: AI-generated content and analysis
  • dc_*: Enhanced Dublin Core fields
  • processing_*: Metadata about the processing
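A per-item CSV with these prefixed columns, stored in a collection folder and named with record ID plus timestamp, might be written as in the sketch below. The `write_item_csv` helper, the timestamp format, the file-name separator, and the `processing_timestamp` column are assumptions made for illustration; only the prefixes and the naming idea come from this README.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def write_item_csv(record_id: str, collection: str, fields: Dict[str, str],
                   output_dir: str = "outputs") -> Path:
    """Write one single-row CSV into <output_dir>/<collection>/<record_id>_<timestamp>.csv."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    folder = Path(output_dir) / collection
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{record_id}_{stamp}.csv"
    row = dict(fields, processing_timestamp=stamp)  # assumed processing_* column
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
    return path


if __name__ == "__main__":
    written = write_item_csv(
        "123",
        "vko",
        {
            "original_title": "Group portrait",
            "ai_caption": "a group portrait in front of a building",
            "dc_description": "a group portrait in front of a building",
        },
    )
    print("wrote", written)
```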

JSON Data Package

Follows the Frictionless Data standard:

  • datapackage.json: Package descriptor with schema
  • Table schema definitions for CSV files
  • Resource metadata and relationships
  • Contributor and source information
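A minimal descriptor in this style can be assembled with the standard library alone. The sketch below is illustrative: `build_datapackage` is a hypothetical helper, every field is typed as `string` for simplicity, and the package name pattern is an assumption. The property names (`name`, `resources`, `path`, `schema`, `fields`, `sources`) follow the Data Package and Table Schema specifications.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def build_datapackage(collection: str, csv_paths: List[str],
                      field_names: List[str], source_url: str) -> Dict[str, Any]:
    """Build a datapackage.json descriptor with one tabular resource per CSV file."""
    schema = {"fields": [{"name": name, "type": "string"} for name in field_names]}
    return {
        "name": f"contentdm-{collection}".lower(),  # assumed naming pattern
        "resources": [
            {
                "name": Path(csv_path).stem.lower(),
                "path": csv_path,
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "utf-8",
                "schema": schema,
            }
            for csv_path in csv_paths
        ],
        "sources": [{"title": "ContentDM", "path": source_url}],
    }


if __name__ == "__main__":
    descriptor = build_datapackage(
        "vko",
        ["vko/123_20240101T000000Z.csv"],
        ["original_title", "ai_caption", "dc_description"],
        "https://vu.contentdm.oclc.org",
    )
    print(json.dumps(descriptor, indent=2))
```

Because all resources share one schema object here, the sketch assumes every item CSV has the same columns; a real export may need per-file schemas.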

Linked Data Integration

  • Wikidata URIs: http://www.wikidata.org/entity/Q*
  • DBpedia URIs: http://dbpedia.org/resource/*
  • SPARQL Queries: Automatic entity resolution
  • Confidence Scoring: Quality metrics for linked entities
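The sketch below illustrates the URI forms and one possible SPARQL lookup. It is not the application's resolver: the helper names, the exact-label query, and the shape of the candidate dictionaries are assumptions, and the 0.7 default simply mirrors `confidence_threshold` in the sample configuration. Nothing is sent over the network; the code only builds strings.

```python
from typing import Dict, List
from urllib.parse import quote, urlencode

WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def wikidata_uri(qid: str) -> str:
    """Entity URI for a Wikidata identifier such as 'Q727'."""
    return WIKIDATA_ENTITY + qid


def dbpedia_uri(label: str) -> str:
    """DBpedia resource names follow Wikipedia titles: spaces become underscores."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"), safe="_(),'")


def wikidata_label_query_url(label: str, language: str = "en", limit: int = 5) -> str:
    """Build a SPARQL endpoint URL that looks up items by exact label (assumed strategy)."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        "SELECT ?item WHERE { "
        f'?item rdfs:label "{escaped}"@{language} . '
        f"}} LIMIT {limit}"
    )
    return WIKIDATA_SPARQL + "?" + urlencode({"query": query, "format": "json"})


def filter_links(candidates: List[Dict], threshold: float = 0.7) -> List[Dict]:
    """Keep only candidate links whose confidence meets the threshold."""
    return [c for c in candidates if c["confidence"] >= threshold]


if __name__ == "__main__":
    print(wikidata_uri("Q727"))
    print(dbpedia_uri("Vrije Universiteit Amsterdam"))
    print(wikidata_label_query_url("Amsterdam"))
    print(filter_links([
        {"uri": wikidata_uri("Q727"), "confidence": 0.92},
        {"uri": dbpedia_uri("Amsterdam (disambiguation)"), "confidence": 0.41},
    ]))
```

Exact-label matching is the simplest lookup to show but returns many unrelated items for common names, which is why some confidence score and threshold are needed before a link is accepted.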

🐛 Troubleshooting

Common Issues

  1. AI Models Won't Load

    # Check available memory and disk space
    df -h
    free -h
    
    # For CUDA issues:
    nvidia-smi
  2. ContentDM API Timeouts

    # Increase timeout in config.yaml
    contentdm:
      timeout: 60
      max_retries: 5
  3. OCR Not Working

    # Install/reinstall Tesseract
    sudo apt-get install tesseract-ocr tesseract-ocr-eng
  4. Permission Errors

    # Fix output directory permissions
    sudo chown -R $USER:$USER outputs/
    chmod -R 755 outputs/

Performance Optimization

  • Memory Usage: On machines with limited RAM, run in CPU-only mode by setting device: "cpu"
  • Processing Speed: Reduce the batch size, for example batch_size: 5
  • Storage: Set a retention policy so that old output files are cleaned up

Logging

Check application logs for detailed error information:

tail -f contentdm_ai.log

For Docker deployments:

docker-compose logs -f contentdm-ai

🔒 Security Considerations

Data Privacy

  • All processing occurs locally or in your controlled environment
  • No data sent to external AI services
  • ContentDM API calls only to configured endpoints

Network Security

  • Configure firewall rules for production deployment
  • Use HTTPS in production (nginx configuration provided)
  • Implement authentication if needed (external auth proxy)

Content Security

  • iframe restrictions apply to ContentDM embedding
  • CORS and XSRF protections configurable
  • Input validation for all user-provided URLs

🀝 Contributing

Development Setup

  1. Fork the repository and clone your fork
  2. Create development branch: git checkout -b feature/your-feature
  3. Install dev dependencies: pip install -r requirements-dev.txt
  4. Make changes and test thoroughly
  5. Submit pull request with detailed description

Code Standards

  • Follow PEP 8 Python style guidelines
  • Add type hints for new functions
  • Include docstrings for public methods
  • Write tests for new functionality

Testing

# Run tests
python -m pytest tests/

# Run with coverage
python -m pytest --cov=src tests/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this software in research or academic work, please cite:

@software{contentdm_ai_generator,
  title = {ContentDM AI Metadata Generator},
  author = {Vanderfeesten, Maurice},
  year = {2024},
  url = {https://github.com/ubvu/contentdm_ai_metadata_generator},
  doi = {10.5281/zenodo.XXXXXX}
}

🙋 Support

Documentation

Community

Professional Support

For institutional deployments, training, or custom development:

  • Contact: Maurice Vanderfeesten (ORCID: 0000-0001-6397-4759)
  • Commercial support and consulting available

ContentDM AI Metadata Generator - Enhancing digital cultural heritage with artificial intelligence.
