Skip to content

sidataplus/UMLS-to-DuckDB

Repository files navigation

UMLS to DuckDB Conversion Scripts

This repository contains scripts to convert the Unified Medical Language System (UMLS) data from RRF (Rich Release Format) files to DuckDB format, with additional views for user-friendly access and OMOP CDM v5.4 compatibility.

Overview

The UMLS is a comprehensive medical terminology system created by the National Library of Medicine (NLM). It integrates over 200 source vocabularies and contains millions of biomedical concepts with relationships between them. The original RRF format is challenging to work with directly, hence this conversion to a modern analytical database format.

Source Data

UMLS Source: The scripts expect UMLS data in the META/ directory in RRF format, which can be downloaded from:

  • Primary Source: UMLS Terminology Services (UTS)
  • License Required: Free but requires registration and acceptance of UMLS License Agreement
  • Current Version: Scripts tested with UMLS 2024AA release

Key UMLS Files Used

  • MRCONSO.RRF - Concept names and sources (7.3M+ rows, 2.45M unique concepts)
  • MRREL.RRF - Relationships between concepts
  • MRSTY.RRF - Semantic types for concepts
  • MRSAB.RRF - Source vocabulary abbreviations
  • MRDOC.RRF - Documentation for abbreviations and values

Why These Scripts Are Needed

1. Data Format Challenges

  • RRF Format: Pipe-delimited files with cryptic column names and coding schemes
  • Size: 7+ million rows with complex relationships across dozens of files
  • Performance: RRF files are not indexed and queries are extremely slow

2. Analytical Requirements

  • Modern SQL: Need for window functions, CTEs, and analytical queries
  • Performance: Fast aggregations and joins for research and applications
  • Integration: Easy integration with modern data science toolchains

3. Usability Issues

  • Cryptic Names: Table names like MRCONSO, column names like CUI, AUI, STR
  • Complex Codes: Term types (TTY), relationship types (REL), semantic types (STY)
  • Documentation: Requires deep UMLS knowledge to understand data structure

4. OMOP CDM Compatibility

  • Healthcare Analytics: Many healthcare organizations use OMOP CDM
  • Standardization: OMOP provides standardized vocabulary structure
  • Tools: Existing OMOP-compatible analytical tools and frameworks

Scripts Overview

Core Conversion Scripts

  1. convert_umls_to_duckdb.sh - Main conversion script

    • Converts RRF files to DuckDB tables
    • Creates indexes for performance
    • Handles data type conversions and null values
  2. create_friendly_views.sql - User-friendly views

    • Schema: umls_friendly
    • Intuitive table and column names
    • Simplified structure for common queries
  3. create_omop_views.sql - OMOP CDM v5.4 compatible views

    • Schema: umls_omop
    • Full OMOP vocabulary table structure
    • Includes ALL 2.45M unique concepts (one per CUI)
    • Proper language concept ID mappings from language_concepts.csv
    • All terms/languages in concept_synonym table (not just English)

Support Scripts

  1. apply_friendly_views.sh - Applies friendly views to database
  2. apply_omop_views.sh - Applies OMOP views to database
  3. analyze_concept_counts.sql - Analysis queries for data validation

Usage

macOS

Complete Setup (with scripts):

# Install DuckDB
brew install duckdb

# Run all conversion steps
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.sh

Direct DuckDB Commands (no scripts needed):

# Install DuckDB
brew install duckdb

# Manual conversion (if you prefer direct commands)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Windows

Complete Setup (with Batch Files):

# Install DuckDB (via Chocolatey)
choco install duckdb

# Run all conversion steps with Windows batch files
convert_umls_to_duckdb.bat
apply_friendly_views.bat
apply_omop_views.bat

Alternative (with WSL):

# Use WSL with the bash scripts
wsl ./convert_umls_to_duckdb.sh
wsl ./apply_friendly_views.sh
wsl ./apply_omop_views.sh

Direct DuckDB Commands:

# Install DuckDB then run direct commands
duckdb.exe umls.duckdb < create_friendly_views.sql
duckdb.exe umls.duckdb < create_omop_views.sql

Alternative (via conda/pip):

# Install DuckDB via conda
conda install -c conda-forge duckdb

# Or via pip
pip install duckdb

# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Linux

Complete Setup (Ubuntu/Debian):

# Install DuckDB
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/

# Run conversion with scripts
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.sh

Direct DuckDB Commands:

# Direct conversion (no scripts needed)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Alternative (via package manager):

# CentOS/RHEL/Fedora
sudo dnf install duckdb
# or
sudo yum install duckdb

# Arch Linux
sudo pacman -S duckdb

# Via conda (any Linux)
conda install -c conda-forge duckdb

# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Query Examples

Friendly Schema:

-- Find diabetes concepts
SELECT concept_id, term_string, source_vocabulary 
FROM umls_friendly.concepts 
WHERE term_string ILIKE '%diabetes%' 
LIMIT 10;

OMOP Schema:

-- Find drug concepts with all synonyms
SELECT 
    c.concept_name,
    cs.concept_synonym_name,
    l.language_name
FROM umls_omop.concept c
JOIN umls_omop.concept_synonym cs ON c.concept_id = cs.concept_id
JOIN umls_omop.language_summary l ON cs.language_concept_id = l.language_concept_id
WHERE c.domain_id = 'Drug'
  AND c.vocabulary_id = 'RXNORM'
LIMIT 20;

Database Schema Overview

Raw UMLS Schema (umls)

  • Direct mapping from RRF files
  • Original table and column names
  • Fastest queries, requires UMLS knowledge

Friendly Schema (umls_friendly)

  • Intuitive names and structure
  • Good for exploratory analysis
  • Simplified relationships

OMOP Schema (umls_omop)

  • Full OMOP CDM v5.4 compatibility
  • Standardized healthcare vocabulary structure
  • Compatible with OMOP analytical tools

Data Statistics

  • Total MRCONSO rows: 7.3M+ (all terms across all languages/sources)
  • Unique concepts (CUIs): 2.45M (corrected in OMOP views)
  • Languages supported: 50+ languages with proper OMOP concept IDs
  • Source vocabularies: 200+ (SNOMED CT, RxNorm, ICD-10, etc.)
  • Relationships: Millions of semantic relationships

Requirements

  • DuckDB: v0.8.0 or later (brew install duckdb)
  • Storage: ~10GB for full conversion with indexes
  • Memory: 8GB+ recommended for large queries
  • UMLS License: Required for accessing source data

Performance Notes

  • Conversion Time: ~30 minutes for full UMLS dataset
  • Query Performance: 100x+ faster than original RRF files
  • Indexes: Created automatically for common query patterns
  • Compression: DuckDB provides excellent compression ratios

License and Attribution

UMLS Data

  • Source: National Library of Medicine (NLM)
  • License: UMLS License Agreement (free registration required)
  • Citation: U.S. National Library of Medicine. Unified Medical Language System (UMLS). Bethesda, MD.

Scripts

  • License: MIT (see LICENSE file)
  • Usage: Free for research and commercial use
  • Attribution: Please cite this repository if used in research

Troubleshooting

Common Issues

  1. Missing META directory: Ensure UMLS data is extracted to META/ subdirectory
  2. Memory errors: Reduce batch size in conversion scripts or increase available RAM
  3. Permission errors: Ensure DuckDB can write to output directory
  4. Language mapping errors: Verify language_concepts.csv is present and properly formatted

Validation Queries

-- Verify concept counts
SELECT 'Total unique CUIs' as metric, COUNT(DISTINCT concept_id) as count FROM umls_omop.concept;

-- Check language distribution  
SELECT * FROM umls_omop.language_summary ORDER BY synonym_count DESC LIMIT 10;

-- Validate OMOP structure
SELECT table_name, COUNT(*) as row_count FROM (
    SELECT 'concept' as table_name FROM umls_omop.concept
    UNION ALL
    SELECT 'concept_synonym' as table_name FROM umls_omop.concept_synonym
    -- ... other tables
) GROUP BY table_name;

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create feature branch
  3. Add tests for new functionality
  4. Submit pull request with clear description

Support

For questions or issues:

  1. Check existing GitHub issues
  2. Review UMLS documentation at UTS
  3. Create new issue with reproducible example

About

Convert the Unified Medical Language System (UMLS) data from RRF files to DuckDB format

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published