UMLS to DuckDB Conversion Scripts

This repository contains scripts to convert the Unified Medical Language System (UMLS) data from RRF (Rich Release Format) files to DuckDB format, with additional views for user-friendly access and OMOP CDM v5.4 compatibility.

Overview

The UMLS is a comprehensive medical terminology system created by the National Library of Medicine (NLM). It integrates over 200 source vocabularies and contains millions of biomedical concepts with relationships between them. The original RRF format is challenging to work with directly, hence this conversion to a modern analytical database format.

Source Data

UMLS Source: The scripts expect UMLS data in the META/ directory in RRF format, which can be downloaded from:

Primary Source: UMLS Terminology Services (UTS)
License Required: Free but requires registration and acceptance of UMLS License Agreement
Current Version: Scripts tested with UMLS 2024AA release

Key UMLS Files Used

MRCONSO.RRF - Concept names and sources (7.3M+ rows, 2.45M unique concepts)
MRREL.RRF - Relationships between concepts
MRSTY.RRF - Semantic types for concepts
MRSAB.RRF - Source vocabulary abbreviations
MRDOC.RRF - Documentation for abbreviations and values

Why These Scripts Are Needed

1. Data Format Challenges

RRF Format: Pipe-delimited files with cryptic column names and coding schemes
Size: 7+ million rows with complex relationships across dozens of files
Performance: RRF files are not indexed and queries are extremely slow

2. Analytical Requirements

Modern SQL: Need for window functions, CTEs, and analytical queries
Performance: Fast aggregations and joins for research and applications
Integration: Easy integration with modern data science toolchains

3. Usability Issues

Cryptic Names: Table names like MRCONSO, column names like CUI, AUI, STR
Complex Codes: Term types (TTY), relationship types (REL), semantic types (STY)
Documentation: Requires deep UMLS knowledge to understand data structure

4. OMOP CDM Compatibility

Healthcare Analytics: Many healthcare organizations use OMOP CDM
Standardization: OMOP provides standardized vocabulary structure
Tools: Existing OMOP-compatible analytical tools and frameworks

Scripts Overview

Core Conversion Scripts

convert_umls_to_duckdb.sh - Main conversion script
- Converts RRF files to DuckDB tables
- Creates indexes for performance
- Handles data type conversions and null values
create_friendly_views.sql - User-friendly views
- Schema: umls_friendly
- Intuitive table and column names
- Simplified structure for common queries
create_omop_views.sql - OMOP CDM v5.4 compatible views
- Schema: umls_omop
- Full OMOP vocabulary table structure
- Includes ALL 2.45M unique concepts (one per CUI)
- Proper language concept ID mappings from language_concepts.csv
- All terms/languages in concept_synonym table (not just English)

Support Scripts

apply_friendly_views.sh - Applies friendly views to database
apply_omop_views.sh - Applies OMOP views to database
analyze_concept_counts.sql - Analysis queries for data validation

Usage

macOS

Complete Setup (with scripts):

# Install DuckDB
brew install duckdb

# Run all conversion steps
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.sh

Direct DuckDB Commands (no scripts needed):

# Install DuckDB
brew install duckdb

# Manual conversion (if you prefer direct commands)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Windows

Complete Setup (with Batch Files):

# Install DuckDB (via Chocolatey)
choco install duckdb

# Run all conversion steps with Windows batch files
convert_umls_to_duckdb.bat
apply_friendly_views.bat
apply_omop_views.bat

Alternative (with WSL):

# Use WSL with the bash scripts
wsl ./convert_umls_to_duckdb.sh
wsl ./apply_friendly_views.sh
wsl ./apply_omop_views.sh

Direct DuckDB Commands:

# Install DuckDB then run direct commands
duckdb.exe umls.duckdb < create_friendly_views.sql
duckdb.exe umls.duckdb < create_omop_views.sql

Alternative (via conda/pip):

# Install DuckDB via conda
conda install -c conda-forge duckdb

# Or via pip
pip install duckdb

# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Linux

Complete Setup (Ubuntu/Debian):

# Install DuckDB
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/

# Run conversion with scripts
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.sh

Direct DuckDB Commands:

# Direct conversion (no scripts needed)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Alternative (via package manager):

# CentOS/RHEL/Fedora
sudo dnf install duckdb
# or
sudo yum install duckdb

# Arch Linux
sudo pacman -S duckdb

# Via conda (any Linux)
conda install -c conda-forge duckdb

# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sql

Query Examples

Friendly Schema:

-- Find diabetes concepts
SELECT concept_id, term_string, source_vocabulary 
FROM umls_friendly.concepts 
WHERE term_string ILIKE '%diabetes%' 
LIMIT 10;

OMOP Schema:

-- Find drug concepts with all synonyms
SELECT 
    c.concept_name,
    cs.concept_synonym_name,
    l.language_name
FROM umls_omop.concept c
JOIN umls_omop.concept_synonym cs ON c.concept_id = cs.concept_id
JOIN umls_omop.language_summary l ON cs.language_concept_id = l.language_concept_id
WHERE c.domain_id = 'Drug'
  AND c.vocabulary_id = 'RXNORM'
LIMIT 20;

Database Schema Overview

Raw UMLS Schema (`umls`)

Direct mapping from RRF files
Original table and column names
Fastest queries, requires UMLS knowledge

Friendly Schema (`umls_friendly`)

Intuitive names and structure
Good for exploratory analysis
Simplified relationships

OMOP Schema (`umls_omop`)

Full OMOP CDM v5.4 compatibility
Standardized healthcare vocabulary structure
Compatible with OMOP analytical tools

Data Statistics

Total MRCONSO rows: 7.3M+ (all terms across all languages/sources)
Unique concepts (CUIs): 2.45M (corrected in OMOP views)
Languages supported: 50+ languages with proper OMOP concept IDs
Source vocabularies: 200+ (SNOMED CT, RxNorm, ICD-10, etc.)
Relationships: Millions of semantic relationships

Requirements

DuckDB: v0.8.0 or later (brew install duckdb)
Storage: ~10GB for full conversion with indexes
Memory: 8GB+ recommended for large queries
UMLS License: Required for accessing source data

Performance Notes

Conversion Time: ~30 minutes for full UMLS dataset
Query Performance: 100x+ faster than original RRF files
Indexes: Created automatically for common query patterns
Compression: DuckDB provides excellent compression ratios

License and Attribution

UMLS Data

Source: National Library of Medicine (NLM)
License: UMLS License Agreement (free registration required)
Citation: U.S. National Library of Medicine. Unified Medical Language System (UMLS). Bethesda, MD.

Scripts

License: MIT (see LICENSE file)
Usage: Free for research and commercial use
Attribution: Please cite this repository if used in research

Troubleshooting

Common Issues

Missing META directory: Ensure UMLS data is extracted to META/ subdirectory
Memory errors: Reduce batch size in conversion scripts or increase available RAM
Permission errors: Ensure DuckDB can write to output directory
Language mapping errors: Verify language_concepts.csv is present and properly formatted

Validation Queries

-- Verify concept counts
SELECT 'Total unique CUIs' as metric, COUNT(DISTINCT concept_id) as count FROM umls_omop.concept;

-- Check language distribution  
SELECT * FROM umls_omop.language_summary ORDER BY synonym_count DESC LIMIT 10;

-- Validate OMOP structure
SELECT table_name, COUNT(*) as row_count FROM (
    SELECT 'concept' as table_name FROM umls_omop.concept
    UNION ALL
    SELECT 'concept_synonym' as table_name FROM umls_omop.concept_synonym
    -- ... other tables
) GROUP BY table_name;

Contributing

Contributions welcome! Please:

Fork the repository
Create feature branch
Add tests for new functionality
Submit pull request with clear description

Support

For questions or issues:

Check existing GitHub issues
Review UMLS documentation at UTS
Create new issue with reproducible example

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
META		META
.gitignore		.gitignore
Copyright_Notice.txt		Copyright_Notice.txt
README.md		README.md
analyze_concept_counts.sql		analyze_concept_counts.sql
apply_friendly_views.bat		apply_friendly_views.bat
apply_friendly_views.sh		apply_friendly_views.sh
apply_omop_views.bat		apply_omop_views.bat
apply_omop_views.sh		apply_omop_views.sh
convert_umls_to_duckdb.bat		convert_umls_to_duckdb.bat
convert_umls_to_duckdb.sh		convert_umls_to_duckdb.sh
create_friendly_views.sql		create_friendly_views.sql
create_omop_views.sql		create_omop_views.sql
create_umls_duckdb.sql		create_umls_duckdb.sql

sidataplus/UMLS-to-DuckDB

Folders and files

Latest commit

History

Repository files navigation