This repository contains scripts to convert the Unified Medical Language System (UMLS) data from RRF (Rich Release Format) files to DuckDB format, with additional views for user-friendly access and OMOP CDM v5.4 compatibility.
The UMLS is a comprehensive medical terminology system created by the National Library of Medicine (NLM). It integrates over 200 source vocabularies and contains millions of biomedical concepts with relationships between them. The original RRF format is challenging to work with directly, hence this conversion to a modern analytical database format.
UMLS Source: The scripts expect UMLS data in the META/ directory in RRF format, which can be downloaded from:
- Primary Source: UMLS Terminology Services (UTS)
- License Required: Free but requires registration and acceptance of UMLS License Agreement
- Current Version: Scripts tested with UMLS 2024AA release
MRCONSO.RRF- Concept names and sources (7.3M+ rows, 2.45M unique concepts)MRREL.RRF- Relationships between conceptsMRSTY.RRF- Semantic types for conceptsMRSAB.RRF- Source vocabulary abbreviationsMRDOC.RRF- Documentation for abbreviations and values
- RRF Format: Pipe-delimited files with cryptic column names and coding schemes
- Size: 7+ million rows with complex relationships across dozens of files
- Performance: RRF files are not indexed and queries are extremely slow
- Modern SQL: Need for window functions, CTEs, and analytical queries
- Performance: Fast aggregations and joins for research and applications
- Integration: Easy integration with modern data science toolchains
- Cryptic Names: Table names like
MRCONSO, column names likeCUI,AUI,STR - Complex Codes: Term types (TTY), relationship types (REL), semantic types (STY)
- Documentation: Requires deep UMLS knowledge to understand data structure
- Healthcare Analytics: Many healthcare organizations use OMOP CDM
- Standardization: OMOP provides standardized vocabulary structure
- Tools: Existing OMOP-compatible analytical tools and frameworks
-
convert_umls_to_duckdb.sh- Main conversion script- Converts RRF files to DuckDB tables
- Creates indexes for performance
- Handles data type conversions and null values
-
create_friendly_views.sql- User-friendly views- Schema:
umls_friendly - Intuitive table and column names
- Simplified structure for common queries
- Schema:
-
create_omop_views.sql- OMOP CDM v5.4 compatible views- Schema:
umls_omop - Full OMOP vocabulary table structure
- Includes ALL 2.45M unique concepts (one per CUI)
- Proper language concept ID mappings from
language_concepts.csv - All terms/languages in
concept_synonymtable (not just English)
- Schema:
apply_friendly_views.sh- Applies friendly views to databaseapply_omop_views.sh- Applies OMOP views to databaseanalyze_concept_counts.sql- Analysis queries for data validation
Complete Setup (with scripts):
# Install DuckDB
brew install duckdb
# Run all conversion steps
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.shDirect DuckDB Commands (no scripts needed):
# Install DuckDB
brew install duckdb
# Manual conversion (if you prefer direct commands)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sqlComplete Setup (with Batch Files):
# Install DuckDB (via Chocolatey)
choco install duckdb
# Run all conversion steps with Windows batch files
convert_umls_to_duckdb.bat
apply_friendly_views.bat
apply_omop_views.batAlternative (with WSL):
# Use WSL with the bash scripts
wsl ./convert_umls_to_duckdb.sh
wsl ./apply_friendly_views.sh
wsl ./apply_omop_views.shDirect DuckDB Commands:
# Install DuckDB then run direct commands
duckdb.exe umls.duckdb < create_friendly_views.sql
duckdb.exe umls.duckdb < create_omop_views.sqlAlternative (via conda/pip):
# Install DuckDB via conda
conda install -c conda-forge duckdb
# Or via pip
pip install duckdb
# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sqlComplete Setup (Ubuntu/Debian):
# Install DuckDB
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/
# Run conversion with scripts
./convert_umls_to_duckdb.sh
./apply_friendly_views.sh
./apply_omop_views.shDirect DuckDB Commands:
# Direct conversion (no scripts needed)
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sqlAlternative (via package manager):
# CentOS/RHEL/Fedora
sudo dnf install duckdb
# or
sudo yum install duckdb
# Arch Linux
sudo pacman -S duckdb
# Via conda (any Linux)
conda install -c conda-forge duckdb
# Direct commands
duckdb umls.duckdb < convert_umls_tables.sql
duckdb umls.duckdb < create_friendly_views.sql
duckdb umls.duckdb < create_omop_views.sqlFriendly Schema:
-- Find diabetes concepts
SELECT concept_id, term_string, source_vocabulary
FROM umls_friendly.concepts
WHERE term_string ILIKE '%diabetes%'
LIMIT 10;OMOP Schema:
-- Find drug concepts with all synonyms
SELECT
c.concept_name,
cs.concept_synonym_name,
l.language_name
FROM umls_omop.concept c
JOIN umls_omop.concept_synonym cs ON c.concept_id = cs.concept_id
JOIN umls_omop.language_summary l ON cs.language_concept_id = l.language_concept_id
WHERE c.domain_id = 'Drug'
AND c.vocabulary_id = 'RXNORM'
LIMIT 20;- Direct mapping from RRF files
- Original table and column names
- Fastest queries, requires UMLS knowledge
- Intuitive names and structure
- Good for exploratory analysis
- Simplified relationships
- Full OMOP CDM v5.4 compatibility
- Standardized healthcare vocabulary structure
- Compatible with OMOP analytical tools
- Total MRCONSO rows: 7.3M+ (all terms across all languages/sources)
- Unique concepts (CUIs): 2.45M (corrected in OMOP views)
- Languages supported: 50+ languages with proper OMOP concept IDs
- Source vocabularies: 200+ (SNOMED CT, RxNorm, ICD-10, etc.)
- Relationships: Millions of semantic relationships
- DuckDB: v0.8.0 or later (
brew install duckdb) - Storage: ~10GB for full conversion with indexes
- Memory: 8GB+ recommended for large queries
- UMLS License: Required for accessing source data
- Conversion Time: ~30 minutes for full UMLS dataset
- Query Performance: 100x+ faster than original RRF files
- Indexes: Created automatically for common query patterns
- Compression: DuckDB provides excellent compression ratios
- Source: National Library of Medicine (NLM)
- License: UMLS License Agreement (free registration required)
- Citation: U.S. National Library of Medicine. Unified Medical Language System (UMLS). Bethesda, MD.
- License: MIT (see LICENSE file)
- Usage: Free for research and commercial use
- Attribution: Please cite this repository if used in research
- Missing META directory: Ensure UMLS data is extracted to
META/subdirectory - Memory errors: Reduce batch size in conversion scripts or increase available RAM
- Permission errors: Ensure DuckDB can write to output directory
- Language mapping errors: Verify
language_concepts.csvis present and properly formatted
-- Verify concept counts
SELECT 'Total unique CUIs' as metric, COUNT(DISTINCT concept_id) as count FROM umls_omop.concept;
-- Check language distribution
SELECT * FROM umls_omop.language_summary ORDER BY synonym_count DESC LIMIT 10;
-- Validate OMOP structure
SELECT table_name, COUNT(*) as row_count FROM (
SELECT 'concept' as table_name FROM umls_omop.concept
UNION ALL
SELECT 'concept_synonym' as table_name FROM umls_omop.concept_synonym
-- ... other tables
) GROUP BY table_name;Contributions welcome! Please:
- Fork the repository
- Create feature branch
- Add tests for new functionality
- Submit pull request with clear description
For questions or issues:
- Check existing GitHub issues
- Review UMLS documentation at UTS
- Create new issue with reproducible example