
OM_QEX - Outcome Mapping Quality of Evidence Exchange

A curated dataset of 114 studies on poverty graduation programs with full-text extractions and LLM-based data extraction tools.

📖 Comprehensive Documentation: See docs/ for technical reports and performance analysis

✅ Dataset Status (Nov 11, 2025): 114 studies ready for extraction | All GROBID outputs complete

πŸ“ Structure

OM_QEX/
├── data/
│   ├── raw/                  # Master CSV (114 studies) + fulltext metadata (673 papers)
│   ├── human_extraction/     # Manual extractions (ground truth)
│   ├── grobid_outputs/       # 114 studies × 2 formats (TEI XML + TXT) ✅
│   └── pdfs_from_zotero/     # 19 PDFs downloaded from Zotero (archived source)
├── om_qex_extraction/        # 🆕 LLM-based extraction app
│   ├── src/                  # Extraction engine and parsers
│   ├── prompts/              # LLM extraction prompts
│   ├── config/               # Configuration (API keys)
│   └── outputs/              # Extracted data (JSON + CSV)
├── docs/                     # 📄 Documentation
│   ├── BASELINE_PERFORMANCE_REPORT.md    # Technical performance analysis
│   ├── BASELINE_RESULTS_EMAIL.md         # Stakeholder summary
│   ├── HUMAN_COMPARISON_RESULTS.md       # LLM vs human comparison
│   └── CLEANUP_LOG.md                    # Project organization log
├── scripts/                  # Data processing utilities (map IDs to keys, copy files)
├── archive/                  # Historical files and one-time scripts
│   └── zotero_sync_nov11/    # Zotero sync scripts, logs, and verification tools
├── find_missing_in_zotero.py # Find studies in Zotero by EPPI-Reviewer ID
└── download_missing_pdfs.py  # Download PDFs from Zotero library

📊 Dataset

114 included studies on poverty graduation and ultra-poor programs - All ready for extraction ✅

Dataset Expansion (Nov 11, 2025)

  • Expanded from 95 → 114 studies (+19 studies)
  • All 114 studies processed through GROBID
  • 114 TEI XML files + 114 TXT files ready

Raw Data (data/raw/)

  • Master file (n=114) - Primary dataset with study metadata
  • fulltext_metadata (673 entries) - Maps all 114 study IDs to GROBID file Keys

Human Extraction (data/human_extraction/)

  • Manual data extraction - Ground truth for comparison with LLM extraction
    • QEX validation: 8 week SR QEX Pierre SOF and TEEP(Quant Extraction Form).csv (3 studies, detailed fields)
    • OM validation: OM_human_extraction.csv (9 valid studies, 57 outcomes total)
  • Prompt engineering input - Reference data for developing extraction prompts
  • Quality benchmark - Validation standard for automated extraction

⚠️ Note on OM_human_extraction.csv: Contains 3 special case studies (121498800, 121498801, 121498803) that are excluded from comparison (duplicates and qualitative-only). See data/README.md for details.

Full-Text Outputs (data/grobid_outputs/)

  • tei/ - 114 TEI XML files (structured with sections, references, metadata)
  • text/ - 114 plain text files (cleaned full-text extraction)
  • All files linked via Keys in fulltext_metadata.csv
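The TEI files are namespaced XML, so queries need the TEI namespace. A minimal stdlib example of pulling a paper title out of one of these files (the repository's full parser is om_qex_extraction/src/tei_parser.py; this is just an illustrative sketch):

```python
import xml.etree.ElementTree as ET

# GROBID emits TEI XML in the standard TEI namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def paper_title(tei_path):
    """Read the document title from a GROBID TEI XML file."""
    root = ET.parse(tei_path).getroot()
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return title.text if title is not None else None
```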

🛠️ Tools & Scripts

LLM Extraction Application (om_qex_extraction/) ⭐

Automated outcome extraction from research papers using LLMs with dual-mode operation.

🆕 Extraction Modes

1. OM (Outcome Mapping) - Comprehensive outcome identification

  • Identifies ALL outcomes with statistical analysis
  • Simple categorization (outcome_group, outcome_category, location)
  • Output: ~14 outcomes per paper
  • Use case: Systematic review mapping, outcome inventory

2. QEX (Quantitative Extraction) - Detailed statistical extraction

  • Extracts complete statistical data for outcomes
  • Full details (effect_size, p_value, sample_sizes, graduation_components)
  • Output: Detailed data for meta-analysis
  • Use case: Statistical synthesis, detailed data extraction

3. Two-Stage Pipeline - OM guides QEX for maximum coverage

  • Stage 1 (OM): Find all outcomes with locations
  • Stage 2 (QEX): Extract details using OM hints
  • Result: 133% more outcomes than standalone QEX
  • 100% OMβ†’QEX conversion rate
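The stage-1-guides-stage-2 flow can be sketched as a small orchestration function. This is a minimal illustration, not the repository's actual API: `extract_om` and `extract_qex` are stand-ins for the LLM calls made by run_twostage_extraction.py.

```python
def run_two_stage(paper_text, extract_om, extract_qex):
    """Stage 1 (OM) finds all outcomes; stage 2 (QEX) extracts details
    guided by the stage-1 hints. Callables are hypothetical stand-ins."""
    # Stage 1: identify every outcome with its location in the paper
    om_outcomes = extract_om(paper_text)

    # Stage 2: extract full statistics for each OM-identified outcome
    detailed = []
    for outcome in om_outcomes:
        hint = f"{outcome['outcome_category']} at {outcome['location']}"
        detailed.append(extract_qex(paper_text, hint=hint))
    return detailed
```

Because stage 2 receives an explicit location hint for every outcome, each identified outcome yields an extraction attempt, which is what the 100% OM→QEX conversion rate reflects.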

🚀 Quick Start

cd om_qex_extraction

# Run two-stage extraction (recommended)
python run_twostage_extraction.py --keys PHRKN65M

# Or run modes separately:
python run_extraction.py --mode om --keys PHRKN65M    # Find all outcomes
python run_extraction.py --mode qex --keys PHRKN65M   # Extract details

📊 Baseline Performance (Nov 10, 2025)

Test Paper: PHRKN65M (Burchi & Strupat 2018, Malawi TEEP)

| Approach | Outcomes Found | Tables Covered | vs Human (9 outcomes) |
|---|---|---|---|
| Human extraction | 9 | Tables 6, 8, 11, 13, 15, 16, 17 | Baseline |
| Regular QEX | 6 | Tables 4, 6, 7, 8, 9, 10, 11 | 67% coverage |
| Two-stage (OM→QEX) | 14 | Tables 5, 6, 7, 9, 10, 12, 13, 15, 17, 18 | 156% of human |
| Improved OM (v2) | 14 | 10 tables | +56% vs human |

Key Findings:

  • ✅ Two-stage pipeline: 133% improvement over regular QEX (6 → 14 outcomes)
  • ✅ 100% OM→QEX conversion rate (all identified outcomes extracted)
  • ⚠️ Different table selection than human (57% overlap)
  • ⚠️ Paper has 22 results tables - both human and LLM select subsets
  • 📈 LLM found 6 additional tables the human extractor didn't cover
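The 57% overlap and the 6 extra tables both follow directly from the table lists in the baseline comparison above:

```python
# Table numbers from the baseline comparison (PHRKN65M)
human_tables = {6, 8, 11, 13, 15, 16, 17}
twostage_tables = {5, 6, 7, 9, 10, 12, 13, 15, 17, 18}

shared = human_tables & twostage_tables          # tables both selected
overlap = len(shared) / len(human_tables)        # fraction of human tables covered
extra = twostage_tables - human_tables           # tables only the LLM extracted

print(sorted(shared), f"{overlap:.0%}", len(extra))  # prints: [6, 13, 15, 17] 57% 6
```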

Verification fields added:

  • literal_text: Exact quote from paper for manual verification
  • text_position: Precise location (Table X, Row Y, Column Z)
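These fields make spot-checking mechanical: every extracted quote can be searched for verbatim in the paper's plain-text GROBID output. A minimal checker sketch (illustrative; it assumes records are dicts keyed by the field names listed above):

```python
def unverified_records(records, fulltext):
    """Return the records whose `literal_text` quote is NOT found verbatim
    in the paper's plain-text extraction (candidates for manual review)."""
    # Normalize whitespace so line wrapping in the TXT file doesn't cause misses
    normalized = " ".join(fulltext.split())
    return [r for r in records
            if " ".join(r["literal_text"].split()) not in normalized]
```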

📚 Documentation

✅ System Status (Nov 10, 2025)

  • Architecture: Dual-mode (OM + QEX) with two-stage pipeline
  • Model: Claude 3.5 Haiku via OpenRouter API
  • Extraction: Working end-to-end with network retry logic
  • Test papers: 2 papers analyzed (PHRKN65M, ABM3E3ZP)
  • Baseline: 14 outcomes per paper (56% more than human extraction)
  • Coverage: Finding different tables than human - not necessarily worse
  • Precision: To be validated (next step)
  • Status: Ready for prompt engineering improvements

🔧 Features

  • ✅ Dual-mode extraction: OM (outcome mapping) + QEX (quantitative extraction)
  • ✅ Two-stage pipeline: OM guides QEX for 133% better coverage
  • ✅ TEI XML parser for GROBID outputs
  • ✅ Verification fields (literal_text, text_position) for manual checking
  • ✅ Batch processing with robust network retry logic
  • ✅ JSON + CSV output formats
  • ✅ Handles complex multi-outcome papers (10-20+ outcomes per paper)
  • ✅ Comprehensive results scanning (continues through entire paper)

📊 Extracted Fields

OM (Outcome Mapping) Fields:

  • outcome_group (high-level category: Poverty, Income, Assets, etc.)
  • outcome_category (specific outcome name)
  • location (page, table, section reference)
  • literal_text (exact quote from paper)
  • text_position (precise location for verification)

QEX (Quantitative Extraction) Fields:

  • All OM fields plus:
  • outcome_description, evaluation_design, sample_sizes
  • effect_size, standard_error, p_value, confidence_interval
  • graduation_components (7 components: consumption, healthcare, assets, skills, savings, coaching, social)

See om_qex_extraction/prompts/ for prompt templates.
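As a rough sketch of the record shape: the repository's actual schema lives in om_qex_extraction/src/models.py as Pydantic models, but a plain dataclass mirroring the field lists above conveys the structure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QexOutcome:
    # OM fields
    outcome_group: str                 # high-level category, e.g. "Income"
    outcome_category: str              # specific outcome name
    location: str                      # page/table/section reference
    literal_text: str                  # exact quote for verification
    text_position: str                 # e.g. "Table 6, Row 3, Column 2"
    # QEX additions (optional because not every outcome reports every statistic)
    outcome_description: Optional[str] = None
    evaluation_design: Optional[str] = None
    sample_sizes: Optional[str] = None
    effect_size: Optional[float] = None
    standard_error: Optional[float] = None
    p_value: Optional[float] = None
    confidence_interval: Optional[str] = None
    graduation_components: list = field(default_factory=list)  # subset of the 7 components
```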


Data Processing Scripts (scripts/)

Utility scripts for data management and analysis:

  • map_ids_to_keys.py - Maps study IDs to GROBID Keys for extraction
  • copy_files_by_key.py - Extracts GROBID outputs for specific paper Keys

Zotero Sync Tools (Project Root) 🆕

Download PDFs from Zotero library for studies missing GROBID outputs:

  • find_missing_in_zotero.py - Search Zotero library for studies by EPPI-Reviewer ID

    • Searches 1,600+ Zotero items
    • Matches via extra field containing study IDs
    • Outputs mapping: Study ID → Zotero Key → PDF status
  • download_missing_pdfs.py - Download PDFs from Zotero

    • Uses mapping CSV from find_missing_in_zotero.py
    • Downloads only missing PDFs (skips existing)
    • Saves to data/pdfs_from_zotero/

Usage:

# 1. Find studies in Zotero
python find_missing_in_zotero.py

# 2. Download PDFs
python download_missing_pdfs.py

See archive/zotero_sync_nov11/README.md for full documentation.
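The "skips existing" behaviour presumably amounts to filtering the mapping against PDFs already on disk before downloading. A sketch of that filter (the column names `zotero_key` and `pdf_status`, and the status value `has_pdf`, are assumptions, not the script's actual headers):

```python
import csv
from pathlib import Path

def keys_to_download(mapping_csv, pdf_dir):
    """Return Zotero keys from the mapping CSV whose PDF is not yet in pdf_dir.

    Assumes PDFs are saved as <zotero_key>.pdf; column names are hypothetical.
    """
    existing = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        return [row["zotero_key"] for row in csv.DictReader(f)
                if row["pdf_status"] == "has_pdf"
                and row["zotero_key"] not in existing]
```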


Diagnostic Scripts (archive/)

Historical diagnostic scripts used during data cleaning and expansion:

Data Cleaning (Oct-Nov 2025):

  • find_duplicate_keys.py - Found duplicate study (121475488) sharing same Key
  • remove_duplicate.py - Cleaned master file from 96 → 95 studies
  • test_stem.py - Diagnosed Path.stem behavior with .tei.xml files

Dataset Expansion (Nov 11, 2025):

  • analyze_raw_files.py - Analyzed Master CSV and fulltext_metadata relationship
  • check_pdf_coverage.py - Calculated initial PDF coverage (97/114)
  • find_missing_files.py - Identified 2 missing GROBID files
  • verify_extraction_ready.py - Final verification (114/114 complete)
  • Full logs and mapping files in archive/zotero_sync_nov11/

These scripts are archived for reference but not needed for normal use.


🚀 Quick Start

View the Dataset

# Clone the repository
git clone https://github.com/lsempe77/OM_QEX.git
cd OM_QEX

# View master dataset
cat "data/raw/Master file of included studies (n=114) 11 Nov(data).csv"

# Access full-text files
# data/grobid_outputs/tei/  (114 TEI XML files - structured)
# data/grobid_outputs/text/ (114 TXT files - plain text)

Test LLM Extraction

cd om_qex_extraction

# Two-stage extraction (recommended - best coverage)
python run_twostage_extraction.py --keys PHRKN65M

# Or run modes separately:
python run_extraction.py --mode om --keys PHRKN65M    # Find all outcomes
python run_extraction.py --mode qex --keys PHRKN65M   # Extract detailed stats

# Compare with human ground truth
python compare_om_extractions.py    # Validate outcome identification
python compare_extractions.py       # Validate detailed extraction

# View results
python -c "import pandas as pd; df = pd.read_csv('outputs/twostage/stage2_qex/extracted_data.csv'); print(df[['outcome_category', 'literal_text', 'text_position']])"

Current baseline: 14 outcomes per paper, 56% more than human extraction.

See docs/HUMAN_COMPARISON_RESULTS.md for detailed analysis.

Run Full Extraction

cd om_qex_extraction

# Setup API key (first time only)
cp config/config.yaml.template config/config.yaml
# Edit config.yaml and add your OpenRouter API key

# Install dependencies
pip install -r requirements.txt

# Extract all 114 papers (~10-15 min, ~$0.50-1.00)
python run_extraction.py --all


🔗 Linking IDs to Files

Papers have two identifiers:

  • Study ID (e.g., 121058352) - Used in master file and human extraction
  • Key (e.g., CV27ZK8Q) - Used for GROBID filenames

To find GROBID files for a paper:

  1. Look up Study ID in data/raw/fulltext_metadata.csv
  2. Find corresponding Key in the same row
  3. Access files: data/grobid_outputs/tei/[Key].tei.xml and data/grobid_outputs/text/[Key].txt

Example:

Study ID: 121058352 (Bandiera 2009)
β†’ fulltext_metadata.csv: Key = CV27ZK8Q
β†’ Files: CV27ZK8Q.tei.xml, CV27ZK8Q.txt

Shortcut: Map study IDs to Keys using scripts/map_ids_to_keys.py
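That lookup is a single pass over the metadata CSV; a stdlib sketch (the column names "ID" and "Key" are assumptions — check the actual headers in fulltext_metadata.csv):

```python
import csv

def key_for_study(study_id, metadata_csv="data/raw/fulltext_metadata.csv"):
    """Return the GROBID Key for a Study ID, or None if the study isn't mapped."""
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["ID"] == str(study_id):
                return row["Key"]
    return None

# key = key_for_study(121058352)   # "CV27ZK8Q" per the example above
# tei = f"data/grobid_outputs/tei/{key}.tei.xml"
```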


📊 Test Papers (Human Ground Truth)

For LLM validation, 3 studies have manual human extraction:

| Study ID | Key | Author | Year | Program | Country | Status |
|---|---|---|---|---|---|---|
| 121294984 | PHRKN65M | Burchi & Strupat | 2018 | TEEP | Malawi | ✅ In master (9 outcomes) |
| 121058364 | ABM3E3ZP | Maldonado et al. | 2019 | SOF | Paraguay | ✅ In master |
| 121498842 | - | Mahecha et al. | - | SOF | Paraguay | ❌ Not in master |

Only 2/3 papers can be tested (121498842 was excluded from final dataset).

See om_qex_extraction/TESTING_WORKFLOW.md for testing details.


πŸ“ Notes

πŸ“ Notes

  • Dataset: 114 poverty graduation program studies (cleaned from 96 to 95 after duplicate removal, then expanded to 114 on Nov 11, 2025)
  • Full-text processing: GROBID PDF extraction β†’ TEI XML + plain text
  • ID linking: All Study IDs mapped to Keys via fulltext_metadata.csv
  • LLM extraction: Claude 3.5 Haiku via OpenRouter API
  • Extraction modes: OM (outcome mapping) + QEX (quantitative extraction) + Two-stage pipeline
  • Baseline performance: 14 outcomes per paper (56% more than human extraction)
  • Coverage: Different table selection than human (57% overlap, 6 additional tables found)
  • Status: System working, baseline established, ready for prompt engineering improvements
  • Next steps: Improve table coverage (missing 3/7 human-selected tables), validate precision

📂 Repository Contents

OM_QEX/
├── README.md                          # This file - project overview
├── DOCUMENTATION_UPDATE.md            # Documentation changelog (Nov 10, 2025)
├── EXTRACTION_PLAN.md                 # Original extraction planning document
├── .gitignore                         # Git ignore rules
│
├── data/                              # Dataset files
│   ├── README.md                      # Data documentation with test papers
│   ├── raw/                           # Metadata CSVs
│   │   ├── Master file (n=114).csv    # Primary dataset ✅
│   │   └── fulltext_metadata.csv      # ID → Key mapping
│   ├── human_extraction/              # Ground truth (3 studies, 2 in master)
│   └── grobid_outputs/                # Full-text extractions (114 × 2)
│       ├── tei/                       # TEI XML (structured)
│       └── text/                      # Plain text
│
├── om_qex_extraction/                 # LLM extraction application ⭐
│   ├── README.md                      # App documentation
│   ├── TESTING_WORKFLOW.md            # Complete testing guide
│   ├── TEST_RESULTS.md                # Current baseline & findings
│   ├── QUICK_REFERENCE.md             # Commands cheat sheet
│   ├── COMPARISON_GUIDE.md            # Understanding results
│   ├── EXTRACTION_READY.md            # System documentation
│   ├── run_extraction.py              # Main extraction CLI
│   ├── compare_extractions.py         # LLM vs human comparison
│   ├── requirements.txt               # Python dependencies
│   ├── src/                           # Source code
│   │   ├── models.py                  # Pydantic data models
│   │   ├── tei_parser.py              # TEI XML parser
│   │   ├── extraction_engine.py       # LLM extraction logic
│   │   └── comparer.py                # Comparison system
│   ├── prompts/                       # LLM prompts
│   ├── config/                        # Configuration files
│   └── outputs/                       # Extraction results (gitignored)
│
├── scripts/                           # Utility scripts
│   ├── add_key_column.py              # ID → Key mapping
│   ├── copy_files_by_key.py           # File extraction
│   ├── get_human_study_ids.py         # List test papers
│   └── map_ids_to_keys.py             # ID → Key lookup
│
└── archive/                           # Diagnostic scripts (historical)
    ├── find_duplicate_keys.py         # Found duplicate study
    ├── remove_duplicate.py            # Cleaned master file
    ├── test_stem.py                   # Path.stem diagnostics
    └── ...                            # Other data cleaning tools

🔍 Key Files

  • Start here: om_qex_extraction/TESTING_WORKFLOW.md
  • Master dataset: data/raw/Master file of included studies (n=114) 11 Nov(data).csv
  • Test results: om_qex_extraction/TEST_RESULTS.md
  • Run extraction: om_qex_extraction/run_extraction.py
  • Compare results: om_qex_extraction/compare_extractions.py

🤝 Contributing

This is a research dataset with LLM extraction tools. For questions or improvements:

  • Review existing documentation in om_qex_extraction/
  • Check TEST_RESULTS.md for known issues and improvement roadmap
  • Follow TESTING_WORKFLOW.md for testing changes

📄 License

[Add license information]


Last updated: November 11, 2025
Dataset version: 114 studies (expanded from 95 on Nov 11, 2025)
Extraction system: Dual-mode (OM + QEX) with two-stage pipeline established
Baseline performance: 14 outcomes/paper, 133% improvement over standalone QEX
Status: Ready for prompt engineering optimization
Repository: https://github.com/lsempe77/OM_QEX
