A curated dataset of 114 studies on poverty graduation programs with full-text extractions and LLM-based data extraction tools.
📊 Comprehensive Documentation: See docs/ for technical reports and performance analysis
✅ Dataset Status (Nov 11, 2025): 114 studies ready for extraction | All GROBID outputs complete
OM_QEX/
├── data/
│   ├── raw/                  # Master CSV (114 studies) + fulltext metadata (673 papers)
│   ├── human_extraction/     # Manual extractions (ground truth)
│   ├── grobid_outputs/       # 114 studies × 2 formats (TEI XML + TXT) ✅
│   └── pdfs_from_zotero/     # 19 PDFs downloaded from Zotero (archived source)
├── om_qex_extraction/        # 🚀 LLM-based extraction app
│   ├── src/                  # Extraction engine and parsers
│   ├── prompts/              # LLM extraction prompts
│   ├── config/               # Configuration (API keys)
│   └── outputs/              # Extracted data (JSON + CSV)
├── docs/                     # 📚 Documentation
│   ├── BASELINE_PERFORMANCE_REPORT.md  # Technical performance analysis
│   ├── BASELINE_RESULTS_EMAIL.md       # Stakeholder summary
│   ├── HUMAN_COMPARISON_RESULTS.md     # LLM vs human comparison
│   └── CLEANUP_LOG.md                  # Project organization log
├── scripts/                  # Data processing utilities (map IDs to keys, copy files)
└── archive/                  # Historical files and one-time scripts
    └── zotero_sync_nov11/    # Zotero sync scripts, logs, and verification tools
        ├── find_missing_in_zotero.py   # Find studies in Zotero by EPPI-Reviewer ID
        └── download_missing_pdfs.py    # Download PDFs from Zotero library
114 included studies on poverty graduation and ultra-poor programs - all ready for extraction ✅
- Expanded from 95 → 114 studies (+19 studies)
- All 114 studies processed through GROBID
- 114 TEI XML files + 114 TXT files ready
- Master file (n=114) - Primary dataset with study metadata
- fulltext_metadata (673 entries) - Maps all 114 study IDs to GROBID file Keys
- Manual data extraction - Ground truth for comparison with LLM extraction
- QEX validation: 8 week SR QEX Pierre SOF and TEEP(Quant Extraction Form).csv (3 studies, detailed fields)
- OM validation: OM_human_extraction.csv (9 valid studies, 57 outcomes total)
- Prompt engineering input - Reference data for developing extraction prompts
- Quality benchmark - Validation standard for automated extraction
See data/README.md for details.
- tei/ - 114 TEI XML files (structured with sections, references, metadata)
- text/ - 114 plain text files (cleaned full-text extraction)
- All files linked via Keys in fulltext_metadata.csv
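Each TEI file can be inspected with nothing but the standard library — a minimal sketch (the repo's real parser lives in om_qex_extraction/src/tei_parser.py; this is not its API):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_paragraphs(tei_xml: str) -> list[str]:
    """Return body paragraph texts from a GROBID TEI document.
    Minimal sketch only — see src/tei_parser.py for the real thing."""
    root = ET.fromstring(tei_xml)
    return [p.text or "" for p in root.iter(TEI_NS + "p")]

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Treatment increased consumption.</p>"
    "</body></text></TEI>"
)
print(tei_paragraphs(sample))  # ['Treatment increased consumption.']
```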
Automated outcome extraction from research papers using LLMs with dual-mode operation.
1. OM (Outcome Mapping) - Comprehensive outcome identification
- Identifies ALL outcomes with statistical analysis
- Simple categorization (outcome_group, outcome_category, location)
- Output: ~14 outcomes per paper
- Use case: Systematic review mapping, outcome inventory
2. QEX (Quantitative Extraction) - Detailed statistical extraction
- Extracts complete statistical data for outcomes
- Full details (effect_size, p_value, sample_sizes, graduation_components)
- Output: Detailed data for meta-analysis
- Use case: Statistical synthesis, detailed data extraction
3. Two-Stage Pipeline - OM guides QEX for maximum coverage
- Stage 1 (OM): Find all outcomes with locations
- Stage 2 (QEX): Extract details using OM hints
- Result: 118% more outcomes than standalone QEX
- 100% OM→QEX conversion rate
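Conceptually, the two-stage flow just chains one LLM call into the next — a sketch with illustrative names (call_llm and the prompt strings stand in for the repo's actual engine and prompt files):

```python
OM_PROMPT = "List ALL outcomes in this paper with their table/page locations."
QEX_PROMPT = "Extract full statistics for each hinted outcome."

def run_two_stage(paper_text, call_llm):
    """Stage 1 (OM) enumerates outcomes; Stage 2 (QEX) extracts details,
    steered by the OM locations so no results table is skipped."""
    om_outcomes = call_llm(OM_PROMPT, paper_text)
    hints = [(o["outcome_category"], o["location"]) for o in om_outcomes]
    return call_llm(QEX_PROMPT, paper_text, hints=hints)

# Toy stand-in for the real API client, for illustration only:
def fake_llm(prompt, text, hints=None):
    if prompt is OM_PROMPT:
        return [{"outcome_category": "income", "location": "Table 6"}]
    return [{"outcome_category": c, "location": l, "effect_size": 0.12}
            for c, l in hints]

print(run_two_stage("paper text here", fake_llm))
```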
cd om_qex_extraction
# Run two-stage extraction (recommended)
python run_twostage_extraction.py --keys PHRKN65M
# Or run modes separately:
python run_extraction.py --mode om --keys PHRKN65M # Find all outcomes
python run_extraction.py --mode qex --keys PHRKN65M  # Extract details

Test Paper: PHRKN65M (Burchi & Strupat 2018, Malawi TEEP)
| Approach | Outcomes Found | Tables Covered | vs Human (9 outcomes) |
|---|---|---|---|
| Human extraction | 9 | Tables 6,8,11,13,15,16,17 | Baseline |
| Regular QEX | 6 | Tables 4,6,7,8,9,10,11 | 67% coverage |
| Two-stage (OM→QEX) | 14 | Tables 5,6,7,9,10,12,13,15,17,18 | 156% of human |
| Improved OM (v2) | 14 | 10 tables | +56% vs human |
Key Findings:
- ✅ Two-stage pipeline: 118% improvement over regular QEX (6 → 14 outcomes)
- ✅ 100% OM→QEX conversion rate (all identified outcomes extracted)
- ⚠️ Different table selection than human (57% overlap)
- ⚠️ Paper has 22 results tables - both human and LLM select subsets
- 📊 LLM found 6 additional tables human didn't extract
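The overlap figures above follow from set arithmetic over the table numbers in the comparison table (human set {6,8,11,13,15,16,17}, two-stage set {5,6,7,9,10,12,13,15,17,18}):

```python
human = {6, 8, 11, 13, 15, 16, 17}                 # tables in the human extraction
two_stage = {5, 6, 7, 9, 10, 12, 13, 15, 17, 18}   # tables in the two-stage run

shared = human & two_stage   # {6, 13, 15, 17}
missed = human - two_stage   # {8, 11, 16} — 3/7 human tables not covered
extra = two_stage - human    # 6 tables the human did not extract

print(f"overlap: {len(shared)}/{len(human)} = {len(shared)/len(human):.0%}")
# overlap: 4/7 = 57%
```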
Verification fields added:
- literal_text: Exact quote from paper for manual verification
- text_position: Precise location (Table X, Row Y, Column Z)
- HUMAN_COMPARISON_RESULTS.md - Detailed comparison analysis ⭐ NEW
- TESTING_WORKFLOW.md - Complete testing guide
- TEST_RESULTS.md - Historical baseline results
- QUICK_REFERENCE.md - Commands cheat sheet
- Architecture: Dual-mode (OM + QEX) with two-stage pipeline
- Model: Claude 3.5 Haiku via OpenRouter API
- Extraction: Working end-to-end with network retry logic
- Test papers: 2 papers analyzed (PHRKN65M, ABM3E3ZP)
- Baseline: 14 outcomes per paper (56% more than human extraction)
- Coverage: Finding different tables than human - not necessarily worse
- Precision: To be validated (next step)
- Status: Ready for prompt engineering improvements
- ✅ Dual-mode extraction: OM (outcome mapping) + QEX (quantitative extraction)
- ✅ Two-stage pipeline: OM guides QEX for 118% better coverage
- ✅ TEI XML parser for GROBID outputs
- ✅ Verification fields (literal_text, text_position) for manual checking
- ✅ Batch processing with robust network retry logic
- ✅ JSON + CSV output formats
- ✅ Handles complex multi-outcome papers (10-20+ outcomes per paper)
- ✅ Comprehensive results scanning (continues through entire paper)
OM (Outcome Mapping) Fields:
- outcome_group (high-level category: Poverty, Income, Assets, etc.)
- outcome_category (specific outcome name)
- location (page, table, section reference)
- literal_text (exact quote from paper)
- text_position (precise location for verification)
QEX (Quantitative Extraction) Fields:
- All OM fields plus:
- outcome_description, evaluation_design, sample_sizes
- effect_size, standard_error, p_value, confidence_interval
- graduation_components (7 components: consumption, healthcare, assets, skills, savings, coaching, social)
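The repo's src/models.py defines these as Pydantic models; the same shape can be sketched with stdlib dataclasses (field names from the lists above, types assumed):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OMOutcome:
    outcome_group: str      # high-level category, e.g. "Income"
    outcome_category: str   # specific outcome name
    location: str           # page/table/section reference
    literal_text: str       # exact quote for manual verification
    text_position: str      # e.g. "Table 6, Row 2, Column 3"

@dataclass
class QEXOutcome(OMOutcome):
    # All OM fields plus detailed statistics
    outcome_description: Optional[str] = None
    evaluation_design: Optional[str] = None
    sample_sizes: Optional[str] = None
    effect_size: Optional[float] = None
    standard_error: Optional[float] = None
    p_value: Optional[float] = None
    confidence_interval: Optional[str] = None
    graduation_components: List[str] = field(default_factory=list)

row = QEXOutcome("Income", "household income", "Table 6",
                 "quoted text", "Table 6, Row 2", effect_size=0.12, p_value=0.03)
```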
See om_qex_extraction/prompts/ for prompt templates.
Utility scripts for data management and analysis:
- map_ids_to_keys.py - Maps study IDs to GROBID Keys for extraction
- copy_files_by_key.py - Extracts GROBID outputs for specific paper Keys
Download PDFs from Zotero library for studies missing GROBID outputs:
- find_missing_in_zotero.py - Search Zotero library for studies by EPPI-Reviewer ID
  - Searches 1,600+ Zotero items
  - Matches via extra field containing study IDs
  - Outputs mapping: Study ID → Zotero Key → PDF status
- download_missing_pdfs.py - Download PDFs from Zotero
  - Uses mapping CSV from find_missing_in_zotero.py
  - Downloads only missing PDFs (skips existing)
  - Saves to data/pdfs_from_zotero/
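The matching step in find_missing_in_zotero.py boils down to scanning each item's free-text extra field for study IDs; a minimal sketch (the 9-digit pattern is an assumption based on IDs like 121058352 in this dataset):

```python
import re

def study_ids_in_extra(extra: str) -> list[str]:
    """Extract EPPI-Reviewer-style numeric study IDs from a Zotero item's
    `extra` field. The 9-digit pattern is assumed from this dataset's IDs."""
    return re.findall(r"\b\d{9}\b", extra or "")

print(study_ids_in_extra("EPPI ID: 121058352; imported 2025"))  # ['121058352']
```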
Usage:
# 1. Find studies in Zotero
python find_missing_in_zotero.py
# 2. Download PDFs
python download_missing_pdfs.py

See archive/zotero_sync_nov11/README.md for full documentation.
Historical diagnostic scripts used during data cleaning and expansion:
Data Cleaning (Oct-Nov 2025):
- find_duplicate_keys.py - Found duplicate study (121475488) sharing same Key
- remove_duplicate.py - Cleaned master file from 96 → 95 studies
- test_stem.py - Diagnosed Path.stem behavior with .tei.xml files
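The Path.stem quirk that test_stem.py diagnosed is easy to reproduce: with a double extension like .tei.xml, stem strips only the final suffix:

```python
from pathlib import Path

p = Path("CV27ZK8Q.tei.xml")
print(p.stem)      # 'CV27ZK8Q.tei' — not 'CV27ZK8Q'
print(p.suffixes)  # ['.tei', '.xml']

# Recover the bare Key by stripping every suffix:
key = p.name[: -len("".join(p.suffixes))] if p.suffixes else p.name
print(key)         # 'CV27ZK8Q'
```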
Dataset Expansion (Nov 11, 2025):
- analyze_raw_files.py - Analyzed Master CSV and fulltext_metadata relationship
- check_pdf_coverage.py - Calculated initial PDF coverage (97/114)
- find_missing_files.py - Identified 2 missing GROBID files
- verify_extraction_ready.py - Final verification (114/114 complete)
- Full logs and mapping files in archive/zotero_sync_nov11/
These scripts are archived for reference but not needed for normal use.
# Clone the repository
git clone https://github.com/lsempe77/OM_QEX.git
cd OM_QEX
# View master dataset
cat "data/raw/Master file of included studies (n=114) 11 Nov(data).csv"
# Access full-text files
# data/grobid_outputs/tei/  (114 TEI XML files - structured)
# data/grobid_outputs/text/ (114 TXT files - plain text)

cd om_qex_extraction
# Two-stage extraction (recommended - best coverage)
python run_twostage_extraction.py --keys PHRKN65M
# Or run modes separately:
python run_extraction.py --mode om --keys PHRKN65M # Find all outcomes
python run_extraction.py --mode qex --keys PHRKN65M # Extract detailed stats
# Compare with human ground truth
python compare_om_extractions.py # Validate outcome identification
python compare_extractions.py # Validate detailed extraction
# View results
python -c "import pandas as pd; df = pd.read_csv('outputs/twostage/stage2_qex/extracted_data.csv'); print(df[['outcome_category', 'literal_text', 'text_position']])"

Current baseline: 14 outcomes per paper, 56% more than human extraction.
See HUMAN_COMPARISON_RESULTS.md for detailed analysis.
cd om_qex_extraction
# Setup API key (first time only)
cp config/config.yaml.template config/config.yaml
# Edit config.yaml and add your OpenRouter API key
# Install dependencies
pip install -r requirements.txt
# Extract all 114 papers (roughly 10-15 min and $0.50-1.00, estimated on the earlier 95-paper set)
python run_extraction.py --all

Papers have two identifiers:
- Study ID (e.g., 121058352) - Used in master file and human extraction
- Key (e.g., CV27ZK8Q) - Used for GROBID filenames
To find GROBID files for a paper:
1. Look up the Study ID in data/raw/fulltext_metadata.csv
2. Find the corresponding Key in the same row
3. Access files: data/grobid_outputs/tei/[Key].tei.xml and data/grobid_outputs/text/[Key].txt

Example:
Study ID: 121058352 (Bandiera 2009)
→ fulltext_metadata.csv: Key = CV27ZK8Q
→ Files: CV27ZK8Q.tei.xml, CV27ZK8Q.txt

Shortcut: Map study IDs to Keys using scripts/map_ids_to_keys.py
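The lookup above takes only a few lines of stdlib Python (the column names studyid and Key are assumptions — check the actual fulltext_metadata.csv header):

```python
import csv
import io

def key_for_study_id(study_id, csv_text, id_col="studyid", key_col="Key"):
    """Map a Study ID to its GROBID Key from fulltext_metadata.csv content.
    Column names are assumed — adjust them to the real header."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get(id_col, "").strip() == str(study_id):
            return row.get(key_col, "").strip()
    return None

# Example mirroring Bandiera 2009 above:
sample = "studyid,Key\n121058352,CV27ZK8Q\n"
print(key_for_study_id("121058352", sample))  # CV27ZK8Q
```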
For LLM validation, 3 studies have manual human extraction:
| Study ID | Key | Author | Year | Program | Country | Status |
|---|---|---|---|---|---|---|
| 121294984 | PHRKN65M | Burchi & Strupat | 2018 | TEEP | Malawi | ✅ In master (9 outcomes) |
| 121058364 | ABM3E3ZP | Maldonado et al. | 2019 | SOF | Paraguay | ✅ In master |
| 121498842 | - | Mahecha et al. | - | SOF | Paraguay | ❌ Not in master |
Only 2/3 papers can be tested (121498842 was excluded from final dataset).
See om_qex_extraction/TESTING_WORKFLOW.md for testing details.
- Dataset: 114 poverty graduation program studies (expanded from 95 on Nov 11, 2025; earlier duplicate removed)
- Full-text processing: GROBID PDF extraction → TEI XML + plain text
- ID linking: All Study IDs mapped to Keys via fulltext_metadata.csv
- LLM extraction: Claude 3.5 Haiku via OpenRouter API
- Extraction modes: OM (outcome mapping) + QEX (quantitative extraction) + Two-stage pipeline
- Baseline performance: 14 outcomes per paper (56% more than human extraction)
- Coverage: Different table selection than human (57% overlap, 6 additional tables found)
- Status: System working, baseline established, ready for prompt engineering improvements
- Next steps: Improve table coverage (missing 3/7 human-selected tables), validate precision
OM_QEX/
├── README.md                      # This file - project overview
├── DOCUMENTATION_UPDATE.md        # Documentation changelog (Nov 10, 2025)
├── EXTRACTION_PLAN.md             # Original extraction planning document
├── .gitignore                     # Git ignore rules
│
├── data/                          # Dataset files
│   ├── README.md                  # Data documentation with test papers
│   ├── raw/                       # Metadata CSVs
│   │   ├── Master file (n=114).csv  # Primary dataset ✅
│   │   └── fulltext_metadata.csv    # ID → Key mapping
│   ├── human_extraction/          # Ground truth (3 studies, 2 in master)
│   └── grobid_outputs/            # Full-text extractions (114 × 2)
│       ├── tei/                   # TEI XML (structured)
│       └── text/                  # Plain text
│
├── om_qex_extraction/             # LLM extraction application ✅
│   ├── README.md                  # App documentation
│   ├── TESTING_WORKFLOW.md        # Complete testing guide
│   ├── TEST_RESULTS.md            # Current baseline & findings
│   ├── QUICK_REFERENCE.md         # Commands cheat sheet
│   ├── COMPARISON_GUIDE.md        # Understanding results
│   ├── EXTRACTION_READY.md        # System documentation
│   ├── run_extraction.py          # Main extraction CLI
│   ├── compare_extractions.py     # LLM vs human comparison
│   ├── requirements.txt           # Python dependencies
│   ├── src/                       # Source code
│   │   ├── models.py              # Pydantic data models
│   │   ├── tei_parser.py          # TEI XML parser
│   │   ├── extraction_engine.py   # LLM extraction logic
│   │   └── comparer.py            # Comparison system
│   ├── prompts/                   # LLM prompts
│   ├── config/                    # Configuration files
│   └── outputs/                   # Extraction results (gitignored)
│
├── scripts/                       # Utility scripts
│   ├── add_key_column.py          # ID → Key mapping
│   ├── copy_files_by_key.py       # File extraction
│   ├── get_human_study_ids.py     # List test papers
│   └── map_ids_to_keys.py         # ID → Key lookup
│
└── archive/                       # Diagnostic scripts (historical)
    ├── find_duplicate_keys.py     # Found duplicate study
    ├── remove_duplicate.py        # Cleaned master file
    ├── test_stem.py               # Path.stem diagnostics
    └── ...                        # Other data cleaning tools
- Start here: om_qex_extraction/TESTING_WORKFLOW.md
- Master dataset: data/raw/Master file of included studies (n=114) 11 Nov(data).csv
- Test results: om_qex_extraction/TEST_RESULTS.md
- Run extraction: om_qex_extraction/run_extraction.py
- Compare results: om_qex_extraction/compare_extractions.py
This is a research dataset with LLM extraction tools. For questions or improvements:
- Review existing documentation in om_qex_extraction/
- Check TEST_RESULTS.md for known issues and improvement roadmap
- Follow TESTING_WORKFLOW.md for testing changes
[Add license information]
Last updated: November 10, 2025
Dataset version: 114 studies (expanded from 95 on Nov 11, 2025)
Extraction system: Dual-mode (OM + QEX) with two-stage pipeline established
Baseline performance: 14 outcomes/paper, 118% improvement over standalone QEX
Status: Ready for prompt engineering optimization
Repository: https://github.com/lsempe77/OM_QEX