Merged
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -24,6 +24,7 @@ data/variantAnnotations/
data/unique_pmcids.json
data/pmid_list.json
data/downloaded_pmcids.json
data/markdown

*.zip
*.tar.gz
4 changes: 2 additions & 2 deletions README.MD
@@ -28,11 +28,11 @@ This repository contains Python scripts for running and building a Pharmacogenom
| | Convert the PMID to PMCID | ✅ |
| | Update to use non-official pmid to pmcid (aaron's method) | |
| | Fetch the content from the PMCID | ✅ |
| Benchmark | Create pairings of annotations to articles | |
| Benchmark | Create pairings of annotations to articles | |
| | Create a naive score of number of matches | |
| | Create group wise score | |
| | Look into advanced scoring based on distance from truth per term | |
| Workflows | Integrate Aaron's current approach | |
| Workflows | Integrate Aaron's current approach | |
| | Document on individual annotation meanings | |
| | Delegate annotation groupings to team members | |
| New Article Fetching | Replicate PharmGKB current workflow | |
8 changes: 6 additions & 2 deletions data/README.md
@@ -4,7 +4,7 @@ This directory contains the primary data files used by the AutoGKB project.

## Directory Structure

- **articles/** - Contains XML files of articles from PubMed Central (PMC), identified by their PMCID (e.g., PMC1234567.xml). These articles are used for text mining and information extraction.
- **articles/** - Contains markdown files of articles from PubMed Central (PMC), identified by their PMCID (e.g., PMC1234567.xml). These articles are used for text mining and information extraction.

- **variantAnnotations/** - Contains clinical variant annotations and related data:
- `var_drug_ann.tsv` - Variant-drug annotations. This is what is used in this repo.
@@ -14,4 +14,8 @@ This directory contains the primary data files used by the AutoGKB project.
- `pmcid_mapping.json` - Maps between PMIDs and PMCIDs
- `unique_pmcids.json` - List of unique PMCIDs in the dataset
- `pmid_list.json` - List of PMIDs in the dataset
- `downloaded_pmcids.json` - Tracking which PMCIDs have been downloaded
- `downloaded_pmcids.json` - Tracking which PMCIDs have been downloaded

- **benchmark**
- `train.jsonl`, `test.jsonl`, `val.jsonl` - Train, test, and validation splits that together contain all the data, stored as JSONL files
- `column_mapping.json` - Maps the column headers from the original var_drug_ann.tsv to the keys in the benchmark jsonl files
27 changes: 27 additions & 0 deletions data/benchmark/column_mapping.json
@@ -0,0 +1,27 @@
{
"pmcid": "pmcid",
"article_title": "article_title",
"article_path": "article_path",
"Variant Annotation ID": "variant_annotation_id",
"Variant/Haplotypes": "variant_haplotypes",
"Gene": "gene",
"Drug(s)": "drugs",
"PMID": "pmid",
"Phenotype Category": "phenotype_category",
"Significance": "significance",
"Notes": "notes",
"Sentence": "sentence",
"Alleles": "alleles",
"Specialty Population": "specialty_population",
"Metabolizer types": "metabolizer_types",
"isPlural": "is_plural",
"Is/Is Not associated": "is_is_not_associated",
"Direction of effect": "direction_of_effect",
"PD/PK terms": "pd_pk_terms",
"Multiple drugs And/or": "multiple_drugs_and_or",
"Population types": "population_types",
"Population Phenotypes or diseases": "population_phenotypes_or_diseases",
"Multiple phenotypes or diseases And/or": "multiple_phenotypes_or_diseases_and_or",
"Comparison Allele(s) or Genotype(s)": "comparison_alleles_or_genotypes",
"Comparison Metabolizer types": "comparison_metabolizer_types"
}
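
A minimal sketch of how a mapping like this could be applied when converting TSV rows into benchmark records. The mapping is inlined as a subset so the example is self-contained, the helper name `rename_columns` is hypothetical, and the annotation ID in the sample row is a made-up placeholder; the actual conversion code in the repository may differ.

```python
import csv
import io

# Subset of column_mapping.json, inlined so the sketch runs on its own.
COLUMN_MAPPING = {
    "Variant Annotation ID": "variant_annotation_id",
    "Variant/Haplotypes": "variant_haplotypes",
    "Gene": "gene",
    "Drug(s)": "drugs",
    "PMID": "pmid",
}


def rename_columns(row, mapping):
    """Return a copy of `row` with TSV headers replaced by benchmark keys.

    Headers that are absent from the mapping are kept unchanged.
    """
    return {mapping.get(header, header): value for header, value in row.items()}


# One-row TSV for illustration; the annotation ID "1" is a placeholder.
sample_tsv = (
    "Variant Annotation ID\tVariant/Haplotypes\tGene\tDrug(s)\tPMID\n"
    "1\trs2909451\tDPP4\tsitagliptin\t39792745\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = [rename_columns(row, COLUMN_MAPPING) for row in reader]
print(rows[0])
```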
453 changes: 453 additions & 0 deletions data/benchmark/test.jsonl

Large diffs are not rendered by default.

3,612 changes: 3,612 additions & 0 deletions data/benchmark/train.jsonl

Large diffs are not rendered by default.

451 changes: 451 additions & 0 deletions data/benchmark/val.jsonl

Large diffs are not rendered by default.

69 changes: 69 additions & 0 deletions docs/duplicate_pmids.md
@@ -0,0 +1,69 @@
# Duplicate PMIDs and Data Structure Explanation

## Overview

This document explains why there are fewer unique markdown file names than entries in `parsed_drug_annotations.jsonl` and clarifies the data structure regarding duplicate variant annotations.

## Data Structure Summary

| Data Source | Count | Description |
|-------------|-------|-------------|
| `var_drug_ann.tsv` | 12,474 entries | Original variant annotation entries |
| Unique PMIDs in original data | 4,262 | Unique research papers |
| Available markdown files | 1,432 | Papers with full text available |
| `parsed_drug_annotations.jsonl` | 4,516 entries | Annotations with paper content found |
| Unique PMIDs with paper content | ~1,431 | Unique papers that were successfully processed |

## Why There Are "Duplicate" PMIDs

### Multiple Variant Annotations Per Paper

Many research papers study multiple genetic variants and their associations with drugs. Each variant gets its own annotation entry, even though they come from the same paper.

**Example from PMID 39792745:**
- `rs2909451` in `DPP4` gene for sitagliptin efficacy
- `rs2285676` in `KCNJ11` gene for sitagliptin efficacy
- `rs163184` in `KCNQ1` gene for sitagliptin efficacy
- `rs4664443` in `DPP4` gene for sitagliptin efficacy
- `rs1799853` in `CYP2C9` gene for sitagliptin efficacy
- And several others...

### Data Processing Approach

The conversion process in `convert_to_jsonl.ipynb`:

1. **Processes each annotation individually** - Each row in the TSV becomes one entry
2. **Adds paper content to each entry** - The full markdown content is attached to every variant annotation from the same paper
3. **Preserves granular annotations** - Each variant-drug association remains as a separate data point
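
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name `build_entries`, the `article_text` key, and the in-memory `markdown_by_pmid` lookup are all hypothetical, and the third sample row is a placeholder for a paper with no full text.

```python
import json


def build_entries(annotation_rows, markdown_by_pmid):
    """Turn each annotation row into one entry with its paper text attached.

    Rows whose paper has no markdown content are skipped (assumed behavior,
    based on the output holding only annotations with paper content found).
    """
    entries = []
    for row in annotation_rows:
        content = markdown_by_pmid.get(row["pmid"])
        if content is None:
            continue  # no full text available for this paper
        entry = dict(row)                # one entry per TSV row
        entry["article_text"] = content  # hypothetical key; same text repeated per annotation
        entries.append(entry)
    return entries


annotation_rows = [
    {"pmid": "39792745", "variant_haplotypes": "rs2909451", "gene": "DPP4"},
    {"pmid": "39792745", "variant_haplotypes": "rs163184", "gene": "KCNQ1"},
    {"pmid": "00000000", "variant_haplotypes": "rs0", "gene": "GENE_X"},  # placeholder row
]
markdown_by_pmid = {"39792745": "# Placeholder article text"}

entries = build_entries(annotation_rows, markdown_by_pmid)
jsonl = "\n".join(json.dumps(entry) for entry in entries)
print(jsonl)
```

Both entries for PMID 39792745 carry the same article text, which is the duplication discussed in the next section.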

## Implications

### Storage Efficiency
- The same paper content is duplicated across multiple entries
- On average, each paper with full text has about 3 variant annotations (4,516 entries / 1,432 files ≈ 3.2)
- This results in significant data duplication but preserves analytical granularity

### Analysis Benefits
- Each variant annotation can be analyzed independently
- Full paper context is available for each genetic association
- Researchers can study variant-specific effects while having access to the complete source material

### File Mapping
- 1,432 unique markdown files map to 4,516 annotation entries
- Not all PMIDs have corresponding markdown files (some papers may not have been successfully downloaded or processed)
- Only 4,516 out of 12,474 original annotations have paper content (36% success rate)
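
The ratios quoted above can be rechecked from the counts in the summary table. The variable names below are illustrative only.

```python
# Counts taken from the Data Structure Summary table.
total_annotations = 12_474        # rows in var_drug_ann.tsv
annotations_with_content = 4_516  # entries in parsed_drug_annotations.jsonl
markdown_files = 1_432            # papers with full text available

# Share of original annotations that have paper content attached.
coverage = annotations_with_content / total_annotations

# Mean number of annotation entries per available markdown file.
mean_annotations_per_paper = annotations_with_content / markdown_files

print(f"coverage: {coverage:.1%}")
print(f"annotations per paper: {mean_annotations_per_paper:.2f}")
```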

## Data Integrity

This structure is **intentional and correct**:
- ✅ Each variant annotation is treated as an independent data point
- ✅ Full paper context is preserved for each annotation
- ✅ Researchers can filter by specific variants, genes, or drugs while maintaining paper context
- ✅ The relationship between annotations and source papers is maintained via PMID

## Usage Recommendations

When analyzing this data:
- **For paper-level analysis**: Group by PMID to avoid counting papers multiple times
- **For variant-level analysis**: Use entries directly as each represents a unique genetic association
- **For summary statistics**: Be aware that paper counts and annotation counts are different metrics
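
A short sketch of the difference between the two counting modes, assuming each entry exposes a `pmid` key (as in `column_mapping.json`). The sample entries are illustrative; the PMID `00000001` and its variant are placeholders.

```python
from collections import Counter

# Illustrative entries; real ones would be read from the JSONL file.
entries = [
    {"pmid": "39792745", "variant_haplotypes": "rs2909451"},
    {"pmid": "39792745", "variant_haplotypes": "rs2285676"},
    {"pmid": "39792745", "variant_haplotypes": "rs163184"},
    {"pmid": "00000001", "variant_haplotypes": "rs0"},  # placeholder paper
]

# Variant-level analysis: every entry counts once.
num_annotations = len(entries)

# Paper-level analysis: group by PMID so each paper counts once.
annotations_per_paper = Counter(entry["pmid"] for entry in entries)
num_papers = len(annotations_per_paper)

print(f"{num_annotations} annotations across {num_papers} papers")
```

Reporting both numbers side by side keeps paper counts and annotation counts from being confused in summary statistics.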