DaneshjouLab · shloknatarajan · Jun 6, 2025
diff --git a/.gitignore b/.gitignore
@@ -24,6 +24,7 @@ data/variantAnnotations/
 data/unique_pmcids.json
 data/pmid_list.json
 data/downloaded_pmcids.json
+data/markdown
 
 *.zip
 *.tar.gz

diff --git a/README.MD b/README.MD
@@ -28,11 +28,11 @@ This repository contains Python scripts for running and building a Pharmacogenom
 |                  | Convert the PMID to PMCID | ✅ |
 |                  | Update to use non-official pmid to pmcid (aaron's method) | |
 |                  | Fetch the content from the PMCID | ✅ |
-| Benchmark        | Create pairings of annotations to articles | |
+| Benchmark        | Create pairings of annotations to articles | ✅ |
 |                  | Create a naive score of number of matches | |
 |                  | Create group wise score | |
 |                  | Look into advanced scoring based on distance from truth per term | |
-| Workflows        | Integrate Aaron's current approach | |
+| Workflows        | Integrate Aaron's current approach | ✅ |
 |                  | Document on individual annotation meanings | |
 |                  | Delegate annotation groupings to team members | |
 | New Article Fetching | Replicate PharmGKB current workflow | |

diff --git a/data/README.md b/data/README.md
@@ -4,7 +4,7 @@ This directory contains the primary data files used by the AutoGKB project.
 
 ## Directory Structure
 
-- **articles/** - Contains XML files of articles from PubMed Central (PMC), identified by their PMCID (e.g., PMC1234567.xml). These articles are used for text mining and information extraction.
+- **articles/** - Contains markdown files of articles from PubMed Central (PMC), identified by their PMCID (e.g., PMC1234567.md). These articles are used for text mining and information extraction.
 
 - **variantAnnotations/** - Contains clinical variant annotations and related data:
   - `var_drug_ann.tsv` - Variant-drug annotations. This is what is used in this repo.
@@ -14,4 +14,8 @@ This directory contains the primary data files used by the AutoGKB project.
   - `pmcid_mapping.json` - Maps between PMIDs and PMCIDs
   - `unique_pmcids.json` - List of unique PMCIDs in the dataset
   - `pmid_list.json` - List of PMIDs in the dataset
-  - `downloaded_pmcids.json` - Tracking which PMCIDs have been downloaded
+  - `downloaded_pmcids.json` - Tracking which PMCIDs have been downloaded
+
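+- **benchmark/** - Contains the benchmark splits and their column mapping:
+  - `train.jsonl`, `test.jsonl`, `val.jsonl` - Train, test, and validation splits that together contain all the data, stored as JSON Lines files
+  - `column_mapping.json` - Maps the column headers from the original `var_drug_ann.tsv` to the keys in the benchmark JSONL files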
diff --git a/data/benchmark/column_mapping.json b/data/benchmark/column_mapping.json
@@ -0,0 +1,27 @@
+{
+    "pmcid": "pmcid",
+    "article_title": "article_title",
+    "article_path": "article_path",
+    "Variant Annotation ID": "variant_annotation_id",
+    "Variant/Haplotypes": "variant_haplotypes",
+    "Gene": "gene",
+    "Drug(s)": "drugs",
+    "PMID": "pmid",
+    "Phenotype Category": "phenotype_category",
+    "Significance": "significance",
+    "Notes": "notes",
+    "Sentence": "sentence",
+    "Alleles": "alleles",
+    "Specialty Population": "specialty_population",
+    "Metabolizer types": "metabolizer_types",
+    "isPlural": "is_plural",
+    "Is/Is Not associated": "is_is_not_associated",
+    "Direction of effect": "direction_of_effect",
+    "PD/PK terms": "pd_pk_terms",
+    "Multiple drugs And/or": "multiple_drugs_and_or",
+    "Population types": "population_types",
+    "Population Phenotypes or diseases": "population_phenotypes_or_diseases",
+    "Multiple phenotypes or diseases And/or": "multiple_phenotypes_or_diseases_and_or",
+    "Comparison Allele(s) or Genotype(s)": "comparison_alleles_or_genotypes",
+    "Comparison Metabolizer types": "comparison_metabolizer_types"
+}
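Reviewer note: a minimal sketch of how a mapping like this could be applied when converting TSV rows to JSONL records. It is not the repository's actual conversion code. The helper name `rename_columns`, the inline sample TSV, and the annotation ID value `1` are hypothetical; the mapping entries and the example variant come from this diff.

```python
import csv
import io
import json

# Subset of column_mapping.json, copied from the diff above.
COLUMN_MAPPING = {
    "Variant Annotation ID": "variant_annotation_id",
    "Variant/Haplotypes": "variant_haplotypes",
    "Gene": "gene",
    "Drug(s)": "drugs",
    "PMID": "pmid",
}


def rename_columns(row, mapping):
    """Return a copy of `row` with TSV headers replaced by benchmark keys.

    Headers missing from the mapping are kept unchanged. This is an
    assumption; the real pipeline may drop or reject them instead.
    """
    return {mapping.get(header, header): value for header, value in row.items()}


# Hypothetical one-row TSV standing in for var_drug_ann.tsv.
SAMPLE_TSV = (
    "Variant Annotation ID\tVariant/Haplotypes\tGene\tDrug(s)\tPMID\n"
    "1\trs2909451\tDPP4\tsitagliptin\t39792745\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
records = [rename_columns(row, COLUMN_MAPPING) for row in reader]

# One JSON object per line is the JSONL convention used by the benchmark files.
jsonl_line = json.dumps(records[0])
print(jsonl_line)
```

Keeping unmapped headers as-is makes a missing mapping entry visible in the output rather than silently losing a column.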
diff --git a/data/benchmark/test.jsonl b/data/benchmark/test.jsonl
diff --git a/data/benchmark/train.jsonl b/data/benchmark/train.jsonl
diff --git a/data/benchmark/val.jsonl b/data/benchmark/val.jsonl
diff --git a/docs/duplicate_pmids.md b/docs/duplicate_pmids.md
@@ -0,0 +1,69 @@
+# Duplicate PMIDs and Data Structure Explanation
+
+## Overview
+
+This document explains why there are fewer unique markdown file names than entries in `parsed_drug_annotations.jsonl` and clarifies the data structure regarding duplicate variant annotations.
+
+## Data Structure Summary
+
+| Data Source | Count | Description |
+|-------------|-------|-------------|
+| `var_drug_ann.tsv` | 12,474 entries | Original variant annotation entries |
+| Unique PMIDs in original data | 4,262 | Unique research papers |
+| Available markdown files | 1,432 | Papers with full text available |
+| `parsed_drug_annotations.jsonl` | 4,516 entries | Annotations with paper content found |
+| Unique PMIDs with paper content | ~1,431 | Unique papers that were successfully processed |
+
+## Why There Are "Duplicate" PMIDs
+
+### Multiple Variant Annotations Per Paper
+
+Many research papers study multiple genetic variants and their associations with drugs. Each variant gets its own annotation entry, even though they come from the same paper.
+
+**Example from PMID 39792745:**
+- `rs2909451` in `DPP4` gene for sitagliptin efficacy
+- `rs2285676` in `KCNJ11` gene for sitagliptin efficacy
+- `rs163184` in `KCNQ1` gene for sitagliptin efficacy
+- `rs4664443` in `DPP4` gene for sitagliptin efficacy
+- `rs1799853` in `CYP2C9` gene for sitagliptin efficacy
+- And several others...
+
+### Data Processing Approach
+
+The conversion process in `convert_to_jsonl.ipynb`:
+
+1. **Processes each annotation individually** - Each row in the TSV becomes one entry
+2. **Adds paper content to each entry** - The full markdown content is attached to every variant annotation from the same paper
+3. **Preserves granular annotations** - Each variant-drug association remains as a separate data point
+
+## Implications
+
+### Storage Efficiency
+- The same paper content is duplicated across multiple entries
+- On average, each paper has about 3 variant annotations (4,516 entries / 1,432 files ≈ 3.2)
+- This results in significant data duplication but preserves analytical granularity
+
+### Analysis Benefits
+- Each variant annotation can be analyzed independently
+- Full paper context is available for each genetic association
+- Researchers can study variant-specific effects while having access to the complete source material
+
+### File Mapping
+- 1,432 unique markdown files map to 4,516 annotation entries
+- Not all PMIDs have corresponding markdown files (some papers may not have been successfully downloaded or processed)
+- Only 4,516 out of 12,474 original annotations have paper content (36% success rate)
+
+## Data Integrity
+
+This structure is **intentional and correct**:
+- ✅ Each variant annotation is treated as an independent data point
+- ✅ Full paper context is preserved for each annotation
+- ✅ Researchers can filter by specific variants, genes, or drugs while maintaining paper context
+- ✅ The relationship between annotations and source papers is maintained via PMID
+
+## Usage Recommendations
+
+When analyzing this data:
+- **For paper-level analysis**: Group by PMID to avoid counting papers multiple times
+- **For variant-level analysis**: Use entries directly as each represents a unique genetic association
+- **For summary statistics**: Be aware that paper counts and annotation counts are different metrics
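Reviewer note: the usage recommendations above can be sketched as follows, assuming each JSONL entry carries the `pmid`, `variant_haplotypes`, `gene`, and `drugs` keys from `column_mapping.json`. The first three entries reuse the PMID 39792745 example from the document; the fourth entry is entirely made up to give the sample a second paper.

```python
from collections import Counter

# Hypothetical entries shaped like lines of parsed_drug_annotations.jsonl.
entries = [
    {"pmid": "39792745", "variant_haplotypes": "rs2909451", "gene": "DPP4", "drugs": "sitagliptin"},
    {"pmid": "39792745", "variant_haplotypes": "rs2285676", "gene": "KCNJ11", "drugs": "sitagliptin"},
    {"pmid": "39792745", "variant_haplotypes": "rs163184", "gene": "KCNQ1", "drugs": "sitagliptin"},
    # Made-up placeholder paper and variant, for illustration only.
    {"pmid": "00000001", "variant_haplotypes": "rs0000001", "gene": "GENE1", "drugs": "drug1"},
]

# Variant-level analysis: every entry is its own data point.
annotation_count = len(entries)

# Paper-level analysis: group by PMID so each paper is counted once.
annotations_per_paper = Counter(entry["pmid"] for entry in entries)
paper_count = len(annotations_per_paper)

# Summary statistics: report both metrics, since they differ.
mean_annotations_per_paper = annotation_count / paper_count

print(f"annotations: {annotation_count}")
print(f"papers: {paper_count}")
print(f"mean annotations per paper: {mean_annotations_per_paper:.1f}")
```

Counting papers with `len(entries)` instead of grouping by PMID would overstate the number of papers by the duplication factor described in this document.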