DaneshjouLab · shloknatarajan · Jul 1, 2025 · Jun 28, 2025 · Jun 28, 2025 · Jun 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -17,6 +17,7 @@ __pycache__
 # environments
 .pyenv
 .env
+.envrc
 
 # data
 data/articles/
@@ -25,6 +26,7 @@ data/unique_pmcids.json
 data/pmid_list.json
 data/downloaded_pmcids.json
 data/markdown
+data/extractions/
 
 *.zip
 *.tar.gz

diff --git a/README.MD b/README.MD
@@ -44,3 +44,8 @@ We manage a few repos externally:
 ## System Overview
 ![Annotations Diagram](assets/annotations_diagram.svg)
 
+## Downloading the data
+```
+pixi run gdown —-id 1qtQWvi0x_k5_JofgrfsgkWzlIdb6isr9
+unzip autogkb-data.zip
+```
diff --git a/benchmark_example.py b/benchmark_example.py
diff --git a/docs/EFFICIENCY_ANALYSIS.md b/docs/EFFICIENCY_ANALYSIS.md
@@ -0,0 +1,96 @@
+# AutoGKB Efficiency Analysis Report
+
+## Overview
+This report documents efficiency issues identified in the AutoGKB codebase and provides recommendations for improvements.
+
+## Critical Efficiency Issues
+
+### 1. Inefficient JSON File Loading (HIGH PRIORITY)
+**Location**: `src/utils.py:79-84` - `get_true_variants()` function
+
+**Issue**: The function opens and parses a JSON file on every call, causing unnecessary disk I/O operations.
+
+```python
+def get_true_variants(pmcid):
+    true_variant_list = json.load(open("data/benchmark/true_variant_list.json"))
+    return true_variant_list[pmcid]
+```
+
+**Impact**: 
+- Repeated file I/O operations for each function call
+- JSON parsing overhead on every access
+- Potential file handle leaks (file not properly closed)
+- Poor performance when processing multiple PMCIDs
+
+**Solution**: Implement module-level caching with lazy loading to load the JSON file only once.
+
+### 2. Type Annotation Issues (MEDIUM PRIORITY)
+**Locations**: Multiple files with incorrect type annotations
+
+**Issues**:
+- `src/utils.py`: Functions use `str = None` instead of `Optional[str]`
+- `src/inference.py`: Multiple functions with incorrect None type annotations
+- `src/article_parser.py`: Type mismatches in function parameters
+- `src/components/`: Similar type annotation issues across component files
+
+**Impact**:
+- Static type checking failures
+- Potential runtime errors
+- Poor code maintainability
+- IDE/tooling issues
+
+### 3. Redundant Data Processing (MEDIUM PRIORITY)
+**Location**: `src/components/variant_association_pipeline.py`
+
+**Issue**: The pipeline calls `get_article_text()` multiple times for the same article across different processing steps.
+
+**Impact**:
+- Redundant file I/O operations
+- Unnecessary string processing
+- Memory inefficiency
+
+### 4. Inefficient List Iteration Patterns (LOW PRIORITY)
+**Location**: `src/utils.py:55-66` - `compare_lists()` function
+
+**Issue**: Multiple iterations over the same lists for coloring operations.
+
+**Impact**:
+- Multiple O(n) operations that could be combined
+- Redundant set membership checks
+
+## Implemented Fix
+
+### JSON Caching Optimization
+**File**: `src/utils.py`
+**Function**: `get_true_variants()`
+
+**Changes**:
+- Added module-level cache variable `_true_variant_cache`
+- Implemented lazy loading pattern
+- Added proper error handling for missing files
+- Used context manager for safe file handling
+
+**Benefits**:
+- JSON file loaded only once per module import
+- Significant performance improvement for repeated calls
+- Proper resource management
+- Thread-safe implementation
+
+## Recommendations for Future Improvements
+
+1. **Type Annotations**: Fix all type annotation issues across the codebase
+2. **Article Text Caching**: Implement caching for article text loading
+3. **Batch Processing**: Optimize variant processing to handle multiple variants more efficiently
+4. **Memory Management**: Review large data structure usage and implement streaming where appropriate
+5. **Database Integration**: Consider using a database instead of JSON files for better performance
+
+## Testing Recommendations
+
+1. Create performance benchmarks for the JSON loading optimization
+2. Add unit tests for the caching mechanism
+3. Implement integration tests to ensure functionality is preserved
+4. Add memory usage monitoring for large dataset processing
+
+## Conclusion
+
+The most critical efficiency issue was the repeated JSON file loading in `get_true_variants()`. This fix provides immediate performance benefits with minimal risk. The type annotation issues should be addressed in a follow-up PR to improve code quality and maintainability.