Merged
47 commits
644d967
feat: efficiency improvements - cache JSON loading and fix type annot…
devin-ai-integration[bot] Jun 28, 2025
0c9b458
Add linux-64 platform support for Devin setup
shloknatarajan Jun 28, 2025
4d2cdf1
Add drug annotation extraction component
devin-ai-integration[bot] Jun 28, 2025
0b90bdf
Update drug annotation extraction to process variants individually
devin-ai-integration[bot] Jun 28, 2025
4630fc4
Merge pull request #2 from shloknatarajan/devin/1751131840-drug-annot…
shloknatarajan Jun 28, 2025
5d3106b
Merge branch 'main' into devin/1751130154-efficiency-improvements
shloknatarajan Jun 28, 2025
883c836
Merge pull request #1 from shloknatarajan/devin/1751130154-efficiency…
shloknatarajan Jun 28, 2025
296b6f6
feat: gdown data downloading
shloknatarajan Jun 28, 2025
2d8a132
feat: gdown pixi command
shloknatarajan Jun 28, 2025
cae9aaf
fix: updated command
shloknatarajan Jun 28, 2025
6cd116c
feat: envrc gitignore
shloknatarajan Jun 28, 2025
07507a3
fix: remove zip after unzipping
shloknatarajan Jun 28, 2025
ed1ca8d
chore: black formatting
shloknatarajan Jun 28, 2025
a6449e6
feat: implement phenotype and functional annotation extraction compon…
devin-ai-integration[bot] Jun 28, 2025
f243af9
test: move test files to tests/ folder for better organization
devin-ai-integration[bot] Jun 28, 2025
54fe073
fix: update import paths for tests moved to tests/ folder
devin-ai-integration[bot] Jun 28, 2025
f1f0021
Merge pull request #3 from shloknatarajan/devin/1751134713-phenotype-…
shloknatarajan Jun 28, 2025
2deec50
feat: moves tests into folder
shloknatarajan Jun 29, 2025
fdd46ec
feat: basic fuser, all_associations (both untested)
shloknatarajan Jun 30, 2025
49893f6
chore: comment
shloknatarajan Jun 30, 2025
dbe40f3
feat: all associations prompt updates
shloknatarajan Jul 1, 2025
b5c91ef
feat: drug annotation prompt updates
shloknatarajan Jul 1, 2025
6b9a673
chore: moved old components to deprecated folder
shloknatarajan Jul 1, 2025
3cf9c5b
fix: deprecated imports
shloknatarajan Jul 1, 2025
f55ffe9
feat: file movements and started phenotype annotation
shloknatarajan Jul 1, 2025
941f1cf
feat: phenotype annotation prompt
shloknatarajan Jul 1, 2025
505bce6
feat: FA and removed old main code
shloknatarajan Jul 1, 2025
cf14e9d
fix: removed unused tests
shloknatarajan Jul 1, 2025
afce8b2
feat: study parameters prompt
shloknatarajan Jul 1, 2025
fea928e
fix: removed old testing function
shloknatarajan Jul 1, 2025
c4baaee
fix: updated inference types
shloknatarajan Jul 1, 2025
05121c0
fix: updated all inference types
shloknatarajan Jul 1, 2025
43b342e
checkpoint: debugging all associations run
shloknatarajan Jul 1, 2025
f985500
feat: working get all associations
shloknatarajan Jul 1, 2025
72c32b9
checkpoint: json output of all associations + generator
shloknatarajan Jul 1, 2025
fb76426
chore: types and gitignore
shloknatarajan Jul 1, 2025
04b02c6
checkpoint: almost working drug annotation
shloknatarajan Jul 1, 2025
0f3fdf4
feat: working drug annotation extraction
shloknatarajan Jul 1, 2025
be32867
feat: untested functional annotation
shloknatarajan Jul 1, 2025
91ea3d7
chore: black formatting
shloknatarajan Jul 1, 2025
62f051f
feat: untested study parameters
shloknatarajan Jul 1, 2025
7922fa0
feat: (untested) complete annotation generation pipeline
shloknatarajan Jul 1, 2025
8a9f821
feat: final annotation saving (untested)
shloknatarajan Jul 1, 2025
dfae818
chore: black
shloknatarajan Jul 1, 2025
5f86d89
Merge branch 'main' of https://github.com/daneshjoulab/autogkb
shloknatarajan Jul 1, 2025
8bdf492
feat: full working pipeline run
shloknatarajan Jul 1, 2025
a6d05de
chore: black formatting
shloknatarajan Jul 1, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -17,6 +17,7 @@ __pycache__
# environments
.pyenv
.env
.envrc

# data
data/articles/
@@ -25,6 +26,7 @@ data/unique_pmcids.json
data/pmid_list.json
data/downloaded_pmcids.json
data/markdown
data/extractions/

*.zip
*.tar.gz
5 changes: 5 additions & 0 deletions README.MD
@@ -44,3 +44,8 @@ We manage a few repos externally:
## System Overview
![Annotations Diagram](assets/annotations_diagram.svg)

## Downloading the data
```
pixi run gdown --id 1qtQWvi0x_k5_JofgrfsgkWzlIdb6isr9
unzip autogkb-data.zip
```
274 changes: 0 additions & 274 deletions benchmark_example.py

This file was deleted.

96 changes: 96 additions & 0 deletions docs/EFFICIENCY_ANALYSIS.md
@@ -0,0 +1,96 @@
# AutoGKB Efficiency Analysis Report

## Overview
This report documents efficiency issues identified in the AutoGKB codebase and provides recommendations for improvements.

## Critical Efficiency Issues

### 1. Inefficient JSON File Loading (HIGH PRIORITY)
**Location**: `src/utils.py:79-84` - `get_true_variants()` function

**Issue**: The function opens and parses a JSON file on every call, causing unnecessary disk I/O operations.

```python
def get_true_variants(pmcid):
    # Re-opens and re-parses the file on every call; the handle is never closed explicitly.
    true_variant_list = json.load(open("data/benchmark/true_variant_list.json"))
    return true_variant_list[pmcid]
```

**Impact**:
- Repeated file I/O operations for each function call
- JSON parsing overhead on every access
- Potential file handle leaks (file not properly closed)
- Poor performance when processing multiple PMCIDs

**Solution**: Implement module-level caching with lazy loading to load the JSON file only once.

### 2. Type Annotation Issues (MEDIUM PRIORITY)
**Locations**: Multiple files with incorrect type annotations

**Issues**:
- `src/utils.py`: Parameters are annotated as `str = None` instead of `Optional[str] = None`
- `src/inference.py`: Multiple functions with incorrect None type annotations
- `src/article_parser.py`: Type mismatches in function parameters
- `src/components/`: Similar type annotation issues across component files

**Impact**:
- Static type checking failures
- Potential runtime errors
- Poor code maintainability
- IDE/tooling issues
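
As an illustration of the annotation issue described above, here is a minimal sketch. The function name `describe_article` is hypothetical and does not come from the codebase; only the `str = None` versus `Optional[str]` contrast is taken from this report. Recent versions of mypy reject the implicit form by default.

```python
from typing import Optional

# Implicit-Optional style flagged by the report: the annotation says `str`,
# but the default is None, so the declared type does not admit the default.
#
#     def describe_article(pmcid: str = None) -> str: ...


def describe_article(pmcid: Optional[str] = None) -> str:
    """Hypothetical helper whose signature admits None explicitly."""
    if pmcid is None:
        return "no article selected"
    return f"article {pmcid}"
```

With the explicit annotation, a type checker can also verify that callers handle the `None` branch before using the value as a string.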

### 3. Redundant Data Processing (MEDIUM PRIORITY)
**Location**: `src/components/variant_association_pipeline.py`

**Issue**: The pipeline calls `get_article_text()` multiple times for the same article across different processing steps.

**Impact**:
- Redundant file I/O operations
- Unnecessary string processing
- Memory inefficiency
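
One possible way to avoid the repeated reads is to memoize the loader. The sketch below uses `functools.lru_cache` from the standard library; the function name `get_article_text_cached` and its path-based signature are assumptions, since the real signature of `get_article_text()` is not shown in this report.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=128)
def get_article_text_cached(path: str) -> str:
    """Read an article from disk once per path; later calls hit the cache.

    Assumption: articles are addressed by file path. The real
    get_article_text() may take a PMCID or other arguments instead.
    """
    return Path(path).read_text(encoding="utf-8")
```

A cached read will not notice if the file changes on disk during the run. An alternative that avoids this is to load the text once at the start of the pipeline and pass it to each processing step as an argument.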

### 4. Inefficient List Iteration Patterns (LOW PRIORITY)
**Location**: `src/utils.py:55-66` - `compare_lists()` function

**Issue**: Multiple iterations over the same lists for coloring operations.

**Impact**:
- Multiple O(n) operations that could be combined
- Redundant set membership checks
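
The body of `compare_lists()` is not reproduced in this report, so the following is only a hypothetical sketch of the general idea: build each set once, then label every item in a single pass per list. The name `compare_lists_single_pass` and the return shape are assumptions made for illustration.

```python
from typing import List, Tuple


def compare_lists_single_pass(
    predicted: List[str], expected: List[str]
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Label predicted items as matched or not, and collect missed items.

    Hypothetical stand-in for compare_lists(); the real function applies
    colouring for display, which is omitted here.
    """
    # Build each set once so every membership check is O(1) on average.
    expected_set = set(expected)
    predicted_set = set(predicted)

    # One pass over each list, preserving the original order.
    labelled = [(item, item in expected_set) for item in predicted]
    missed = [item for item in expected if item not in predicted_set]
    return labelled, missed
```

A display layer could then map the boolean labels to colours without re-checking membership.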

## Implemented Fix

### JSON Caching Optimization
**File**: `src/utils.py`
**Function**: `get_true_variants()`

**Changes**:
- Added module-level cache variable `_true_variant_cache`
- Implemented lazy loading pattern
- Added proper error handling for missing files
- Used context manager for safe file handling

**Benefits**:
- JSON file loaded at most once per process, on the first call (lazy loading)
- Significant performance improvement for repeated calls
- Proper resource management
- Thread-safe implementation
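
The changes listed above could look roughly like the following sketch. The cache name `_true_variant_cache` and the file path come from this report; the lock, the `path` parameter, and the empty-list fallback for missing files or unknown PMCIDs are assumptions, not the confirmed implementation.

```python
import json
import threading
from typing import Any, Dict, List, Optional

# Path taken from the original snippet in this report.
TRUE_VARIANTS_PATH = "data/benchmark/true_variant_list.json"

_true_variant_cache: Optional[Dict[str, Any]] = None
_cache_lock = threading.Lock()  # assumption: the report does not describe a lock


def _load_true_variants(path: str) -> Dict[str, Any]:
    """Load the JSON file on first use and reuse the parsed result afterwards."""
    global _true_variant_cache
    if _true_variant_cache is None:
        with _cache_lock:
            # Re-check inside the lock so only one thread performs the load.
            if _true_variant_cache is None:
                try:
                    with open(path, encoding="utf-8") as f:
                        _true_variant_cache = json.load(f)
                except FileNotFoundError:
                    # Assumption: a missing file yields an empty mapping and
                    # is not cached, so a later call can retry.
                    return {}
    return _true_variant_cache


def get_true_variants(pmcid: str, path: str = TRUE_VARIANTS_PATH) -> List[Any]:
    # Assumption: unknown PMCIDs return an empty list instead of raising KeyError.
    return _load_true_variants(path).get(pmcid, [])
```

Note that in this sketch `path` is only consulted until the first successful load; after that, every call returns data from the cached mapping.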

## Recommendations for Future Improvements

1. **Type Annotations**: Fix all type annotation issues across the codebase
2. **Article Text Caching**: Implement caching for article text loading
3. **Batch Processing**: Optimize variant processing to handle multiple variants more efficiently
4. **Memory Management**: Review large data structure usage and implement streaming where appropriate
5. **Database Integration**: Consider using a database instead of JSON files for better performance

## Testing Recommendations

1. Create performance benchmarks for the JSON loading optimization
2. Add unit tests for the caching mechanism
3. Implement integration tests to ensure functionality is preserved
4. Add memory usage monitoring for large dataset processing
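
A performance benchmark for the JSON loading optimization could start from something like the sketch below. It times a per-call loader against a memoized one on a temporary file, using only the standard library. All names here (`load_uncached`, `load_cached`, `benchmark`) are hypothetical, and `functools.lru_cache` stands in for the module-level cache described in this report.

```python
import json
import os
import tempfile
import timeit
from functools import lru_cache
from typing import Any, Dict, Tuple


def load_uncached(path: str) -> Dict[str, Any]:
    """Parse the JSON file on every call, like the original function."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


@lru_cache(maxsize=None)
def load_cached(path: str) -> Dict[str, Any]:
    """Parse the JSON file once per path; later calls reuse the result."""
    return load_uncached(path)


def benchmark(n_calls: int = 200) -> Tuple[float, float]:
    """Return (uncached_seconds, cached_seconds) for n_calls loads each."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "variants.json")
        # Synthetic data; the real benchmark file is not part of this sketch.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({f"PMC{i}": [f"rs{i}"] for i in range(1000)}, f)
        load_cached.cache_clear()
        uncached = timeit.timeit(lambda: load_uncached(path), number=n_calls)
        cached = timeit.timeit(lambda: load_cached(path), number=n_calls)
    return uncached, cached


if __name__ == "__main__":
    uncached_s, cached_s = benchmark()
    print(f"uncached: {uncached_s:.4f}s  cached: {cached_s:.4f}s")
```

Timings vary by machine and file size, so a unit test is better off asserting on cache hit counts than on wall-clock differences.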

## Conclusion

The most critical efficiency issue was the repeated JSON file loading in `get_true_variants()`. This fix provides immediate performance benefits with minimal risk. The type annotation issues should be addressed in a follow-up PR to improve code quality and maintainability.