Merged
7 changes: 7 additions & 0 deletions .gitignore
@@ -86,6 +86,13 @@ lightning_logs/
data_temp/
temp/

# output
test_full_workflow/

# tfs files
simba/configs/*/tfs*.yaml
run_scripts_tfs*

# ============================================================================
# Code Quality Tools
# ============================================================================
32 changes: 32 additions & 0 deletions README.md
@@ -275,6 +275,38 @@ simba preprocess \

---

**Reusing Precomputed Distances:**

To speed up preprocessing when working with related datasets (e.g., MS2-only, MS3-only, and joint MS2+MS3), you can reuse previously computed molecular distances:

```bash
# First: preprocess MS2-only data
simba preprocess \
paths.spectra_path=ms2_spectra.mgf \
paths.preprocessing_dir=./ms2_preprocessing/

# Then: preprocess MS3-only data
simba preprocess \
paths.spectra_path=ms3_spectra.mgf \
paths.preprocessing_dir=./ms3_preprocessing/

# Finally: preprocess joint dataset, reusing distances from both
simba preprocess \
paths.spectra_path=joint_spectra.mgf \
paths.preprocessing_dir=./joint_preprocessing/ \
'preprocessing.precomputed_distances=[./ms2_preprocessing/, ./ms3_preprocessing/]'
```

The cache automatically:
- Finds all distance files (`edit_distance_*.npy`, `mces_*.npy`) in each directory
- Loads SMILES mappings from `mapping_unique_smiles.pkl`
- Matches molecules by SMILES strings (robust to different splits/filters)
- Logs cache hit/miss statistics during computation
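
The discovery step in the list above can be sketched roughly as follows. This is an illustration only, not SIMBA's actual code: the function name `discover_cache_files` is hypothetical, and it assumes the files sit directly in the preprocessing directory (no recursive search). Only the file patterns and the mapping filename come from the list above.

```python
from pathlib import Path

# Patterns and mapping filename are taken from the list above.
DISTANCE_PATTERNS = ("edit_distance_*.npy", "mces_*.npy")
MAPPING_FILENAME = "mapping_unique_smiles.pkl"


def discover_cache_files(directory):
    """Hypothetical helper: find distance files and the SMILES mapping.

    Returns a tuple (sorted list of distance files, mapping path or None).
    Assumes a flat directory layout; subdirectories are not searched.
    """
    root = Path(directory)
    distance_files = sorted(
        path for pattern in DISTANCE_PATTERNS for path in root.glob(pattern)
    )
    mapping = root / MAPPING_FILENAME
    return distance_files, (mapping if mapping.is_file() else None)
```

A directory without a mapping file yields `None` as the second element, so a caller could skip it instead of failing.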

**Cache hit rate** is the percentage of molecule pairs whose distances were reused from the cache instead of being recomputed.
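
To make the hit rate concrete, here is a minimal sketch of matching pairs by SMILES and counting hits. It is not the project's implementation: `pair_key`, `build_cache`, and `lookup_pairs` are hypothetical names, the distance matrix is assumed to be square and symmetric, SMILES are assumed to be unique within a dataset, and matching is by exact string equality.

```python
from itertools import combinations


def pair_key(smiles_a, smiles_b):
    """Order-independent key so (A, B) and (B, A) map to the same entry."""
    return tuple(sorted((smiles_a, smiles_b)))


def build_cache(smiles, distances):
    """Index a square distance matrix by SMILES pair rather than by row/column."""
    return {
        pair_key(smiles[i], smiles[j]): distances[i][j]
        for i, j in combinations(range(len(smiles)), 2)
    }


def lookup_pairs(new_smiles, cache):
    """Split the new dataset's pairs into cache hits and pairs to recompute."""
    hits, misses = {}, []
    for a, b in combinations(new_smiles, 2):
        key = pair_key(a, b)
        if key in cache:
            hits[key] = cache[key]
        else:
            misses.append(key)
    total = len(hits) + len(misses)
    hit_rate = 100.0 * len(hits) / total if total else 0.0
    return hits, misses, hit_rate


# Toy data: "CCCl" is absent from the cached run, so only one of the
# three new pairs (CCN, CCO) can be reused.
cache = build_cache(["CCO", "CCN", "CCC"], [[0, 1, 2], [1, 0, 3], [2, 3, 0]])
hits, misses, hit_rate = lookup_pairs(["CCN", "CCO", "CCCl"], cache)
```

Keying on SMILES rather than matrix indices is what lets a cache built from one split or filter be reused for another, since row order no longer matters.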

---

**Quick Testing (Fast Dev Mode):**

```bash
1 change: 1 addition & 0 deletions pyproject.toml
@@ -46,6 +46,7 @@ dependencies = [
"pyteomics>=4.6.0",
"depthcharge-ms @ git+https://github.com/wfondrie/depthcharge.git@bd2861f",
"myopic-mces>=1.0.0,<2.0.0",
"highspy>=1.13.1",
# Data processing
"h5py>=3.10.0",
"pyarrow>=15.0.0",
4 changes: 4 additions & 0 deletions simba/commands/analog_discovery.py
@@ -191,5 +191,9 @@ def _analog_discovery_with_hydra(
        click.echo("=" * 70)

    except Exception as e:
        import traceback

        click.echo(f"\n❌ Error during analog discovery: {e}", err=True)
        click.echo("\nFull traceback:", err=True)
        click.echo(traceback.format_exc(), err=True)
        raise click.Abort() from e
1 change: 1 addition & 0 deletions simba/configs/model/simba_default.yaml
@@ -33,6 +33,7 @@ features:
use_element_wise: true
categorical_adducts: false
use_only_protonized_adducts: false
use_ion_mode: false

# Metadata features
use_ce: false
8 changes: 8 additions & 0 deletions simba/configs/preprocessing/default.yaml
@@ -24,5 +24,13 @@ test_split: 0.1 # Test split fraction (0.0-1.0)
random_mces_sampling: false
use_only_protonized_adducts: true

# Precomputed distances - reuse distances from previous preprocessing runs
# Just list preprocessing directories - auto-discovers all distance files
precomputed_distances:
# Examples:
# - "./test_precomputed_cache/dataset1/"
# - "./ms2_preprocessing/"
# - "./ms3_preprocessing/"

# Subsampling
subsample_preprocessing: false