Description
In simba/mces/mces_computation.py (lines 342-434), the compute_all_mces_results_exhaustive function has a bug where computed_pair_distances is overwritten for each processed chunk instead of being accumulated. This results in the returned MolecularPairsSet containing only the last processed chunk when a node processes multiple chunks.
Location
File: simba/mces/mces_computation.py
Function: compute_all_mces_results_exhaustive
Lines: 342-434
Current Behavior
computed_pair_distances = np.empty((0, 3)) # Line 342
for chunk_idx, chunk in enumerate(chunks): # Line 344
if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
# ... chunk processing ...
# Get results from async objects
computed_pair_distances = np.concatenate( # Line 421 - OVERWRITES!
[result.get() for result in results], axis=0
)
# save distances per chunk
np.save(arr=computed_pair_distances, file=filename) # Line 426
logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.") # Line 429 - WRONG COUNT!
molecular_pair_set = MolecularPairsSet(
spectra=all_spectra, pair_distances=computed_pair_distances # Line 433 - INCOMPLETE!
)
Problem
- Overwriting instead of accumulating: Each iteration overwrites
computed_pair_distances instead of appending to it
- Incorrect logging: The final logger message (line 429) reports only the size of the last chunk
- Incomplete return value: The returned
MolecularPairsSet (line 433) contains only the last chunk's data
- Used return value: The return value is used in line 72 and passed to line 81, making this a real bug
When Does This Occur?
- When
PREPROCESSING_NUM_NODES > 1 and multiple chunks map to the same node
- When
PREPROCESSING_CURRENT_NODE is None (processes all chunks)
Impact
- ⚠️ Low impact if: Each node typically processes only one chunk and callers always load data via
LoadMCES.load_raw_data
- 🔴 High impact if: The returned
MolecularPairsSet is used directly without reloading from files
Note: All chunk data is correctly saved to separate files (line 426), so data is not lost - but the return value and logging are incorrect.
Expected Behavior
Accumulate all processed chunks:
computed_pair_distances = np.empty((0, 3))
all_computed = [] # NEW: accumulator
for chunk_idx, chunk in enumerate(chunks):
if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
# ... chunk processing ...
# Get results from async objects
chunk_distances = np.concatenate(
[result.get() for result in results], axis=0
)
all_computed.append(chunk_distances) # NEW: accumulate
# save distances per chunk (unchanged)
np.save(arr=chunk_distances, file=filename)
# NEW: Concatenate all chunks after the loop
if all_computed:
computed_pair_distances = np.concatenate(all_computed, axis=0)
logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.") # NOW CORRECT!
molecular_pair_set = MolecularPairsSet(
spectra=all_spectra, pair_distances=computed_pair_distances # NOW COMPLETE!
)
Description
In
simba/mces/mces_computation.py(lines 342-434), thecompute_all_mces_results_exhaustivefunction has a bug wherecomputed_pair_distancesis overwritten for each processed chunk instead of being accumulated. This results in the returnedMolecularPairsSetcontaining only the last processed chunk when a node processes multiple chunks.Location
File:
simba/mces/mces_computation.pyFunction:
compute_all_mces_results_exhaustiveLines: 342-434
Current Behavior
Problem
computed_pair_distancesinstead of appending to itMolecularPairsSet(line 433) contains only the last chunk's dataWhen Does This Occur?
PREPROCESSING_NUM_NODES > 1and multiple chunks map to the same nodePREPROCESSING_CURRENT_NODE is None(processes all chunks)Impact
LoadMCES.load_raw_dataMolecularPairsSetis used directly without reloading from filesNote: All chunk data is correctly saved to separate files (line 426), so data is not lost - but the return value and logging are incorrect.
Expected Behavior
Accumulate all processed chunks: