
Rewrite Feature_extraction with parallel processing and per-channel alignment #43

Merged
smishra3 merged 3 commits into reSub-v1 from feature/parallel-feature-extraction-with-alignment
Feb 6, 2026

Conversation

smishra3 (Collaborator) commented Feb 3, 2026

Summary

  • Parallelize movie processing using joblib (n_jobs=32) for faster extraction

  • Fix dual-camera alignment: apply segmentation mask alignment only when the raw fluorescence channel is on a different camera than the brightfield. Camera 1 (brightfield, 638nm) channels share the mask coordinate space; Camera 2 (488nm, 561nm) channels require the inverse calibration transform. Previously, the same alignment was applied uniformly to all channels, causing misalignment for channels on the same camera as the mask.

  • Extract intensity from Channel 2 and Channel 3 when available, with per-channel alignment based on wavelength-to-camera mapping from manifest

  • Add retry logic with exponential backoff for BioImage loading and dask array computation to handle transient network errors

  • Support fixed-cell (immunostaining) experiments with correct timepoint calculation from fixation time metadata

  • Add local file loading option to io.load_imaging_and_segmentation_dataset()

  • Add setuptools package discovery config to pyproject.toml

  • Add joblib dependency to pyproject.toml

  • Improve README with pipeline overview, step descriptions, gene metric definitions, and dual-camera alignment explanation

Test plan

  • Verify imports and unit tests pass in a fresh venv

  • Run feature extraction on a subset of movies

  • Compare extracted intensities for multi-camera movies against previous output
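The per-channel alignment rule described in the summary can be sketched as a small predicate. The manifest keys and the dictionary shape below are hypothetical illustrations of the wavelength-to-camera mapping, not the actual API:

```python
def needs_mask_alignment(wavelength_nm: int, manifest: dict) -> bool:
    """Return True when the raw fluorescence channel sits on a different
    camera than the brightfield/segmentation-mask camera, i.e. when the
    inverse calibration transform must be applied before extraction."""
    mask_camera = manifest["brightfield_camera"]
    channel_camera = manifest["wavelength_to_camera"][wavelength_nm]
    return channel_camera != mask_camera

# Example mapping from the summary: Camera 1 holds brightfield + 638 nm,
# Camera 2 holds 488 nm and 561 nm.
manifest = {
    "brightfield_camera": 1,
    "wavelength_to_camera": {638: 1, 488: 2, 561: 2},
}
```

Under this mapping, 638 nm needs no transform (same camera as the mask) while 488 nm and 561 nm do, which matches the bug fix: the old code applied the transform to all three.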

Copilot AI left a comment

Pull request overview

This PR refactors the feature extraction pipeline to improve performance through parallel processing, fix dual-camera alignment issues, and add support for fixed-cell experiments. The main change addresses a critical bug where segmentation mask alignment was applied uniformly to all channels rather than selectively, based on which camera captures each wavelength.

Changes:

  • Implemented parallel movie processing using joblib (32 workers) to significantly reduce extraction runtime
  • Fixed dual-camera alignment logic to apply coordinate transforms only when raw fluorescence and segmentation mask are on different cameras
  • Added retry logic with exponential backoff for network errors when loading BioImage files and computing dask arrays
  • Extended extraction to support additional fluorescence channels (Channel 2 and Channel 3) with per-channel alignment
  • Added support for fixed-cell immunostaining experiments with correct timepoint calculation from fixation metadata
  • Enhanced documentation with pipeline overview and detailed step descriptions

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File descriptions:

  • pyproject.toml: Added joblib dependency and setuptools package discovery configuration

  • README.md: Expanded documentation with pipeline steps, gene metric definitions, and dual-camera alignment explanation

  • EMT_data_analysis/tools/io.py: Added local file loading option to load_imaging_and_segmentation_dataset()

  • EMT_data_analysis/analysis_scripts/Feature_extraction.py: Complete rewrite with parallel processing, per-channel alignment, retry logic, and multi-channel extraction


]

[tool.setuptools.packages.find]
include = ["EMT_data_analysis*"]
Copilot AI Feb 3, 2026
The [tool.setuptools.packages.find] configuration is missing the where parameter. Without specifying where = ['.'] or the appropriate source directory, setuptools may not correctly discover packages. Consider adding where = ['.'] to explicitly define the package search location.

Suggested change
include = ["EMT_data_analysis*"]
include = ["EMT_data_analysis*"]
where = ["."]

path = local_path
else:
# Default local path: project root (parent of EMT_data_analysis package)
project_root = Path(__file__).parent.parent.parent
Copilot AI Feb 3, 2026
Using parent.parent.parent to navigate directory hierarchy is fragile and difficult to understand. Consider using a more explicit approach such as defining a project root constant or using Path(__file__).parents[2] with a comment explaining the directory structure.

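A sketch of the reviewer's suggestion, assuming the quoted file lives at <project_root>/EMT_data_analysis/tools/io.py (the /repo prefix below is a placeholder path for illustration):

```python
from pathlib import PurePosixPath

# Hypothetical layout: <project_root>/EMT_data_analysis/tools/io.py
# parents[0] = tools/, parents[1] = EMT_data_analysis/, parents[2] = project root
module_file = PurePosixPath("/repo/EMT_data_analysis/tools/io.py")
project_root = module_file.parents[2]
```

`parents[2]` is equivalent to `parent.parent.parent` but reads as a single indexed hop, and the comment pins it to the directory layout it depends on.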
except Exception as e:
last_error = e
if attempt < max_retries - 1:
wait_time = min(2 ** attempt, 30)
Copilot AI Feb 3, 2026
The magic number 30 (maximum wait time in seconds) should be extracted as a named constant (e.g., MAX_RETRY_WAIT_SECONDS = 30) to improve code clarity and maintainability.

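The extraction the comment asks for might look like this (the constant name comes from the comment itself; the helper function name is hypothetical):

```python
MAX_RETRY_WAIT_SECONDS = 30

def backoff_delay(attempt: int) -> int:
    """Exponential backoff capped at MAX_RETRY_WAIT_SECONDS:
    attempt 0 -> 1 s, 1 -> 2 s, 2 -> 4 s, ... then flat at the cap."""
    return min(2 ** attempt, MAX_RETRY_WAIT_SECONDS)
```

Centralizing the cap in one named constant also makes the 30 s vs 60 s inconsistency flagged in the next comment visible at a glance.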
except (ServerDisconnectedError, ClientError, ConnectionError, TimeoutError, UnsupportedFileFormatError) as e:
last_error = e
if attempt < max_retries - 1:
wait_time = min(2 ** attempt, 60)
Copilot AI Feb 3, 2026
The magic number 60 (maximum wait time in seconds) should be extracted as a named constant (e.g., MAX_RETRY_WAIT_SECONDS = 60) to improve code clarity and maintainability. Note this differs from the 30-second cap in load_image_with_retry—consider whether these should be consistent.


# Determine timepoints to process (first 48 hours = 98 timepoints)
num_timepoints = raw_reader.dims.T
max_timepoint = min(num_timepoints, 98)
Copilot AI Feb 3, 2026
The magic number 98 (representing 48 hours of timepoints) should be extracted as a named constant (e.g., MAX_TIMEPOINTS_48_HOURS = 98) with a comment explaining the calculation basis.

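A sketch of the suggested constant. Note the 48 h ↔ 98 frames relationship implies a frame interval of roughly 29.4 minutes, but the actual imaging interval is not stated anywhere in this PR, so the comment below marks it as an inference:

```python
# 98 timepoints span the first 48 hours of imaging; the implied frame
# interval (48 h / 98 ≈ 29.4 min) is an inference, not stated in the PR.
MAX_TIMEPOINTS_48_HOURS = 98

def clamp_timepoints(num_timepoints: int) -> int:
    """Limit processing to the first 48 hours of the movie."""
    return min(num_timepoints, MAX_TIMEPOINTS_48_HOURS)
```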
except Exception as e:
return (movie_id, False, f"{type(e).__name__}: {str(e)}")

def compute_bf_colony_features_all_movies(output_folder, align=True, n_jobs=32):
Copilot AI Feb 3, 2026
The default n_jobs=32 may be too aggressive for systems with fewer CPU cores and could cause resource contention or memory issues. Consider using n_jobs=-1 (all available cores) or calculating based on os.cpu_count() to adapt to the system's capabilities.

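A defensive default along the lines the reviewer suggests (the function name is hypothetical); it keeps 32 as the ceiling for large machines while adapting down on smaller ones:

```python
import os

def resolve_n_jobs(requested: int = 32) -> int:
    """Cap the requested joblib worker count at the machine's CPU count,
    so a 4-core laptop does not try to spawn 32 workers."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))
```

Alternatively, joblib accepts n_jobs=-1 to use all available cores, which avoids hardcoding any number at all.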
smishra3 and others added 2 commits February 3, 2026 10:19
CI runs `pdm export -f requirements` which includes hashes by default.
The previous export used --no-hashes, causing the diff check to fail.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@smishra3 smishra3 merged commit c71d5a7 into reSub-v1 Feb 6, 2026
2 checks passed