
Rewrite Feature_extraction with parallel processing and per-channel alignment #43

Merged
smishra3 merged 3 commits into reSub-v1 from feature/parallel-feature-extraction-with-alignment
Feb 6, 2026

Conversation

smishra3 (Collaborator) commented Feb 3, 2026

Summary

  • Parallelize movie processing using joblib (n_jobs=32) for faster extraction

  • Fix dual-camera alignment: apply segmentation mask alignment only when the raw fluorescence channel is on a different camera than the brightfield. Camera 1 (brightfield, 638nm) channels share the mask coordinate space; Camera 2 (488nm, 561nm) channels require the inverse calibration transform. Previously, the same alignment was applied uniformly to all channels, causing misalignment for channels on the same camera as the mask.

  • Extract intensity from Channel 2 and Channel 3 when available, with per-channel alignment based on wavelength-to-camera mapping from manifest

  • Add retry logic with exponential backoff for BioImage loading and dask array computation to handle transient network errors

  • Support fixed-cell (immunostaining) experiments with correct timepoint calculation from fixation time metadata

  • Add local file loading option to io.load_imaging_and_segmentation_dataset()

  • Add setuptools package discovery config to pyproject.toml

  • Add joblib dependency to pyproject.toml

  • Improve README with pipeline overview, step descriptions, gene metric definitions, and dual-camera alignment explanation

Test plan

  • Verify imports and unit tests pass in a fresh venv

  • Run feature extraction on a subset of movies

  • Compare extracted intensities for multi-camera movies against previous output
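The per-channel alignment rule described in the summary can be sketched as a small predicate. The manifest keys and the dictionary shape below are hypothetical illustrations of the wavelength-to-camera mapping, not the actual API:

```python
def needs_mask_alignment(wavelength_nm: int, manifest: dict) -> bool:
    """Return True when the raw fluorescence channel sits on a different
    camera than the brightfield/segmentation-mask camera, i.e. when the
    inverse calibration transform must be applied before extraction."""
    mask_camera = manifest["brightfield_camera"]
    channel_camera = manifest["wavelength_to_camera"][wavelength_nm]
    return channel_camera != mask_camera

# Example mapping from the summary: Camera 1 holds brightfield + 638 nm,
# Camera 2 holds 488 nm and 561 nm.
manifest = {
    "brightfield_camera": 1,
    "wavelength_to_camera": {638: 1, 488: 2, 561: 2},
}
```

Under this mapping, 638 nm needs no transform (same camera as the mask) while 488 nm and 561 nm do, which matches the bug fix: the old code applied the transform to all three.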

Copilot AI left a comment

Pull request overview

This PR refactors the feature extraction pipeline to improve performance through parallel processing, fix dual-camera alignment issues, and add support for fixed-cell experiments. The main change addresses a critical bug where segmentation mask alignment was applied uniformly to all channels rather than selectively, based on which camera captures each wavelength.

Changes:

  • Implemented parallel movie processing using joblib (32 workers) to significantly reduce extraction runtime
  • Fixed dual-camera alignment logic to apply coordinate transforms only when raw fluorescence and segmentation mask are on different cameras
  • Added retry logic with exponential backoff for network errors when loading BioImage files and computing dask arrays
  • Extended extraction to support additional fluorescence channels (Channel 2 and Channel 3) with per-channel alignment
  • Added support for fixed-cell immunostaining experiments with correct timepoint calculation from fixation metadata
  • Enhanced documentation with pipeline overview and detailed step descriptions

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File descriptions:

  • pyproject.toml: Added joblib dependency and setuptools package discovery configuration

  • README.md: Expanded documentation with pipeline steps, gene metric definitions, and dual-camera alignment explanation

  • EMT_data_analysis/tools/io.py: Added local file loading option to load_imaging_and_segmentation_dataset()

  • EMT_data_analysis/analysis_scripts/Feature_extraction.py: Complete rewrite with parallel processing, per-channel alignment, retry logic, and multi-channel extraction


]

[tool.setuptools.packages.find]
include = ["EMT_data_analysis*"]
Copilot AI Feb 3, 2026
The [tool.setuptools.packages.find] configuration is missing the where parameter. Without specifying where = ['.'] or the appropriate source directory, setuptools may not correctly discover packages. Consider adding where = ['.'] to explicitly define the package search location.

Suggested change
include = ["EMT_data_analysis*"]
include = ["EMT_data_analysis*"]
where = ["."]

path = local_path
else:
# Default local path: project root (parent of EMT_data_analysis package)
project_root = Path(__file__).parent.parent.parent
Copilot AI Feb 3, 2026
Using parent.parent.parent to navigate directory hierarchy is fragile and difficult to understand. Consider using a more explicit approach such as defining a project root constant or using Path(__file__).parents[2] with a comment explaining the directory structure.

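A sketch of the reviewer's suggestion, assuming the quoted file lives at <project_root>/EMT_data_analysis/tools/io.py (the /repo prefix below is a placeholder path for illustration):

```python
from pathlib import PurePosixPath

# Hypothetical layout: <project_root>/EMT_data_analysis/tools/io.py
# parents[0] = tools/, parents[1] = EMT_data_analysis/, parents[2] = project root
module_file = PurePosixPath("/repo/EMT_data_analysis/tools/io.py")
project_root = module_file.parents[2]
```

`parents[2]` is equivalent to `parent.parent.parent` but reads as a single indexed hop, and the comment pins it to the directory layout it depends on.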
except Exception as e:
last_error = e
if attempt < max_retries - 1:
wait_time = min(2 ** attempt, 30)
Copilot AI Feb 3, 2026
The magic number 30 (maximum wait time in seconds) should be extracted as a named constant (e.g., MAX_RETRY_WAIT_SECONDS = 30) to improve code clarity and maintainability.

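The extraction the comment asks for might look like this (the constant name comes from the comment itself; the helper function name is hypothetical):

```python
MAX_RETRY_WAIT_SECONDS = 30

def backoff_delay(attempt: int) -> int:
    """Exponential backoff capped at MAX_RETRY_WAIT_SECONDS:
    attempt 0 -> 1 s, 1 -> 2 s, 2 -> 4 s, ... then flat at the cap."""
    return min(2 ** attempt, MAX_RETRY_WAIT_SECONDS)
```

Centralizing the cap in one named constant also makes the 30 s vs 60 s inconsistency flagged in the next comment visible at a glance.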
except (ServerDisconnectedError, ClientError, ConnectionError, TimeoutError, UnsupportedFileFormatError) as e:
last_error = e
if attempt < max_retries - 1:
wait_time = min(2 ** attempt, 60)
Copilot AI Feb 3, 2026
The magic number 60 (maximum wait time in seconds) should be extracted as a named constant (e.g., MAX_RETRY_WAIT_SECONDS = 60) to improve code clarity and maintainability. Note this differs from the 30-second cap in load_image_with_retry—consider whether these should be consistent.


# Determine timepoints to process (first 48 hours = 98 timepoints)
num_timepoints = raw_reader.dims.T
max_timepoint = min(num_timepoints, 98)
Copilot AI Feb 3, 2026
The magic number 98 (representing 48 hours of timepoints) should be extracted as a named constant (e.g., MAX_TIMEPOINTS_48_HOURS = 98) with a comment explaining the calculation basis.

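A sketch of the suggested constant. Note the 48 h ↔ 98 frames relationship implies a frame interval of roughly 29.4 minutes, but the actual imaging interval is not stated anywhere in this PR, so the comment below marks it as an inference:

```python
# 98 timepoints span the first 48 hours of imaging; the implied frame
# interval (48 h / 98 ≈ 29.4 min) is an inference, not stated in the PR.
MAX_TIMEPOINTS_48_HOURS = 98

def clamp_timepoints(num_timepoints: int) -> int:
    """Limit processing to the first 48 hours of the movie."""
    return min(num_timepoints, MAX_TIMEPOINTS_48_HOURS)
```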
except Exception as e:
return (movie_id, False, f"{type(e).__name__}: {str(e)}")

def compute_bf_colony_features_all_movies(output_folder, align=True, n_jobs=32):
Copilot AI Feb 3, 2026
The default n_jobs=32 may be too aggressive for systems with fewer CPU cores and could cause resource contention or memory issues. Consider using n_jobs=-1 (all available cores) or calculating based on os.cpu_count() to adapt to the system's capabilities.

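A defensive default along the lines the reviewer suggests (the function name is hypothetical); it keeps 32 as the ceiling for large machines while adapting down on smaller ones:

```python
import os

def resolve_n_jobs(requested: int = 32) -> int:
    """Cap the requested joblib worker count at the machine's CPU count,
    so a 4-core laptop does not try to spawn 32 workers."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))
```

Alternatively, joblib accepts n_jobs=-1 to use all available cores, which avoids hardcoding any number at all.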
smishra3 and others added 2 commits February 3, 2026 10:19
CI runs `pdm export -f requirements` which includes hashes by default.
The previous export used --no-hashes, causing the diff check to fail.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@smishra3 smishra3 merged commit c71d5a7 into reSub-v1 Feb 6, 2026
2 checks passed