Improve protein ID normalization and expand DE analysis in Mokume by Shen-YuFei · Pull Request #18 · bigbio/mokume

Shen-YuFei · 2026-04-14T07:09:59Z

This pull request introduces comprehensive documentation and usability improvements for differential expression analysis, normalization, and workflow configuration in mokume. It adds a new conceptual guide for differential expression, expands the README with detailed API and CLI usage for new statistical methods and normalization techniques, clarifies configuration options, and improves discoverability of features throughout the documentation. Optional dependency installation instructions are also updated for clarity.

Major documentation and feature additions:

Differential Expression Analysis:

Added a new conceptual guide at docs/concepts/differential-expression.md detailing LimROTS, DEqMS, and proDA methods, their strengths, weaknesses, and selection guidance. Includes benchmarking results and practical recommendations.
Expanded the README.md with a new section on differential expression, including method descriptions, Python API usage, CLI examples, and output interpretation.
Linked the new differential expression documentation from the main index and concept pages for easier navigation.

Normalization and Imputation:

Documented LOESS normalization as a new utility in both the README.md and docs/concepts/normalization.md, with code examples and usage notes.
Clarified and expanded normalization method options, updating CLI and API documentation for feature-level and sample-level normalization, including new methods like max, global, max_min, hierarchical, and tmm.

Workflow and Configuration Improvements:

Updated pipeline configuration examples to use explicit config classes for input, quantification, and batch correction, making advanced options clearer and more discoverable.
Added a new "Choose Your Workflow" section to the main documentation index for guiding users to the appropriate entry point.

Installation and Optional Extras:

Clarified installation instructions for standard and optional features, including new extras for tissuemap and improved descriptions for batch correction and interactive reports.

These changes make mokume's advanced statistical and normalization methods more accessible, improve onboarding for new users, and ensure that documentation matches the latest feature set and best practices.

coderabbitai · 2026-04-14T07:10:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc2e1ea4-0af9-4dba-a184-04cb948663c9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

codacy-production · 2026-04-14T07:11:34Z

Not up to standards ⛔

🔴 Issues 16 minor

Alerts:
⚠ 16 issues (quality gate allows ≤ 0 issues of at least minor severity)

Results:
16 new issues

Category	Results
Documentation	14 minor
CodeStyle	2 minor

🟢 Metrics 39 complexity · 6 duplication

Metric	Results
Complexity	39
Duplication	6

Copilot

Pull request overview

This PR adds new differential expression (DE) analysis capabilities to mokume (LimROTS, DEqMS, proDA), improves protein accession normalization for FASTA matching, and expands CLI/config/docs to expose the new options.

Changes:

Introduces a new mokume.analysis DE API (DifferentialExpression) plus implementations of LimROTS, DEqMS, and proDA, and wires DE into the pipeline with an "auto" method selector.
Improves protein identifier normalization via get_accession()/build_accession_map() and applies it to FASTA-based peptide counting / molecular weight calculation paths.
Adds censored-aware imputation utilities and LOESS normalization, plus new tests and documentation/CLI/config updates for these features.
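
To make the accession-normalization item above concrete, here is a minimal sketch. It assumes UniProt-style `db|ACCESSION|ENTRY_NAME` FASTA identifiers; `normalize_accession` and `build_map` are hypothetical names used only for illustration and are not mokume's `get_accession()` / `build_accession_map()`, whose exact behavior is not shown on this page.

```python
def normalize_accession(identifier):
    """Reduce a FASTA-style identifier to a bare accession (illustrative only)."""
    # Drop a leading ">" and any description after the first whitespace.
    fields = identifier.strip().lstrip(">").split()
    token = fields[0] if fields else ""
    parts = token.split("|")
    # UniProt headers look like "sp|P12345|ALBU_HUMAN" or "tr|...|...".
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    # Anything else is returned unchanged (assumed to be a bare accession).
    return token


def build_map(identifiers):
    """Map each raw identifier to its normalized accession."""
    return {raw: normalize_accession(raw) for raw in identifiers}


mapping = build_map(["sp|P12345|ALBU_HUMAN", "P12345", ">tr|A0A024R161|A0A024R161_HUMAN Protein"])
```

With a map like this, quantification tables keyed by bare accessions and FASTA entries keyed by full headers can be matched on the same value.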

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

File	Description
tests/test_features2proteins_cli.py	Adds CLI tests covering DirectLFQ and batch correction option plumbing/validation.
tests/test_de_and_imputation.py	Adds tests for DE methods, IHW path, censored imputation, and LOESS normalization.
tests/test_accession_normalization.py	Adds tests for accession parsing/mapping and FASTA sequence cleaning helper.
mokume/quantification/ibaq.py	Normalizes protein identifiers when matching FASTA entries for iBAQ/TPA calculations.
mokume/pipeline/stages.py	Adds DE method auto-selection and optional peptide-count loading for DEqMS; pins pandas groupby `observed`.
mokume/pipeline/features_to_proteins.py	Extends pipeline entrypoint parameters/config wiring for new DE and DirectLFQ options.
mokume/pipeline/config.py	Adds `ImputationConfig`, extends `DEConfig` with `fdr_method`, and adds `imputation` to `PipelineConfig`.
mokume/normalization/loess.py	Introduces LOESS normalization implementation and functional wrapper.
mokume/normalization/__init__.py	Exports LOESS normalization API.
mokume/io/fasta.py	Applies accession normalization and shared nonstandard-AA stripping in FASTA extraction and MW lookup.
mokume/imputation/censored.py	Adds censored-aware imputation methods (MinProb/MinDet/KNN) and missingness classification.
mokume/imputation/__init__.py	Exports new censored-aware imputation functions.
mokume/core/constants.py	Enhances `get_accession()` and adds `build_accession_map()` for normalization.
mokume/commands/features2proteins.py	Expands CLI with batch correction options, new DE methods, and `--de-fdr-method`; adds argument validation.
mokume/analysis/proda.py	Adds simplified proDA-style dropout-aware DE implementation.
mokume/analysis/limrots.py	Adds LimROTS DE implementation with bootstrap/permutation optimization.
mokume/analysis/differential_expression.py	Refactors DE orchestration, adds IHW support helpers, and integrates LimROTS/DEqMS/proDA dispatch.
mokume/analysis/deqms.py	Adds DEqMS implementation (spectra-count eBayes with variance–count modeling).
mokume/analysis/__init__.py	Exports new analysis entrypoints for direct import.
mokume/__init__.py	Suppresses a pyopenms OPENMS_DATA_PATH warning.
docs/user-guide/features2proteins.md	Updates user guide examples and option tables for new DE/FDR options.
docs/reference/configuration.md	Documents new DE config defaults and `fdr_method`.
docs/reference/cli.md	Updates CLI reference with batch correction and new DE/FDR options.
README.md	Documents new analysis/imputation/normalization modules and provides usage examples.

Copilot · 2026-04-14T07:14:30Z

    def _finalize_results(self, de_df: pd.DataFrame) -> pd.DataFrame:
        """Apply FDR correction and classify significance."""
-        # FDR correction (Benjamini-Hochberg)
-        reject, adj_pvalues, _, _ = multipletests(
-            de_df["pvalue"].values, method="fdr_bh"
-        )
-        de_df["adj_pvalue"] = adj_pvalues
+        if self.fdr_method == "ihw" and "adj_pvalue" not in de_df.columns:
+            de_df["adj_pvalue"] = _ihw_correction(
+                de_df["pvalue"].values,
+                de_df,
+                alpha=self.fdr_threshold,
+            )
+        elif "adj_pvalue" not in de_df.columns:
+            # Benjamini-Hochberg (default)
+            de_df["adj_pvalue"] = multipletests(
+                de_df["pvalue"].values, method="fdr_bh"
+            )[1]



fdr_method is effectively ignored for all DE methods right now: run_limrots, run_deqms, and run_proda each return an adj_pvalue column, and _finalize_results only applies IHW/BH when adj_pvalue is missing. This means --de-fdr-method ihw (and even BH recomputation) won’t take effect. Consider always recomputing/overriding adj_pvalue based on self.fdr_method (optionally preserving any method-specific adjusted values under a different column name).
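
A minimal sketch of the "always recompute" approach suggested above, written against plain dicts and lists so it runs without pandas or statsmodels. The names `finalize`, `CORRECTIONS`, and the `method_adj_pvalue` column are hypothetical; this is not the project's `_finalize_results`, and IHW is deliberately left out because its implementation (`_ihw_correction`) is not shown here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical registry; an "ihw" entry would wrap the project's own helper.
CORRECTIONS = {"bh": benjamini_hochberg}


def finalize(table, fdr_method="bh"):
    """Always recompute adj_pvalue, keeping any earlier values for reference."""
    if fdr_method not in CORRECTIONS:
        raise ValueError(f"unknown fdr_method: {fdr_method!r}")
    result = dict(table)
    if "adj_pvalue" in result:
        # Preserve the method-specific adjustment under a different column.
        result["method_adj_pvalue"] = result["adj_pvalue"]
    result["adj_pvalue"] = CORRECTIONS[fdr_method](result["pvalue"])
    return result
```

The point of the design is that the selected `fdr_method` decides the final `adj_pvalue` regardless of what the upstream DE method already wrote.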

Copilot · 2026-04-14T07:14:31Z

+    if n_obs < 5 or n_obs == n_total:
+        return np.nanmedian(observed) - 2.0, 1.0
+


_fit_dropout_curve() calls np.nanmedian(observed) even when observed is empty (n_obs == 0), which yields nan parameters and can propagate NaNs through the DE results. Handle the all-missing case explicitly (e.g., return a fixed fallback rho/zeta based on a global default) before calling nanmedian.
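
One way the all-missing guard could look, as a sketch only. `dropout_fallback`, `global_rho`, and `global_zeta` are hypothetical names; the caller is assumed to supply a global default (for example, one derived from the whole matrix), and the existing `median - 2.0, 1.0` shortcut from the diff is kept for the sparse and fully observed cases.

```python
import math
from statistics import median


def dropout_fallback(observed, n_total, global_rho, global_zeta=1.0):
    """Return fallback (rho, zeta), or None when a real curve fit is possible."""
    finite = [x for x in observed if not math.isnan(x)]
    n_obs = len(finite)
    if n_obs == 0:
        # All values missing: a median would be NaN, so use the global default.
        return global_rho, global_zeta
    if n_obs < 5 or n_obs == n_total:
        # Same shortcut as in the diff, now guaranteed to see at least one value.
        return median(finite) - 2.0, 1.0
    return None
```

Returning `None` here simply signals that enough observations exist for the caller to fit the dropout curve as before.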

Copilot · 2026-04-14T07:14:31Z

+# Multiple contrasts
+contrasts = [("Treatment", "Control"), ("Drug", "Control")]
+results = de.run_comparisons(protein_df, sample_to_condition, contrasts)
+# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}


The README example for run_comparisons() shows keys like "Treatment_vs_Control", but the implementation uses f"{contrast[0]}-{contrast[1]}" (e.g., "Treatment-Control"). Update the example to match the actual keys returned so users can index results correctly.

Suggested change

-# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}
+# Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame}
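
For reference, the key format described in this comment can be reproduced as follows. This is a sketch: `contrast_key` is a hypothetical helper, and the `results` dict only stands in for the return value of `run_comparisons()`.

```python
def contrast_key(contrast):
    """Build the result key in the "A-B" form described in the review comment."""
    return f"{contrast[0]}-{contrast[1]}"


contrasts = [("Treatment", "Control"), ("Drug", "Control")]
# Stand-in for the dict returned by run_comparisons(); real values would be DataFrames.
results = {contrast_key(c): None for c in contrasts}
keys = list(results)
```

Deriving the key from the contrast tuple, rather than typing it by hand, avoids depending on the separator.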

Copilot · 2026-04-14T07:14:31Z

+imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)
+
+# MinDet: replace NaN with per-column quantile
+imputed = impute_censored(log2_matrix, method="mindet", q=0.01)
+
+# KNN: k-nearest neighbor imputation
+imputed = impute_censored(log2_matrix, method="knn", k=10)


The impute_censored README examples use keyword args (q, tune_sigma, k) that don’t match the actual function signature (quantile, shift, scale, n_neighbors, …). As written, these snippets will raise TypeError. Please update the example arguments to the real parameter names.

Suggested change

-imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)
+imputed = impute_censored(log2_matrix, method="minprob", quantile=0.01, scale=1.0)
 
 # MinDet: replace NaN with per-column quantile
-imputed = impute_censored(log2_matrix, method="mindet", q=0.01)
+imputed = impute_censored(log2_matrix, method="mindet", quantile=0.01)
 
 # KNN: k-nearest neighbor imputation
-imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10)
+imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10)
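
To illustrate what "replace NaN with per-column quantile" means, here is a dependency-free sketch of the MinDet idea. It is not mokume's `impute_censored`; `mindet_impute` and `column_quantile` are hypothetical names, the matrix is a plain list of rows (proteins) by columns (samples), and the quantile uses simple linear interpolation, which may differ from the library's choice.

```python
import math


def column_quantile(values, q):
    """Linear-interpolation quantile over the non-NaN values of one column."""
    finite = sorted(v for v in values if not math.isnan(v))
    if not finite:
        return math.nan
    pos = q * (len(finite) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return finite[lo] * (1 - frac) + finite[hi] * frac


def mindet_impute(matrix, quantile=0.01):
    """Replace each NaN with the given low quantile of its column."""
    n_cols = len(matrix[0])
    fills = [column_quantile([row[j] for row in matrix], quantile) for j in range(n_cols)]
    return [[fills[j] if math.isnan(v) else v for j, v in enumerate(row)] for row in matrix]


nan = math.nan
log2_matrix = [[10.0, nan], [12.0, 8.0], [nan, 9.0]]
imputed = mindet_impute(log2_matrix, quantile=0.0)
```

With `quantile=0.0` each missing value becomes the column minimum, which is the limiting case of the "minimum detection" assumption for left-censored data.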

Copilot · 2026-04-14T07:14:32Z

+class TestIHW:
+    def test_ihw_via_de(self):
+        mat, sa, sb, _ = _make_protein_matrix()
+        wide, s2c = _make_wide_df(mat, sa, sb)
+        de = DifferentialExpression(
+            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
+        )
+        result = de.run(wide, s2c, ("A", "B"))
+        assert len(result) > 0
+        assert "adj_pvalue" in result.columns
+
+    def test_ihw_fallback_on_small_data(self):
+        # Very few proteins → should fall back to BH
+        mat, sa, sb, _ = _make_protein_matrix(n_proteins=5)
+        wide, s2c = _make_wide_df(mat, sa, sb)
+        de = DifferentialExpression(
+            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
+        )
+        result = de.run(wide, s2c, ("A", "B"))
+        assert len(result) > 0
+


The IHW tests currently only assert that an adj_pvalue column exists, which would also pass if BH is being applied (or if IHW is ignored). Since --de-fdr-method ihw is a user-facing option, add an assertion that specifically verifies IHW is applied (or that the BH fallback path is taken for small datasets).
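
One possible shape for such an assertion is to spy on the correction helper and check that it was actually called. The sketch below uses `unittest.mock` on a tiny stand-in class; `TinyDE`, `_ihw`, `_bh`, and `ihw_was_used` are hypothetical, and the two placeholder corrections are not real IHW or BH. In the real test the spy would wrap the project's own IHW helper.

```python
from unittest import mock


class TinyDE:
    """Hypothetical stand-in for the FDR dispatch in DifferentialExpression."""

    def __init__(self, fdr_method="bh"):
        self.fdr_method = fdr_method

    def _ihw(self, pvalues):
        # Placeholder arithmetic only; not a real IHW implementation.
        return [min(1.0, p * 2) for p in pvalues]

    def _bh(self, pvalues):
        # Placeholder arithmetic only; not a real BH implementation.
        return [min(1.0, p * len(pvalues)) for p in pvalues]

    def finalize(self, pvalues):
        if self.fdr_method == "ihw":
            return self._ihw(pvalues)
        return self._bh(pvalues)


def ihw_was_used(de, pvalues):
    """Run finalize() with a spy on _ihw and report whether it was called."""
    with mock.patch.object(de, "_ihw", wraps=de._ihw) as spy:
        de.finalize(pvalues)
    return spy.called
```

A test built this way fails if the IHW branch is silently skipped, which the current `"adj_pvalue" in result.columns` check cannot detect.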

Shen-YuFei added 2 commits April 8, 2026 15:57

fix(fasta): normalize protein ID matching for non-standard formats

6f850ef

feat(analysis): add LimROTS/DEqMS/proDA DE, auto-selection, imputation, LOESS, docs

709ad5c

Copilot AI review requested due to automatic review settings April 14, 2026 07:10

Copilot started reviewing on behalf of Shen-YuFei April 14, 2026 07:10

Copilot AI reviewed Apr 14, 2026

Shen-YuFei added 10 commits April 14, 2026 16:43

fix(analysis): clear codacy warnings

98fa792

fix(analysis): clear codacy warnings

ad4e619

Merge branch 'dev' of https://github.com/Shen-YuFei/mokume into dev

43ee40d

fix

45c9a62

docs: align documentation with current codebase and fix API kwarg bug

804f8a8

refactor(analysis): remove dead code, YAGNI guards, and unused exports

ac113d9

fix(analysis): address Copilot PR review — IHW fdr_method, proda empty-obs guard, README examples

c20ef3d

docs(concepts): add differential expression method comparison page

bee09c6

docs: fix README CLI drift and add DE card to index

59da25e

feat(de): replace auto-detect contrasts with explicit --de-contrasts-file, per-contrast heatmaps, fix z-score/PCA plotting

aa0df9e

ypriverol merged commit 48e0249 into bigbio:dev Apr 18, 2026
3 of 4 checks passed

ypriverol merged 12 commits into bigbio:dev from Shen-YuFei:dev
