Improve protein ID normalization and expand DE analysis in Mokume#18

Merged
ypriverol merged 12 commits into bigbio:dev from Shen-YuFei:dev
Apr 18, 2026

Conversation

@Shen-YuFei Shen-YuFei commented Apr 14, 2026

This pull request introduces comprehensive documentation and usability improvements for differential expression analysis, normalization, and workflow configuration in mokume. It adds a new conceptual guide for differential expression, expands the README with detailed API and CLI usage for new statistical methods and normalization techniques, clarifies configuration options, and improves discoverability of features throughout the documentation. Optional dependency installation instructions are also updated for clarity.

Major documentation and feature additions:

Differential Expression Analysis:

  • Added a new conceptual guide at docs/concepts/differential-expression.md detailing LimROTS, DEqMS, and proDA methods, their strengths, weaknesses, and selection guidance. Includes benchmarking results and practical recommendations.
  • Expanded the README.md with a new section on differential expression, including method descriptions, Python API usage, CLI examples, and output interpretation.
  • Linked the new differential expression documentation from the main index and concept pages for easier navigation.

Normalization and Imputation:

  • Documented LOESS normalization as a new utility in both the README.md and docs/concepts/normalization.md, with code examples and usage notes.
  • Clarified and expanded normalization method options, updating CLI and API documentation for feature-level and sample-level normalization, including new methods like max, global, max_min, hierarchical, and tmm.

Workflow and Configuration Improvements:

  • Updated pipeline configuration examples to use explicit config classes for input, quantification, and batch correction, making advanced options clearer and more discoverable.
  • Added a new "Choose Your Workflow" section to the main documentation index for guiding users to the appropriate entry point.

Installation and Optional Extras:

  • Clarified installation instructions for standard and optional features, including new extras for tissuemap and improved descriptions for batch correction and interactive reports.

These changes make mokume's advanced statistical and normalization methods more accessible, improve onboarding for new users, and ensure that documentation matches the latest feature set and best practices.

Copilot AI review requested due to automatic review settings April 14, 2026 07:10
coderabbitai Bot commented Apr 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc2e1ea4-0af9-4dba-a184-04cb948663c9

codacy-production Bot commented Apr 14, 2026

Not up to standards ⛔

🔴 Issues 16 minor

Alerts:
⚠ 16 issues (≤ 0 issues of at least minor severity)

Results:
16 new issues

| Category | Results |
| --- | --- |
| Documentation | 14 minor |
| CodeStyle | 2 minor |

🟢 Metrics 39 complexity · 6 duplication

| Metric | Results |
| --- | --- |
| Complexity | 39 |
| Duplication | 6 |

Copilot AI left a comment

Pull request overview

This PR adds new differential expression (DE) analysis capabilities to mokume (LimROTS, DEqMS, proDA), improves protein accession normalization for FASTA matching, and expands CLI/config/docs to expose the new options.

Changes:

  • Introduces a new mokume.analysis DE API (DifferentialExpression) plus implementations of LimROTS, DEqMS, and proDA, and wires DE into the pipeline with an "auto" method selector.
  • Improves protein identifier normalization via get_accession()/build_accession_map() and applies it to FASTA-based peptide counting / molecular weight calculation paths.
  • Adds censored-aware imputation utilities and LOESS normalization, plus new tests and documentation/CLI/config updates for these features.
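
To make the accession-normalization item concrete, here is a minimal sketch of what such normalization can look like for UniProt-style FASTA identifiers. The names `normalize_accession` and `build_map` are hypothetical and are not the signatures of mokume's `get_accession()` or `build_accession_map()`; the sketch only assumes the common `db|ACCESSION|ENTRY_NAME` header convention.

```python
def normalize_accession(identifier: str) -> str:
    """Return the bare accession from a UniProt-style identifier.

    Hypothetical helper: 'sp|P12345|ALBU_HUMAN' -> 'P12345'.
    Identifiers without pipe-separated fields are returned as their first token.
    """
    cleaned = identifier.strip().lstrip(">")
    fields = cleaned.split()
    token = fields[0] if fields else ""
    parts = token.split("|")
    # db|ACCESSION|ENTRY_NAME: the accession is the middle field.
    if len(parts) >= 3 and parts[1]:
        return parts[1]
    return token


def build_map(identifiers):
    """Map each normalized accession to the first raw identifier that produced it."""
    mapping = {}
    for raw in identifiers:
        mapping.setdefault(normalize_accession(raw), raw)
    return mapping
```

With a map like this, a quantification table keyed by `sp|P12345|ALBU_HUMAN` and a FASTA index keyed by `P12345` can be joined on the same normalized key.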

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `tests/test_features2proteins_cli.py` | Adds CLI tests covering DirectLFQ and batch correction option plumbing/validation. |
| `tests/test_de_and_imputation.py` | Adds tests for DE methods, IHW path, censored imputation, and LOESS normalization. |
| `tests/test_accession_normalization.py` | Adds tests for accession parsing/mapping and FASTA sequence cleaning helper. |
| `mokume/quantification/ibaq.py` | Normalizes protein identifiers when matching FASTA entries for iBAQ/TPA calculations. |
| `mokume/pipeline/stages.py` | Adds DE method auto-selection and optional peptide-count loading for DEqMS; pins pandas groupby observed. |
| `mokume/pipeline/features_to_proteins.py` | Extends pipeline entrypoint parameters/config wiring for new DE and DirectLFQ options. |
| `mokume/pipeline/config.py` | Adds ImputationConfig, extends DEConfig with fdr_method, and adds imputation to PipelineConfig. |
| `mokume/normalization/loess.py` | Introduces LOESS normalization implementation and functional wrapper. |
| `mokume/normalization/__init__.py` | Exports LOESS normalization API. |
| `mokume/io/fasta.py` | Applies accession normalization and shared nonstandard-AA stripping in FASTA extraction and MW lookup. |
| `mokume/imputation/censored.py` | Adds censored-aware imputation methods (MinProb/MinDet/KNN) and missingness classification. |
| `mokume/imputation/__init__.py` | Exports new censored-aware imputation functions. |
| `mokume/core/constants.py` | Enhances get_accession() and adds build_accession_map() for normalization. |
| `mokume/commands/features2proteins.py` | Expands CLI with batch correction options, new DE methods, and --de-fdr-method; adds argument validation. |
| `mokume/analysis/proda.py` | Adds simplified proDA-style dropout-aware DE implementation. |
| `mokume/analysis/limrots.py` | Adds LimROTS DE implementation with bootstrap/permutation optimization. |
| `mokume/analysis/differential_expression.py` | Refactors DE orchestration, adds IHW support helpers, and integrates LimROTS/DEqMS/proDA dispatch. |
| `mokume/analysis/deqms.py` | Adds DEqMS implementation (spectra-count eBayes with variance–count modeling). |
| `mokume/analysis/__init__.py` | Exports new analysis entrypoints for direct import. |
| `mokume/__init__.py` | Suppresses a pyopenms OPENMS_DATA_PATH warning. |
| `docs/user-guide/features2proteins.md` | Updates user guide examples and option tables for new DE/FDR options. |
| `docs/reference/configuration.md` | Documents new DE config defaults and fdr_method. |
| `docs/reference/cli.md` | Updates CLI reference with batch correction and new DE/FDR options. |
| `README.md` | Documents new analysis/imputation/normalization modules and provides usage examples. |

Comment on lines 204 to 217
 def _finalize_results(self, de_df: pd.DataFrame) -> pd.DataFrame:
     """Apply FDR correction and classify significance."""
-    # FDR correction (Benjamini-Hochberg)
-    reject, adj_pvalues, _, _ = multipletests(
-        de_df["pvalue"].values, method="fdr_bh"
-    )
-    de_df["adj_pvalue"] = adj_pvalues
+    if self.fdr_method == "ihw" and "adj_pvalue" not in de_df.columns:
+        de_df["adj_pvalue"] = _ihw_correction(
+            de_df["pvalue"].values,
+            de_df,
+            alpha=self.fdr_threshold,
+        )
+    elif "adj_pvalue" not in de_df.columns:
+        # Benjamini-Hochberg (default)
+        de_df["adj_pvalue"] = multipletests(
+            de_df["pvalue"].values, method="fdr_bh"
+        )[1]

Copilot AI Apr 14, 2026

fdr_method is effectively ignored for all DE methods right now: run_limrots, run_deqms, and run_proda each return an adj_pvalue column, and _finalize_results only applies IHW/BH when adj_pvalue is missing. This means --de-fdr-method ihw (and even BH recomputation) won’t take effect. Consider always recomputing/overriding adj_pvalue based on self.fdr_method (optionally preserving any method-specific adjusted values under a different column name).
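
The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not mokume's implementation: a plain dict of columns stands in for the DataFrame, `ihw_correction` is a caller-supplied callable standing in for `_ihw_correction` (whose real signature takes more arguments), and `method_adj_pvalue` is a hypothetical column name for the values the DE method itself produced.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (pure Python, no NaN handling)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def finalize_results(table, fdr_method="bh", ihw_correction=None):
    """Always recompute 'adj_pvalue' according to fdr_method.

    `table` maps column name -> list of values (a stand-in for a DataFrame).
    Adjusted values already present are preserved under the hypothetical
    column name 'method_adj_pvalue' instead of blocking the recomputation.
    """
    result = dict(table)
    if "adj_pvalue" in result:
        result["method_adj_pvalue"] = result["adj_pvalue"]
    if fdr_method == "ihw" and ihw_correction is not None:
        result["adj_pvalue"] = ihw_correction(result["pvalue"])
    else:
        result["adj_pvalue"] = bh_adjust(result["pvalue"])
    return result
```

The design choice here is that the user-facing option always wins: whatever a DE method wrote into `adj_pvalue` is kept for reference but never short-circuits the requested correction.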

Comment thread mokume/analysis/proda.py
Comment on lines +41 to +43
    if n_obs < 5 or n_obs == n_total:
        return np.nanmedian(observed) - 2.0, 1.0

Copilot AI Apr 14, 2026

_fit_dropout_curve() calls np.nanmedian(observed) even when observed is empty (n_obs == 0), which yields nan parameters and can propagate NaNs through the DE results. Handle the all-missing case explicitly (e.g., return a fixed fallback rho/zeta based on a global default) before calling nanmedian.
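
A guard for the all-missing case might look like the sketch below. It uses only the standard library so it can run on its own; `fallback_dropout_params`, `DEFAULT_RHO`, and `DEFAULT_ZETA` are hypothetical names, and the default values are placeholders rather than numbers taken from mokume.

```python
import math
import statistics

# Hypothetical global defaults for the dropout curve; placeholder values
# chosen for illustration only.
DEFAULT_RHO = 0.0
DEFAULT_ZETA = 1.0


def fallback_dropout_params(observed, n_total):
    """Return (rho, zeta) fallback parameters, or None if a full fit should run.

    `observed` holds the observed intensities; NaNs are ignored.
    """
    values = [v for v in observed if not math.isnan(v)]
    n_obs = len(values)
    if n_obs == 0:
        # All-missing case: no median exists, so use fixed defaults.
        return DEFAULT_RHO, DEFAULT_ZETA
    if n_obs < 5 or n_obs == n_total:
        # Same shortcut as the reviewed code, now with a guaranteed finite median.
        return statistics.median(values) - 2.0, 1.0
    return None
```

Checking `n_obs == 0` before computing any median keeps NaN parameters from ever entering the downstream DE results.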

Comment thread README.md Outdated
# Multiple contrasts
contrasts = [("Treatment", "Control"), ("Drug", "Control")]
results = de.run_comparisons(protein_df, sample_to_condition, contrasts)
# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}

Copilot AI Apr 14, 2026

The README example for run_comparisons() shows keys like "Treatment_vs_Control", but the implementation uses f"{contrast[0]}-{contrast[1]}" (e.g., "Treatment-Control"). Update the example to match the actual keys returned so users can index results correctly.

Suggested change
- # Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}
+ # Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame}
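
For readers who want to index the results programmatically rather than hard-code key strings, a small helper can rebuild the key from the contrast tuple. `contrast_key` is a hypothetical name; the format string is the one quoted in the review comment, and it would need updating if the implementation changes.

```python
def contrast_key(contrast):
    """Build the result key for a pair of condition names.

    Mirrors the f"{contrast[0]}-{contrast[1]}" format quoted in the review.
    """
    return f"{contrast[0]}-{contrast[1]}"


contrasts = [("Treatment", "Control"), ("Drug", "Control")]
keys = [contrast_key(c) for c in contrasts]
```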

Comment thread README.md Outdated
Comment on lines +229 to +235
imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)

# MinDet: replace NaN with per-column quantile
imputed = impute_censored(log2_matrix, method="mindet", q=0.01)

# KNN: k-nearest neighbor imputation
imputed = impute_censored(log2_matrix, method="knn", k=10)

Copilot AI Apr 14, 2026

The impute_censored README examples use keyword args (q, tune_sigma, k) that don’t match the actual function signature (quantile, shift, scale, n_neighbors, …). As written, these snippets will raise TypeError. Please update the example arguments to the real parameter names.

Suggested change
- imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)
- # MinDet: replace NaN with per-column quantile
- imputed = impute_censored(log2_matrix, method="mindet", q=0.01)
- # KNN: k-nearest neighbor imputation
- imputed = impute_censored(log2_matrix, method="knn", k=10)
+ imputed = impute_censored(log2_matrix, method="minprob", quantile=0.01, scale=1.0)
+ # MinDet: replace NaN with per-column quantile
+ imputed = impute_censored(log2_matrix, method="mindet", quantile=0.01)
+ # KNN: k-nearest neighbor imputation
+ imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10)
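
As background for the `quantile` parameter, the MinDet idea described in the snippet's comment (replace NaN with a per-column low quantile) can be sketched in plain Python. This is not mokume's `impute_censored`; `column_quantile` and `mindet_impute` are hypothetical helpers, and the linear-interpolation quantile is an assumption about how the quantile is computed.

```python
import math


def column_quantile(values, q):
    """Linear-interpolation quantile of the non-NaN values (q in [0, 1])."""
    data = sorted(v for v in values if not math.isnan(v))
    if not data:
        return math.nan
    pos = q * (len(data) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + frac * (data[upper] - data[lower])


def mindet_impute(matrix, quantile=0.01):
    """Replace NaNs in each column with that column's low quantile.

    `matrix` is a list of rows (proteins), each a list of per-sample values.
    """
    n_cols = len(matrix[0]) if matrix else 0
    fills = [
        column_quantile([row[j] for row in matrix], quantile)
        for j in range(n_cols)
    ]
    return [
        [fills[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in matrix
    ]
```

Because the fill value is deterministic per column, every missing entry in a sample receives the same number, which is what distinguishes MinDet from the randomized MinProb variant.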

Comment on lines +183 to +203
class TestIHW:
    def test_ihw_via_de(self):
        mat, sa, sb, _ = _make_protein_matrix()
        wide, s2c = _make_wide_df(mat, sa, sb)
        de = DifferentialExpression(
            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
        )
        result = de.run(wide, s2c, ("A", "B"))
        assert len(result) > 0
        assert "adj_pvalue" in result.columns

    def test_ihw_fallback_on_small_data(self):
        # Very few proteins → should fall back to BH
        mat, sa, sb, _ = _make_protein_matrix(n_proteins=5)
        wide, s2c = _make_wide_df(mat, sa, sb)
        de = DifferentialExpression(
            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
        )
        result = de.run(wide, s2c, ("A", "B"))
        assert len(result) > 0

Copilot AI Apr 14, 2026

The IHW tests currently only assert that an adj_pvalue column exists, which would also pass if BH is being applied (or if IHW is ignored). Since --de-fdr-method ihw is a user-facing option, add an assertion that specifically verifies IHW is applied (or that the BH fallback path is taken for small datasets).
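
One way to make such a test discriminate between the two paths is to spy on the correction functions with `unittest.mock`. The sketch below uses a stand-in class, `FakeDE`, so it runs on its own; it is not mokume's `DifferentialExpression`, and the `_ihw`/`_bh` method names and the 10-protein fallback threshold are assumptions made for illustration.

```python
from unittest import mock


class FakeDE:
    """Stand-in for a DE runner; NOT mokume's DifferentialExpression."""

    def __init__(self, fdr_method="bh", min_proteins_for_ihw=10):
        self.fdr_method = fdr_method
        # Hypothetical threshold below which IHW falls back to BH.
        self.min_proteins_for_ihw = min_proteins_for_ihw

    def _ihw(self, pvalues):
        return list(pvalues)  # placeholder for the IHW adjustment

    def _bh(self, pvalues):
        return list(pvalues)  # placeholder for the BH adjustment

    def adjust(self, pvalues):
        if self.fdr_method == "ihw" and len(pvalues) >= self.min_proteins_for_ihw:
            return self._ihw(pvalues)
        return self._bh(pvalues)


def correction_paths_used(de, pvalues):
    """Run the adjustment with spies attached; return (ihw_called, bh_called)."""
    with mock.patch.object(de, "_ihw", wraps=de._ihw) as ihw_spy, \
         mock.patch.object(de, "_bh", wraps=de._bh) as bh_spy:
        de.adjust(pvalues)
    return ihw_spy.called, bh_spy.called
```

In the real test suite the same pattern would patch whichever function performs the IHW correction and assert that it was called for the large dataset and not called for the five-protein fallback case.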

@ypriverol ypriverol merged commit 48e0249 into bigbio:dev Apr 18, 2026
3 of 4 checks passed