feat: add extra TGA cnv report #1648

Draft
mathiasbio wants to merge 153 commits into move_cosmic from modify_cnv_report

Conversation

@mathiasbio
Collaborator

Description

[PR description]

Added

  • [Description]

Changed

  • [Description]

Fixed

  • [Description]

Removed

  • [Description]

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [Document Name]

Tests

Feature Tests

  • N/A
  • Test [Description]
    • [Screenshot]

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Validation criteria

Validation criteria to be added to validation report PR: [LINK-TO-VALIDATION-REPORT-PR from the validations repository]

Version specific criteria

  • Text here or N/A

Important

One of the checkboxes below for validation needs to be checked.

  • Added version specific validation criteria to validation report
  • Changes validated in standard sections: [validation-section]
  • Validation criteria not necessary

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Validation criteria
    • Completed the validation criteria section of the template.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Validation criteria
    • The author has completed the validation criteria section of the template
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

Contributor

@fevac fevac left a comment

A huge amount of work here, but I think this still needs some work. I don't think we should write raw HTML as such but use a library instead. I also think interactive plots would be nicer, but that could be left for the future. There's a lot of code here and it's a bit difficult to review the main files in the script. For the future, I think it would be better to ask for reviews earlier in the process, before adding so much complexity, if the PR cannot be split into multiple ones.

Comment on lines 25 to 34
CURATED_CANCER_GENES: set[str] = {
    "TP53", # <--- requested by cust087
    "DLEU1", # <--- requested by cust087
    "DLEU2",
    "RB1", # <--- requested by cust087
    "KMT2D",
    "KMT2A",
    "ATM" # <--- requested by cust087
}

Contributor

this should be outside of the codebase for easy update

Contributor

can it be added to the cancer gene list?

Collaborator Author

I think it should be its own argument in that case, because that gene list has a very particular format from OncoKB. But yeah, I think that would be ideal. Want me to move it to a file and open another issue in CG and servers?

Collaborator Author

At the moment the logic is that all genes that are cancer genes will be marked in the plots (if the gene is > [number of targets]). That's the main point of this, and not all of cust087's genes of interest are in OncoKB, which is why this amendment is made. This wouldn't exclude other genes from being marked, however; they can still be marked if they are > [number of targets] and overlap CNVs.

At least that's the logic I'm thinking now...but it might change...
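
One way to keep the curated list outside the codebase, as discussed in this thread, is a plain-text file passed in as its own argument. The sketch below is only an illustration: the function name `load_curated_genes` and the file format (one gene symbol per line, `#` starting a comment) are assumptions, not something this PR defines.

```python
from pathlib import Path
from typing import Set, Union


def load_curated_genes(path: Union[str, Path]) -> Set[str]:
    """Read gene symbols from a text file (hypothetical format).

    One symbol per line; anything after '#' is treated as a comment,
    and blank lines are skipped.
    """
    genes: Set[str] = set()
    for line in Path(path).read_text().splitlines():
        symbol = line.split("#", 1)[0].strip()
        if symbol:
            genes.add(symbol)
    return genes
```

With such a file, a customer-specific request becomes a data update rather than a code change, and the per-gene comments ("requested by cust087") can stay next to the symbols.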

df: pd.DataFrame,
columns: Iterable[str],
decimals: int = 3,
inplace: bool = False,
Contributor

  • Why do you want to keep the original? Is it needed? Otherwise just remove this to simplify the code.


Missing columns are ignored, non-numeric values are safely coerced, and NaNs are preserved.
"""
target = df if inplace else df.copy()
Contributor

you don't need to do this if you use the inplace param within pandas

if not existing:
    return target

# Ensure numeric (safe if already numeric, safe if NaN present)
Contributor

Why would we not have numeric values? We should know the format of the file we're using. Is it mixed?
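
If the input columns can be trusted to be numeric, the three comments above point toward a much smaller helper. A minimal sketch, assuming a hypothetical name `round_columns` and relying on `DataFrame.round` accepting a column-to-decimals mapping:

```python
from typing import Iterable

import pandas as pd


def round_columns(df: pd.DataFrame, columns: Iterable[str], decimals: int = 3) -> pd.DataFrame:
    """Return a copy of df with the given columns rounded (hypothetical helper)."""
    # DataFrame.round takes a {column: decimals} mapping and ignores keys that
    # are not columns of df, so no explicit "existing columns" check is needed.
    # NaNs stay NaN and columns not listed are left as they are.
    return df.round({column: decimals for column in columns})
```

This drops the `inplace` switch and the numeric coercion entirely; whether that is safe depends on the answer to the question about mixed values.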

Comment on lines 61 to 63
# =============================================================================
# Small helpers to reduce duplication and keep csv_to_html_table readable
# =============================================================================
Contributor

Suggested change
# =============================================================================
# Small helpers to reduce duplication and keep csv_to_html_table readable
# =============================================================================

we don't need that

# =============================================================================


def _png_to_data_uri(png_path: str | Path | None) -> str | None:
Contributor

There's a mix of styles here. Balsamic usually uses the Optional notation. I do like this style better (it is also used in the cg code), but it's good practice to stick to one style.
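
For comparison, the same signature written in the `Optional` notation mentioned above. Only the signature comes from the diff; the body is a guess at what such a helper does (base64-encode the file into a data URI), and the name without the leading underscore is used here just for illustration.

```python
import base64
from pathlib import Path
from typing import Optional, Union


def png_to_data_uri(png_path: Optional[Union[str, Path]]) -> Optional[str]:
    """Return a base64 data URI for a PNG file, or None if no usable path is given."""
    if png_path is None:
        return None
    path = Path(png_path)
    if not path.is_file():
        return None
    # Assumed behaviour: embed the raw file bytes as a base64 data URI.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:image/png;base64,{encoded}"
```

`Optional[Union[str, Path]]` and `str | Path | None` describe the same type; the point of the comment is to pick one spelling and use it throughout.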

Contributor

I think we should minimise the amount of hand-written HTML and use standard libraries instead, to make it less prone to errors, more readable and more maintainable. In Python I've used plotly and jinja2 in the past. It can also be done in R. There might be better libraries too.
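
As a rough illustration of the jinja2 route suggested above, the sketch below renders a table from a template instead of concatenating HTML strings. The template, the function name `render_table` and the column names are hypothetical; `Environment(autoescape=True)` and `from_string` are standard jinja2 calls.

```python
from jinja2 import Environment

# Hypothetical template: one header row, then one row per record.
TABLE_TEMPLATE = """<table>
  <tr>{% for name in columns %}<th>{{ name }}</th>{% endfor %}</tr>
{%- for row in rows %}
  <tr>{% for cell in row %}<td>{{ cell }}</td>{% endfor %}</tr>
{%- endfor %}
</table>"""


def render_table(columns, rows) -> str:
    """Render an HTML table; autoescape protects against stray '<' or '&' in cells."""
    env = Environment(autoescape=True)
    return env.from_string(TABLE_TEMPLATE).render(columns=columns, rows=rows)


html = render_table(["gene", "log2"], [["TP53", -0.8], ["RB1", 0.1]])
```

Keeping the markup in a template separates layout from the Python logic, which is the maintainability gain the comment is asking for.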

Contributor

Similarly here, I would use plotly, as you can make interactive plots instead of static ones.

import numpy as np
import pandas as pd
from pandas.errors import EmptyDataError
import fitz
Contributor

it looks like this library isn't maintained

Contributor

@fevac fevac left a comment

Hej, sorry, but this code needs a lot of work and the logic also needs to be discussed. It's currently very messy and hard to understand. A lot of things could be simplified and restructured. Let's go through it together.

Try to stick to 1 function -> 1 task as much as possible. Also, the logic is a little bit all over the place. Plotting functions have a lot of other logic in them, including parsing and renaming. I would say that in general you would like to:

  1. validate input files if this is needed
  2. parse files to generate ready to use intermediate files or other type of data
  3. plot
  4. report

Also, a lot of decisions are made in the plotting and reporting regarding data transformation and thresholds that would be good to discuss. We can try to do some pair programming and go through it together while we discuss.
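
The four steps listed in this comment can be sketched as one small function per step. Everything below is hypothetical (function names, the tab-separated input, the `gene` and `n_targets` columns), and the "plot" is a text stand-in so the example stays dependency-free; it only shows the shape of the separation.

```python
import csv
from pathlib import Path
from typing import Dict, List


def validate_inputs(paths: List[Path]) -> None:
    """Step 1: fail early if an expected input file is missing."""
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input files: {', '.join(missing)}")


def parse_genes(path: Path) -> List[Dict[str, str]]:
    """Step 2: turn the file into ready-to-use records; no plotting logic here."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def plot_genes(records: List[Dict[str, str]]) -> str:
    """Step 3: stand-in for plotting; it only consumes already-parsed records."""
    return "\n".join(f"{r['gene']}: {'#' * int(r['n_targets'])}" for r in records)


def write_report(plot_text: str, out_path: Path) -> Path:
    """Step 4: write the report from finished plot output."""
    out_path.write_text(plot_text + "\n")
    return out_path


def run_report(cnv_path: Path, out_path: Path) -> Path:
    """Chain the four steps; each one can be tested on its own."""
    validate_inputs([cnv_path])
    records = parse_genes(cnv_path)
    return write_report(plot_genes(records), out_path)
```

With this layout, renaming columns or applying thresholds belongs in the parsing step, so the plotting function never has to know where the data came from.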

Comment on lines +26 to +34
CURATED_CANCER_GENES: set[str] = {
    "TP53", # <--- requested by cust087
    "DLEU1", # <--- requested by cust087
    "DLEU2",
    "RB1", # <--- requested by cust087
    "KMT2D",
    "KMT2A",
    "ATM", # <--- requested by cust087
}
Contributor

data should be separate from the codebase

Comment on lines +135 to +136
if chr_col not in cnr.columns or chr_col not in pon.columns:
    raise ValueError(f"Both cnr and pon must contain chromosome column '{chr_col}'")
Contributor

this should always be the case as we know the input data. So it's better to remove it for clarity.

Comment on lines +138 to +139
if spread_col not in pon.columns:
    raise ValueError(f"PON is missing required column '{spread_col}'")
Contributor

this should always be the case as we know the input data. So it's better to remove it for clarity.

Comment on lines +141 to +142
cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True)
pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True)
Contributor

this should not be the case. As far as I see these files do not include chr so this just makes the code messier

cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True)
pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True)

pon = pon.dropna(subset=[spread_col]).copy()
Contributor

I don't think this should ever be the case, so better remove this line

gchunk = gchunk.rename(columns=rename_map)

MIN_GENE_TARGETS = 5
MIN_GENE_TARGETS_CANCER = 5
Contributor

this should be a parameter

Comment on lines +938 to +947
# Rename columns for plotting:
rename_map = {
    "cnvkit_seg_start": "seg_start",
    "cnvkit_seg_end": "seg_end",
    "cnvkit_seg_raw_log2": "seg_log2",
}
targets_col = "n.targets"

gdf = gdf.rename(columns=rename_map)
gchunk = gchunk.rename(columns=rename_map)
Contributor

why is all this needed?


chr_order = _as_chr_order(include_y)

gdf = _normalize_is_cancer_gene(gdf)
Contributor

This is not really normalizing; it is transforming into booleans.

Comment on lines +970 to +972
# Spread filter (applied per-chr later as well); keep spread col always
if use_pon:
    merged = merged[merged["spread"] <= spread_thresh].copy()
Contributor

why filtering?

# =============================================================================


def plot_chromosomes(
Contributor

This should only be plotting, not parsing and other logic. In general you would like to keep the code organised, short and concise. Think about the single responsibility principle: one function -> 1 task.

Comment on lines +337 to +339
1) Cancer genes are ALWAYS highlighted if they meet the cancer target threshold.
2) If highlight_only_cancer is False, also highlight non-cancer genes with LOH/CNV
if they meet the non-cancer target threshold.
Contributor

If I understand correctly, and to make the logic easy to follow, the logic only needs gene_targets_threshold (num) and gene_in_list (bool).
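
One possible reading of this simplification, written out as a sketch. The function name `should_highlight` is hypothetical, the default of 5 mirrors `MIN_GENE_TARGETS` from the diff, and using `>=` ("meets the threshold") rather than `>` is an assumption; it also deliberately leaves out the separate non-cancer branch from the docstring.

```python
def should_highlight(n_targets: int, gene_in_list: bool, gene_targets_threshold: int = 5) -> bool:
    """Highlight a gene only if it is in the gene list and has enough targets."""
    # The threshold is an argument rather than a module-level constant,
    # so callers (or a config file) decide the cut-off.
    return gene_in_list and n_targets >= gene_targets_threshold
```

Whether genes outside the list should ever be highlighted is exactly the open question in this thread; the sketch only shows how small the rule becomes once the threshold is a parameter.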

Comment on lines +381 to +393
"""
Construct pseudo-position x coordinate using variable bin widths.

Width rules:
- Antitarget bins → anti_factor
- Highlighted gene bins → 1.0
- Other target bins → neutral_target_factor

Adds:
type
bin_width
x_coord
"""
Contributor

I don't understand this pseudoposition
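
Going only by the docstring quoted above, the pseudo-position appears to replace genomic coordinates with a running sum of per-bin widths, so that highlighted genes get more horizontal space than antitarget or other target bins. The sketch below is an interpretation, not the PR's implementation: the function name, the input shape, the default factor values and the choice of the bin centre as `x_coord` are all assumptions.

```python
from itertools import accumulate
from typing import List, Set, Tuple


def pseudo_x_coords(
    bins: List[Tuple[str, bool]],  # (gene name, is_antitarget) in genomic order
    highlighted: Set[str],
    anti_factor: float = 0.25,  # assumed value, not taken from the PR
    neutral_target_factor: float = 0.5,  # assumed value, not taken from the PR
) -> List[float]:
    """Return one x coordinate per bin, using variable bin widths."""
    widths = []
    for gene, is_antitarget in bins:
        if is_antitarget:
            widths.append(anti_factor)
        elif gene in highlighted:
            widths.append(1.0)
        else:
            widths.append(neutral_target_factor)
    # The right edge of each bin is the cumulative width; x is the bin centre.
    right_edges = accumulate(widths)
    return [edge - width / 2 for edge, width in zip(right_edges, widths)]
```

If this reading is right, the x axis is an ordinal layout rather than a genomic scale, which would be worth stating in the docstring and on the plot axis.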

exon_map = load_refgene_exons(refgene_path)
genes_df = _add_exons_hit_column(
    genes_df,
    exon_map,
Contributor

which transcript are you using? all? first? random?

@mathiasbio mathiasbio changed the base branch from develop to move_cosmic February 16, 2026 13:46
@sonarqubecloud

Quality Gate failed

Failed conditions
B Reliability Rating on New Code (required ≥ A)

Development

Successfully merging this pull request may close these issues.

[User Story] Improve CNV report

2 participants