feat: add extra TGA cnv report by mathiasbio · Pull Request #1648 · Clinical-Genomics/BALSAMIC

mathiasbio · 2026-01-27T10:28:45Z

Description

[PR description]

Added

[Description]

Changed

[Description]

Fixed

[Description]

Removed

[Description]

Documentation

N/A
Updated Balsamic documentation to reflect the changes as needed for this PR.
- [Document Name]

Tests

Feature Tests

N/A
Test [Description]
- [Screenshot]

Pipeline Integrity Tests

Report deliver (generation of the .hk file)
- N/A
- Verified
TGA T/O Workflow
- N/A
- Verified
TGA T/N Workflow
- N/A
- Verified
UMI T/O Workflow
- N/A
- Verified
UMI T/N Workflow
- N/A
- Verified
WGS T/O Workflow
- N/A
- Verified
WGS T/N Workflow
- N/A
- Verified
QC Workflow
- N/A
- Verified
PON Workflow
- N/A
- Verified

Clinical Genomics Stockholm

Documentation

Atlas documentation
- N/A
- Updated: [Link]
Web portal for Clinical Genomics
- N/A
- Updated: [Link]

Panel of Normal specific criteria

The PR includes the addition of a new Panel of Normals
The samples have been verified to adhere to the sample selection criteria on Atlas PoN creation instructions for Balsamic

User Changes

N/A
This PR affects the output files or results.
- User feedback is considered unnecessary because [Justification].
- Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

Stored files in Housekeeper
- N/A
- Updated: [Link]
CG (CLI and delivered/uploaded files)
- N/A
- Updated: [Link]
Servers (configuration files on Hasta)
- N/A
- Updated: [Link]
Scout interface
- N/A
- Updated: [Link]

Validation criteria

Validation criteria to be added to validation report PR: [LINK-TO-VALIDATION-REPORT-PR from the validations repository]

Version specific criteria

Text here or N/A

Important

One of the checkboxes below for validation needs to be checked

Added version specific validation criteria to validation report
Changes validated in standard sections: [validation-section]
Validation criteria not necessary

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

PR Description
- Provided a comprehensive description of the PR.
- Linked relevant user stories or issues to the PR.
Documentation
- Verified and updated documentation if necessary.
Validation criteria
- Completed the validation criteria section of the template.
Tests
- Described and tested the functionality addressed in the PR.
- Ensured integration of the new code with existing workflows.
- Confirmed that meaningful unit tests were added for the changes introduced.
- Checked that the PR has successfully passed all relevant code smells and coverage checks.
Review
- Addressed and resolved all the feedback provided during the code review process.
- Obtained final approval from designated reviewers.

For Reviewers

Code
- Code implements the intended features or fixes the reported issue.
- Code follows the project's coding standards and style guide.
Documentation
- Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
Validation criteria
- The author has completed the validation criteria section of the template
Tests
- The author provided a description of their manual testing, including consideration of edge cases and boundary
  conditions where applicable, with satisfactory results.
Review
- Confirmed that the developer has addressed all the comments during the code review.

fevac

A huge amount of work here, but I think this still needs some work. I don't think we should write HTML code directly; we should use a library instead. I also think interactive plots would be nicer, but that could be left for the future. There's a lot of code here and it's a bit difficult to review the main files in the script. For the future, I think it would be better to ask for reviews earlier in the process, before adding so much complexity, if the PR cannot be split into multiple ones.

fevac · 2026-02-05T08:43:09Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+CURATED_CANCER_GENES: set[str] = {
+    "TP53",  # <--- requested by cust087
+    "DLEU1",  # <--- requested by cust087
+    "DLEU2",
+    "RB1",  # <--- requested by cust087
+    "KMT2D",
+    "KMT2A",
+    "ATM" # <--- requested by cust087
+}
+


this should be outside of the codebase for easy update

can it be added to the cancer gene list?

I think it should be its own argument in that case, because that gene list has a very particular format from OncoKB. But yeah, I think that would be ideal. Want me to move it to a file and open another issue in CG and servers?

At the moment the logic is that all genes that are cancer genes will be marked in the plots (if the gene is > [number of targets]). That's the main point of this, and not all of cust087's genes of interest are in OncoKB, which is why this amendment was made. This wouldn't exclude other genes from being marked, however; they can still be marked if they are > [number of targets] and overlap CNVs.

At least that's the logic I'm thinking now...but it might change...
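
One way to keep the curated genes out of the codebase, as suggested above, would be to read them from a plain-text file passed as its own argument. This is a minimal sketch, not the PR's implementation: the helper name `load_curated_genes` and the one-symbol-per-line file layout with `#` comments are assumptions made for illustration.

```python
from pathlib import Path


def load_curated_genes(path: Path) -> set[str]:
    """Read one gene symbol per line; blank lines and '#' comments are skipped.

    Hypothetical helper: the file layout is an assumption, not the OncoKB format.
    """
    genes: set[str] = set()
    for line in path.read_text().splitlines():
        # Drop anything after a '#' so per-gene notes can live in the file.
        symbol = line.split("#", 1)[0].strip()
        if symbol:
            genes.add(symbol)
    return genes
```

With this, a note such as "requested by cust087" can live next to the gene in the data file instead of in a code comment.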

fevac · 2026-02-05T09:20:46Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+    df: pd.DataFrame,
+    columns: Iterable[str],
+    decimals: int = 3,
+    inplace: bool = False,


Why do you want to keep the original? Is it needed? Otherwise just remove this to simplify the code.

fevac · 2026-02-05T09:21:18Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+
+    Missing columns are ignored, non-numeric values are safely coerced, and NaNs are preserved.
+    """
+    target = df if inplace else df.copy()


you don't need to do this if you use the inplace param within pandas

fevac · 2026-02-05T09:22:33Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+    if not existing:
+        return target
+
+    # Ensure numeric (safe if already numeric, safe if NaN present)


Why would we not have numeric values? We should know the format of the file we're using. Is it mixed?

fevac · 2026-02-05T09:23:53Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+# =============================================================================
+# Small helpers to reduce duplication and keep csv_to_html_table readable
+# =============================================================================


Suggested change

# =============================================================================
# Small helpers to reduce duplication and keep csv_to_html_table readable
# =============================================================================

we don't need that

fevac · 2026-02-05T09:26:50Z

BALSAMIC/assets/scripts/cnv_qc_report.py

+# =============================================================================
+
+
+def _png_to_data_uri(png_path: str | Path | None) -> str | None:


There's a mix of styles here. Balsamic usually uses the Optional notation. I do like this style better (it is also used in the cg code), but it's good practice to stick to one style.

fevac · 2026-02-05T09:54:19Z

BALSAMIC/assets/scripts/cnv_qc_report.py

I think we should minimise the amount of HTML coding and use standard libraries instead, to make it less prone to errors, more readable and more maintainable. In Python I've used plotly and jinja2 in the past. It can also be done in R. There might be better libraries too.

fevac · 2026-02-05T10:03:12Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

similarly here, I would use plotly as you can make interactive plots instead of static ones

fevac · 2026-02-11T10:16:24Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_utils.py

+import numpy as np
+import pandas as pd
+from pandas.errors import EmptyDataError
+import fitz


it looks like this library isn't maintained

fevac

Hej, sorry but this code needs a lot of work and the logic also needs to be discussed. It's currently very messy and hard to understand. A lot of things could be simplified and restructured. Let's go through it together.

Try to stick to 1 function -> 1 task as much as possible. Also, the logic is a little bit all over the place. Plotting functions contain a lot of other logic, including parsing and renaming. I would say that in general you would like to:

- validate input files if this is needed
- parse files to generate ready-to-use intermediate files or other types of data
- plot
- report

Also, a lot of decisions are made in the plotting and reporting regarding data transformation and thresholds that it would be good to discuss. Let's try to do some pair programming and go through it together while we discuss.
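
The four steps listed above (validate, parse, plot, report) can be sketched as one small function per task. This is an illustrative skeleton only: the function names, the tab-separated input with `chromosome` and `log2` columns, and the text placeholder standing in for real plotting are all assumptions, not code from the PR.

```python
import csv
from pathlib import Path


def validate_inputs(paths: list[Path]) -> None:
    """Step 1: fail early if an expected input file is missing."""
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")


def parse_segments(path: Path) -> list[dict]:
    """Step 2: turn a tab-separated file into ready-to-use records."""
    with path.open(newline="") as handle:
        return [
            {"chromosome": row["chromosome"], "log2": float(row["log2"])}
            for row in csv.DictReader(handle, delimiter="\t")
        ]


def plot_segments(segments: list[dict]) -> list[str]:
    """Step 3: plotting only. Placeholder that renders one text line per segment."""
    return [f"{s['chromosome']}: {s['log2']:+.2f}" for s in segments]


def build_report(plot_lines: list[str]) -> str:
    """Step 4: assemble the report from pieces that are already rendered."""
    return "\n".join(["CNV report", *plot_lines])


def main(segment_file: Path) -> str:
    """Wire the steps together; no step reaches into another step's job."""
    validate_inputs([segment_file])
    return build_report(plot_segments(parse_segments(segment_file)))
```

Keeping parsing and renaming in the parse step means the plotting function receives data that is already in the shape it needs.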

fevac · 2026-02-12T08:29:34Z

BALSAMIC/assets/scripts/cnv_report/cnv_qc_report.py

+CURATED_CANCER_GENES: set[str] = {
+    "TP53",  # <--- requested by cust087
+    "DLEU1",  # <--- requested by cust087
+    "DLEU2",
+    "RB1",  # <--- requested by cust087
+    "KMT2D",
+    "KMT2A",
+    "ATM",  # <--- requested by cust087
+}


data should be separate from the codebase

fevac · 2026-02-12T09:08:42Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    if chr_col not in cnr.columns or chr_col not in pon.columns:
+        raise ValueError(f"Both cnr and pon must contain chromosome column '{chr_col}'")


this should always be the case as we know the input data. So it's better to remove it for clarity.

fevac · 2026-02-12T09:09:20Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    if spread_col not in pon.columns:
+        raise ValueError(f"PON is missing required column '{spread_col}'")


this should always be the case as we know the input data. So it's better to remove it for clarity.

fevac · 2026-02-12T09:12:56Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True)
+    pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True)


this should not be the case. As far as I see these files do not include chr so this just makes the code messier

fevac · 2026-02-12T09:17:48Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True)
+    pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True)
+
+    pon = pon.dropna(subset=[spread_col]).copy()


I don't think this should ever be the case, so better remove this line

fevac · 2026-02-13T09:15:59Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    gchunk = gchunk.rename(columns=rename_map)
+
+    MIN_GENE_TARGETS = 5
+    MIN_GENE_TARGETS_CANCER = 5


this should be a parameter
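
A minimal sketch of what turning the two constants into parameters could look like. The function name `genes_to_label` and its inputs are hypothetical; the defaults of 5 mirror the values in the diff, and using `>=` for the comparison is an assumption.

```python
def genes_to_label(
    targets_per_gene: dict[str, int],
    cancer_genes: set[str],
    min_gene_targets: int = 5,
    min_gene_targets_cancer: int = 5,
) -> list[str]:
    """Return genes with enough targets to be labelled, using caller-supplied thresholds.

    Hypothetical helper: thresholds are arguments instead of constants defined
    inside the plotting function.
    """
    selected = []
    for gene, n_targets in targets_per_gene.items():
        # Cancer genes and other genes can use different minimum target counts.
        threshold = min_gene_targets_cancer if gene in cancer_genes else min_gene_targets
        if n_targets >= threshold:
            selected.append(gene)
    return sorted(selected)
```

The thresholds can then be set from the command line or a config file without editing the plotting code.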

fevac · 2026-02-13T09:16:21Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    # Rename columns for plotting:
+    rename_map = {
+        "cnvkit_seg_start": "seg_start",
+        "cnvkit_seg_end": "seg_end",
+        "cnvkit_seg_raw_log2": "seg_log2",
+    }
+    targets_col = "n.targets"
+
+    gdf = gdf.rename(columns=rename_map)
+    gchunk = gchunk.rename(columns=rename_map)


why is all this needed?

fevac · 2026-02-13T09:19:54Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+
+    chr_order = _as_chr_order(include_y)
+
+    gdf = _normalize_is_cancer_gene(gdf)


this is not really normalizing, it is transforming into booleans

fevac · 2026-02-13T09:22:22Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    # Spread filter (applied per-chr later as well); keep spread col always
+    if use_pon:
+        merged = merged[merged["spread"] <= spread_thresh].copy()


why filtering?

fevac · 2026-02-13T09:33:04Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+# =============================================================================
+
+
+def plot_chromosomes(


this should be only plotting, not parsing and other logic. In general you would like to keep code organised, short and concise. Think about the single responsibility principle: 1 function -> 1 task

fevac · 2026-02-16T11:09:59Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+      1) Cancer genes are ALWAYS highlighted if they meet the cancer target threshold.
+      2) If highlight_only_cancer is False, also highlight non-cancer genes with LOH/CNV
+         if they meet the non-cancer target threshold.


If I understand correctly, to keep the logic easy to follow it only needs gene_targets_threshold (num) and gene_in_list (bool)

fevac · 2026-02-16T11:11:06Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_plotting.py

+    """
+    Construct pseudo-position x coordinate using variable bin widths.
+
+    Width rules:
+      - Antitarget bins → anti_factor
+      - Highlighted gene bins → 1.0
+      - Other target bins → neutral_target_factor
+
+    Adds:
+      type
+      bin_width
+      x_coord
+    """


I don't understand this pseudoposition
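
To make the idea in the docstring concrete, here is one possible reading of it as code: each bin gets a width from its category, and the x coordinate is the running sum of those widths, so highlighted genes take up more horizontal space than antitarget regions. This is a sketch under stated assumptions. The function name, the default factor values, the `highlighted` flag and the choice of the right bin edge as `x_coord` are all hypothetical, and unlike the PR's function the `type` column is taken as input here.

```python
from itertools import accumulate


def add_pseudo_positions(
    bins: list[dict],
    anti_factor: float = 0.1,
    neutral_target_factor: float = 0.5,
) -> list[dict]:
    """Add 'bin_width' and 'x_coord' to each bin, keeping the input order.

    Each bin needs 'type' ('Antitarget' or 'Target') and 'highlighted' (bool).
    x_coord is the running sum of widths, i.e. the right edge of each bin
    (an assumption; the left edge or the midpoint would work the same way).
    """
    widths = []
    for b in bins:
        if b["type"] == "Antitarget":
            widths.append(anti_factor)  # antitarget bins are compressed
        elif b["highlighted"]:
            widths.append(1.0)  # bins in highlighted genes get full width
        else:
            widths.append(neutral_target_factor)  # other target bins
    return [
        {**b, "bin_width": w, "x_coord": x}
        for b, w, x in zip(bins, widths, accumulate(widths))
    ]
```

The resulting x axis no longer reflects genomic distance, only bin order and category, which may be worth stating in the plot itself.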

fevac · 2026-02-16T11:29:00Z

BALSAMIC/assets/scripts/cnv_report/cnv_report_utils.py

+        exon_map = load_refgene_exons(refgene_path)
+        genes_df = _add_exons_hit_column(
+            genes_df,
+            exon_map,


which transcript are you using? all? first? random?

sonarqubecloud · 2026-02-16T16:03:43Z

Quality Gate failed

Failed conditions
B Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


mathiasbio added 30 commits November 24, 2025 15:53

add svg

9cc877a

add cancer genelist

ef57884

fix

563727b

add caseid to plot

7a6b842

remove gene list doesnt work

438e232

add cnvplot script

fe14dd0

upgrade matplotlib

a94bca9

add new files

487b049

add new rule

7428cae

bug fix

438c44f

fix

a3ade66

fic

5777dea

fix

aae0520

fix

9e53a6a

fix

fbcf112

test

f9a2beb

fix

4918131

fix

653df5f

env

a397366

remove singularity

15251ee

test

0bec280

change pdf tool

de763df

move

f937096

install pdf

03cbe87

fix

ceff2e1

fix

d64dec1

fix

b5d9b88

fix

495f4c6

black and format

c42f21e

test

917113a

fevac reviewed Feb 5, 2026

beatrizsavinhas mentioned this pull request Feb 5, 2026

feat: automatic download of cytoband coordinates file via init command #1651

Merged

44 tasks

mathiasbio added 12 commits February 5, 2026 13:31

refactor html functions into jinja template

258143a

add subdir to path

8b5b9c5

refactor and add column category filters in table

d801968

fix highlighting

cd24181

Merge branch 'develop' into modify_cnv_report

5860a54

Merge branch 'modify_cnv_report' of github.com:Clinical-Genomics/BALS…

eb7cd1c

…AMIC into modify_cnv_report

merge develop and adjust cytoband source

daa63bb

clean

866eeb7

fix model

d11caa1

fix

03db4a0

adress review comments

bc745f7

black

6429139

fevac reviewed Feb 11, 2026

refactor js

34386b0

fevac reviewed Feb 13, 2026

larger bed

d4b6e96

fevac reviewed Feb 16, 2026

mathiasbio added 2 commits February 16, 2026 14:45

Merge branch 'move_cosmic' of github.com:Clinical-Genomics/BALSAMIC i…

2487ec2

…nto move_cosmic

Merge branch 'move_cosmic' into modify_cnv_report

e5e8e93

mathiasbio changed the base branch from develop to move_cosmic February 16, 2026 13:46

mathiasbio added 5 commits February 16, 2026 15:09

black

a645e34

fix

667fee0

Merge branch 'move_cosmic' of github.com:Clinical-Genomics/BALSAMIC i…

5f34010

…nto move_cosmic

Merge branch 'move_cosmic' into modify_cnv_report

14298f1

fix

4688850
