Conversation
fevac
left a comment
There was a problem hiding this comment.
A huge amount of work here but I think this still needs some work. I don't think we should add html code as such but use a library instead. Then I also think having interactive plots is nicer but that could be left for the future. There's a lot of code here and it's a bit difficult to review the main files in the script. I think for the future it would be better to get ask for reviews earlier in the process before adding so much complexity if the PR cannot split into multiple ones
| CURATED_CANCER_GENES: set[str] = { | ||
| "TP53", # <--- requested by cust087 | ||
| "DLEU1", # <--- requested by cust087 | ||
| "DLEU2", | ||
| "RB1", # <--- requested by cust087 | ||
| "KMT2D", | ||
| "KMT2A", | ||
| "ATM" # <--- requested by cust087 | ||
| } | ||
|
|
There was a problem hiding this comment.
this should be outside of the codebase for easy update
There was a problem hiding this comment.
can it be added to the cancer gene list?
There was a problem hiding this comment.
I think it should be it's own argument in that case, because that gene-list has a very particular format from oncokb. But yeah I think that would be ideal. Want me to move it to a file and open another issue in CG and servers?
There was a problem hiding this comment.
At the moment the logic of this is that all genes that are cancer-genes will be marked in the plots (if the gene is > [number of targets]) . That's the main point of this, and not all of cust087 genes of interest are in oncokb which is why this amendment is done. This wouldn't exclude other genes from being marked however, they can still be marked if they are > [number of targets] and are overlapping CNVs.
At least that's the logic I'm thinking now...but it might change...
| df: pd.DataFrame, | ||
| columns: Iterable[str], | ||
| decimals: int = 3, | ||
| inplace: bool = False, |
There was a problem hiding this comment.
- Why do you want to keep the original? is it needed? otherwise just remove this to simplify the code
|
|
||
| Missing columns are ignored, non-numeric values are safely coerced, and NaNs are preserved. | ||
| """ | ||
| target = df if inplace else df.copy() |
There was a problem hiding this comment.
you don't need to do this if you use the inplace param within pandas
| if not existing: | ||
| return target | ||
|
|
||
| # Ensure numeric (safe if already numeric, safe if NaN present) |
There was a problem hiding this comment.
why would we don't have numeric values? we should know the format of the file we're using. Is it mixed?
| # ============================================================================= | ||
| # Small helpers to reduce duplication and keep csv_to_html_table readable | ||
| # ============================================================================= |
There was a problem hiding this comment.
| # ============================================================================= | |
| # Small helpers to reduce duplication and keep csv_to_html_table readable | |
| # ============================================================================= |
we don't need that
| # ============================================================================= | ||
|
|
||
|
|
||
| def _png_to_data_uri(png_path: str | Path | None) -> str | None: |
There was a problem hiding this comment.
There's a mix of styles here. Balsamic usually used the Optional notation. I do like this style better (also used in the cg code) but it's good practice to stick to one style.
There was a problem hiding this comment.
I think we should minimise the amount of html coding and use standard libraries instead, to make it less prone to errors, more readable and more maintenable. In python I've used plotly and jinja2 in the past. It can also be done in R. There might be better libraries too
There was a problem hiding this comment.
similarly here, I would use plotly as you can make interactive plots instead o f static ones
…AMIC into modify_cnv_report
| import numpy as np | ||
| import pandas as pd | ||
| from pandas.errors import EmptyDataError | ||
| import fitz |
There was a problem hiding this comment.
it looks like this library isn't maintained
fevac
left a comment
There was a problem hiding this comment.
Hej, sorry but this code needs a lot of work and the logic also needs to be discussed. It's currently very messy and hard to understand. I lot of things could be simplified and re-structured. Let's go through it together.
Try to stick to the 1 function -> 1 task as much as possible. Also the logic is a little bit all over the place. Plotting functions have a lot of other logic in it, including parsing and renaming. I would say that in general you would like to:
- validate input files if this is needed
- parse files to generate ready to use intermediate files or other type of data
- plot
- report
Also there is a lot of decisions done in the plotting and reporting regarding data transformation and thresholds that it would be good to discuss. We try to do some pair programming and go through it together while we discuss
| CURATED_CANCER_GENES: set[str] = { | ||
| "TP53", # <--- requested by cust087 | ||
| "DLEU1", # <--- requested by cust087 | ||
| "DLEU2", | ||
| "RB1", # <--- requested by cust087 | ||
| "KMT2D", | ||
| "KMT2A", | ||
| "ATM", # <--- requested by cust087 | ||
| } |
There was a problem hiding this comment.
data should be separate from the codebase
| if chr_col not in cnr.columns or chr_col not in pon.columns: | ||
| raise ValueError(f"Both cnr and pon must contain chromosome column '{chr_col}'") |
There was a problem hiding this comment.
this should always be the case as we know the input data. So it's better to remove it for clarity.
| if spread_col not in pon.columns: | ||
| raise ValueError(f"PON is missing required column '{spread_col}'") |
There was a problem hiding this comment.
this should always be the case as we know the input data. So it's better to remove it for clarity.
| cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True) | ||
| pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True) |
There was a problem hiding this comment.
this should not be the case. As far as I see these files do not include chr so this just makes the code messier
| cnr[chr_col] = cnr[chr_col].astype(str).str.replace("^chr", "", regex=True) | ||
| pon[chr_col] = pon[chr_col].astype(str).str.replace("^chr", "", regex=True) | ||
|
|
||
| pon = pon.dropna(subset=[spread_col]).copy() |
There was a problem hiding this comment.
I don't think this should ever be the case, so better remove this line
| gchunk = gchunk.rename(columns=rename_map) | ||
|
|
||
| MIN_GENE_TARGETS = 5 | ||
| MIN_GENE_TARGETS_CANCER = 5 |
| # Rename columns for plotting: | ||
| rename_map = { | ||
| "cnvkit_seg_start": "seg_start", | ||
| "cnvkit_seg_end": "seg_end", | ||
| "cnvkit_seg_raw_log2": "seg_log2", | ||
| } | ||
| targets_col = "n.targets" | ||
|
|
||
| gdf = gdf.rename(columns=rename_map) | ||
| gchunk = gchunk.rename(columns=rename_map) |
|
|
||
| chr_order = _as_chr_order(include_y) | ||
|
|
||
| gdf = _normalize_is_cancer_gene(gdf) |
There was a problem hiding this comment.
this is not really normalizing is transforming into booleans
| # Spread filter (applied per-chr later as well); keep spread col always | ||
| if use_pon: | ||
| merged = merged[merged["spread"] <= spread_thresh].copy() |
| # ============================================================================= | ||
|
|
||
|
|
||
| def plot_chromosomes( |
There was a problem hiding this comment.
this should be only plotting not parsing and other logic. In general you would like to keep code organise, short and concise. Think about the single responsiblity principle: one function -> 1 task
| 1) Cancer genes are ALWAYS highlighted if they meet the cancer target threshold. | ||
| 2) If highlight_only_cancer is False, also highlight non-cancer genes with LOH/CNV | ||
| if they meet the non-cancer target threshold. |
There was a problem hiding this comment.
If I understand correctly and to make the logic easy to follow the logic only needs
gene_targets_threshold (num) and gene_in_list (bool)
| """ | ||
| Construct pseudo-position x coordinate using variable bin widths. | ||
|
|
||
| Width rules: | ||
| - Antitarget bins → anti_factor | ||
| - Highlighted gene bins → 1.0 | ||
| - Other target bins → neutral_target_factor | ||
|
|
||
| Adds: | ||
| type | ||
| bin_width | ||
| x_coord | ||
| """ |
There was a problem hiding this comment.
I don't understand this pseudoposition
| exon_map = load_refgene_exons(refgene_path) | ||
| genes_df = _add_exons_hit_column( | ||
| genes_df, | ||
| exon_map, |
There was a problem hiding this comment.
which transcript are you using? all? first? random?
|




Description
[PR description]
Added
Changed
Fixed
Removed
Documentation
Tests
Feature Tests
Pipeline Integrity Tests
.hkfile)Clinical Genomics Stockholm
Documentation
Panel of Normal specific criteria
User Changes
Infrastructure Changes
Validation criteria
Validation criteria to be added to validation report PR: [LINK-TO-VALIDATION-REPORT-PR from the validations repository]
Version specific criteria
Important
One of the below checkboxes for validation need to be checked
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.