Skip to content

feat: add igv cnv files for WGS#1649

Open
mathiasbio wants to merge 34 commits intodevelopfrom
add_igv_cnv_files
Open

feat: add igv cnv files for WGS#1649
mathiasbio wants to merge 34 commits intodevelopfrom
add_igv_cnv_files

Conversation

@mathiasbio
Copy link
Collaborator

@mathiasbio mathiasbio commented Jan 28, 2026

Description

Add BAF and Log2 files from GATK to be stored in housekeeper and delivered to Caesar. See linked issue.

Added binning of 100-bp windows from GATK collect read-counts to reduce computational load when viewing in IGV (merging 5 bins by default)

Added

  • gnomad af 5 baf bedgraph for viewing in IGV
  • denoised gatk readcounts log2 bedgraph for viewing in IGV

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • docs/balsamic_sv_cnv.rst

Tests

Feature Tests

  • Test: WGS TN case finished successfully and produces the new files
  • Test: New files from WGS TN case can be stored in housekeeper (with associated hermes update) with the correct delivery tags:
    • [Screenshot]

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

Panel of Normal specific criteria

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Validation criteria

Validation criteria to be added to validation report PR: [LINK-TO-VALIDATION-REPORT-PR from the validations repository]

Version specific criteria

  • Text here or N/A

Important

One of the below checkboxes for validation need to be checked

  • Added version specific validation criteria to validation report
  • Changes validated in standard sections: [validation-section]
  • Validation criteria not necessary

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Validation criteria
    • Completed the validation criteria section of the template.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Validation criteria
    • The author has completed the validation criteria section of the template
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

@mathiasbio mathiasbio self-assigned this Jan 28, 2026
@mathiasbio mathiasbio linked an issue Jan 28, 2026 that may be closed by this pull request
3 tasks
@codecov
Copy link

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.38%. Comparing base (7d529e6) to head (32d9c88).
⚠️ Report is 154 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1649      +/-   ##
===========================================
- Coverage    99.48%   99.38%   -0.10%     
===========================================
  Files           40       40              
  Lines         1932     1967      +35     
===========================================
+ Hits          1922     1955      +33     
- Misses          10       12       +2     
Flag Coverage Δ
unittests 99.38% <100.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mathiasbio mathiasbio changed the base branch from master to develop January 28, 2026 14:29
Copy link
Contributor

@fevac fevac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry! I accidentally started to review this thinking it was the other PR. So I won't continue for now

@mathiasbio mathiasbio marked this pull request as ready for review February 9, 2026 11:55
@mathiasbio mathiasbio requested a review from a team as a code owner February 9, 2026 11:55
Copy link

@beatrizsavinhas beatrizsavinhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unfortunately don't fully understand the context around generating these files from reading the PR/user story description, so I can't really comment on logic or parameters used...
But I added some comments on the code and suggestions for better readability!

If we meet to discuss the PR, maybe I can give a more complete and meaningful review! It would also be important to look at the test results.

Comment on lines 14 to 19
def open_output(path: str | Path) -> ContextManager[TextIO]:
"""Open output file; '-' means stdout (not closed)."""
p = str(path)
if p == "-":
return nullcontext(sys.stdout)
return open(p, "w", encoding="utf-8")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function necessary? 🤔
I believe you can just call open(file_path, "w", encoding="utf-8") even if the file is already open and the the path is of type Path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to handle the possibility of not providing an output path but just piping output to standard out. But I don't think it's necessary since this is a script to be used in a pipeline with a predictable input and output structure! We don't need flexibility, so I'll remove! 🙏

Comment on lines 56 to 62
vcf_path: str | Path,
bedgraph_path: str | Path,
track_name: Optional[str] = None,
) -> int:
"""Convert a VCF into a bedGraph of AF computed from AD/DP."""
n_written = 0
with closing(VCF(str(vcf_path))) as vcf, open_output(bedgraph_path) as fout:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cli function will call this with vcf_path and bedgraph_path as Path, right? Then there is no need to have ambiguous types here.
See also comment above on open_output.

Suggested change
vcf_path: str | Path,
bedgraph_path: str | Path,
track_name: Optional[str] = None,
) -> int:
"""Convert a VCF into a bedGraph of AF computed from AD/DP."""
n_written = 0
with closing(VCF(str(vcf_path))) as vcf, open_output(bedgraph_path) as fout:
vcf_path: Path,
bedgraph_path: Path,
track_name: Optional[str] = None,
) -> int:
"""Convert a VCF into a bedGraph of AF computed from AD/DP."""
n_written = 0
with closing(VCF(vcf_path.as_posix()) as vcf, open_output(bedgraph_path) as fout:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! It should only be string. I've been confused about this so many times, but I think the type=click.Path in the argument is only used to validate the input, but what you actually get out is a string. I'll change the types to string

Comment on lines 67 to 71
record = variant_to_record(variant)
if record == None:
continue
fout.write(record)
n_written += 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above on raising an exception for variant_to_record. It is also possible to print out a warning for the exception if that would be of any use.

Suggested change
record = variant_to_record(variant)
if record == None:
continue
fout.write(record)
n_written += 1
try:
record = variant_to_record(variant)
fout.write(record)
n_written += 1
except:
continue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variants into this script are a bit unconventional in that they are forced calls on common gnomad population variants, and only used for visualising the allele frequencies in IGV (you can look at the linked issue to find a screenshot). Maybe it could be interesting to log however how many total variants were skipped, I'll see if I can add that!

plot_tumor = vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".ascat.tumor.png",
plot_germline = vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".ascat.germline.png",
plot_sunrise = vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".ascat.sunrise.png",
namemap = vcf_dir + "CNV.somatic." + config["analysis"]["case_id"] + ".ascat.sample_name_map",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary for this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! Don't even know how it happened 😂 either way it doesn't matter. But I'll return the comma.

],
"varcall": [
"snakemake_rules/variant_calling/germline_wgs.rule",
"snakemake_rules/variant_calling/igv_files.rule",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The User story mentions only "tumor normal matched WGS analyses". Is this to be applied for TO too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I'll update the userstory, I think the specific case Teresita referred to was tumor+normal, but it should be applicable to tumor only as well

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the test still refers to tga and missing_gens. If it was your intention to change the whole test, could you update the test name too?

Also, what is the reasoning in adding the panel_bed_file parameter now? 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember what the issue was with the test. But it failed, and when I looked at it I saw that it wasn't configured correctly. It was supposed to test running tga workflow without GENS input arguments, but it wasn't provided a panel-bed-file which is what configures the workflow as tga. So I just cleaned it up in passing.

Out of context but a small fix

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same general advice as above: set explanatory names for variables and avoid ambiguous function returns.

Comment on lines 101 to 102
if bins_per_window <= 0:
raise click.ClickException("--bin-size must be a positive integer")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative with automatic error printing for this is using type=click.IntRange(0, <max>)).

Copy link
Collaborator Author

@mathiasbio mathiasbio Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! It would feel sort of arbitrary to put a maximum value here though, and I'm fine with the current solution

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[User Story] Delivery of additional CNV and BAF files to Caesar

3 participants