Skip to content

d4tools stat gives occasionally very large (incorrect) numbers #113

@Jakob37

Description

@Jakob37

We found this when working with long reads data, where coverage bins tend to be longer than for short reads data. We use d4tools to calculate bin-by-bin coverage. Here, frequently, certain mean coverage values became very large.

Example

$ ./d4tools stat --region 100bp_bins_more.bed example.d4 
chr1    54000   54100   33.05
chr1    54100   54200   32.51
chr1    54200   54300   3083160.33
chr1    54300   54400   3083129.18
chr1    54400   54500   30.16
chr1    54500   54600   30
chr1    54600   54700   30
chr1    54700   54800   30
chr1    54800   54900   29.06
chr1    54900   55000   29.54

Underlying data

$ ./d4tools view --region-file 100bp_bins_more.bed example.d4 
chr1    54000   54005   34
chr1    54005   54100   33
chr1    54100   54151   33
chr1    54151   54200   32
chr1    54200   54215   32
chr1    54215   54300   31
chr1    54300   54400   31
chr1    54400   54416   31
chr1    54416   54500   30
chr1    54500   54600   30
chr1    54600   54700   30
chr1    54700   54800   30
chr1    54800   54806   30
chr1    54806   54900   29
chr1    54900   54946   29
chr1    54946   55000   30

It looks like it comes from here:

scan_partition_impl(handles, |part_left, part_right, active_handles| {
if let Some(default_value) = default_primary_value {
let iter = self.secondary.seek_iter(part_left);
let mut last_right = part_left;
for (mut left, mut right, value) in iter {
left = left.max(part_left);
right = right.min(part_right).max(left);
for handle in active_handles.iter_mut() {
if last_right < left {
handle.feed_rows(last_right, left, &mut std::iter::once(default_value));
}
handle.feed_rows(left, right, &mut std::iter::once(value));
}
last_right = right;
if right == part_right {
break;
}
}
if last_right < part_right {
for handle in active_handles.iter_mut() {
handle.feed_rows(
last_right,
part_right,
&mut std::iter::once(default_value),
);
}
}

Under certain conditions it loops indefinitely, exhausting the iterator of ranges rather than stopping when done with the requested range.

I have tried just breaking early when the left part of the coverage region is outside the part of interest, and it seems to work fine, i.e. by adding inside the loop:

if left >= part_right {
  break;
}

Existing tests run fine with this added. I will see if I can trigger this error in a minimal test data case and come back with a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions