Merged
61 commits
6fadbc1
minimize memory footprint
Juke34 Mar 13, 2026
11b5ffc
replace BCBio by parse_gff3_streaming to process line by and get rid …
Juke34 Mar 16, 2026
4c26975
fix bug when paired-end where strand is *. We count alignment both st…
Juke34 Mar 16, 2026
2f6d435
Revert "replace BCBio by parse_gff3_streaming to process line by and …
Juke34 Mar 16, 2026
facf65e
add gc and possibility to filter featurs
Juke34 Mar 16, 2026
85446f4
space to re-run
Juke34 Mar 16, 2026
0825ddc
handle NA correctly
Juke34 Mar 16, 2026
6db54c8
centralize env for container building
Juke34 Mar 16, 2026
dcde2d2
centralize env for container building
Juke34 Mar 16, 2026
37faa0d
use pickle to write in temporary file instead to keep data in memory …
Juke34 Mar 17, 2026
54af45e
use dict of SeqID to provide to BCbio gff that will load only the seq…
Juke34 Mar 17, 2026
2a57b5a
change ressources for pluviometer
Juke34 Mar 17, 2026
db67f96
avoid DtypeWarning: Columns (0: SeqID, 1: Start, 2: End, 3: Strand) …
Juke34 Mar 17, 2026
5c75770
decrease memory - process sequentially by base pair
Juke34 Mar 17, 2026
794c40b
remove extension chimaera from AggregateType
Juke34 Mar 17, 2026
b5de7eb
parallelize drip.py
Juke34 Mar 18, 2026
c14da8a
BLAS was using nb CPU out of scope of --threads (even before we paral…
Juke34 Mar 18, 2026
7c090a8
round value to minimize file
Juke34 Mar 18, 2026
95a6564
round when computing instead of writing to save mermory while merging
Juke34 Mar 18, 2026
436c994
update info
Juke34 Mar 18, 2026
f418b56
increase drip CPU
Juke34 Mar 18, 2026
47ca6e8
add coverage parameter set to 10 by default
Juke34 Mar 18, 2026
e65bf69
fix call for threads
Juke34 Mar 19, 2026
081d6ba
update output and AliNe verssion
Juke34 Mar 19, 2026
922804e
mend
Juke34 Mar 19, 2026
753fe46
add debug, fix first feature position to activate feature via state_u…
Juke34 Mar 19, 2026
2bca65e
CoveredSites becomes ObservedSites. We add now TotalSites column, Qua…
Juke34 Mar 20, 2026
68895d9
GenomesBases becomes ObservedBases and we add QualifiedBases
Juke34 Mar 20, 2026
1eb0bd8
add tests
Juke34 Mar 20, 2026
7e08058
add SiteBasePairingsQualified ReadBasePairingsQualified
Juke34 Mar 20, 2026
e4d3805
skip standardize step and do it directly within pluviometer
Juke34 Mar 20, 2026
8b350c3
fix tiny things
Juke34 Mar 20, 2026
1142ff8
skip NA only lines
Juke34 Mar 20, 2026
c7ebec2
fi report only qualified features
Juke34 Mar 20, 2026
87562e4
just to re-run this part
Juke34 Mar 20, 2026
17f0900
change chr by sequence to be more inclusive
Juke34 Mar 20, 2026
c230b7d
add space to re-run this step
Juke34 Mar 20, 2026
3db00a7
remove row if value are only 0.0 and or NA
Juke34 Mar 20, 2026
65bc84d
add ressource to reditools3
Juke34 Mar 21, 2026
e13b448
try to catch end of reditools3 not well catched
Juke34 Mar 22, 2026
f9bdb26
back to previous
Juke34 Mar 22, 2026
80294f3
add calmd to fix deteriorated MD tags
Juke34 Mar 22, 2026
0313348
try to decrease memory footprint
Juke34 Mar 22, 2026
ab91c90
try to improve RAM
Juke34 Mar 23, 2026
c567507
Improve RAM usage by spliting by seqid
Juke34 Mar 23, 2026
ca33d34
add --min-samples-pct --min-group-pct to filter rows
Juke34 Mar 23, 2026
000f874
add filter in sample annd group minimal present
Juke34 Mar 23, 2026
63b136a
output one folder by value type computed espr espf
Juke34 Mar 23, 2026
9387189
separate espf and espr
Juke34 Mar 23, 2026
85dc09e
fix publisdir
Juke34 Mar 23, 2026
386d97e
fix path outpu
Juke34 Mar 23, 2026
43b3bde
better output naming
Juke34 Mar 23, 2026
e2f571d
to rerun
Juke34 Mar 23, 2026
0257b09
create an ID for convenience for sequence and global bmks
Juke34 Mar 23, 2026
c3d0a51
fix to avoid slowlyness when unstranded
Juke34 Mar 24, 2026
25629c0
fix 2 must be interpreted as unstranded
Juke34 Mar 24, 2026
c439441
serealize by editing tool analyzer and fix strand for reditools
Juke34 Mar 24, 2026
e305c31
add barometer scripts for last steps
Juke34 Mar 24, 2026
1ba2ebc
fix certiicate for download
Juke34 Mar 24, 2026
ac97905
standardize and fix containers
Juke34 Mar 24, 2026
7263e8a
standardize for singu
Juke34 Mar 24, 2026
39 changes: 30 additions & 9 deletions README.md
@@ -221,13 +221,14 @@ The two output formats are tables of comma-separated values with a header.
| Start | Positive integer | Starting position of the feature (inclusive) |
| End | Positive integer | Ending position of the feature (inclusive) |
| Strand | `1` or `-1` | Whether the feature is located on the positive (5'->3') or negative (3'->5') strand |
| CoveredSites | Positive integer | Number of sites in the feature that satisfy the minimum level of coverage |
| GenomeBases | Comma-separated positive integers | Frequencies of the bases in the feature in the reference genome (order: A, C, G, T) |
| SiteBasePairings | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| ReadBasePairings | Comma-separated positive integers | Frequencies of genome-variant base pairings in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| TotalSites | Positive integer | Number of sites in the feature |
| ObservedBases | Comma-separated positive integers | Counts of each reference-genome base (order: A, C, G, T) at the observed sites of the feature. The sum of the four values is the total number of observed sites (those reported by the editing tool, e.g. Reditools3) |
| QualifiedBases | Comma-separated positive integers | Counts of each reference-genome base (order: A, C, G, T) at the sites of the feature that satisfy the minimum level of coverage and editing. The sum of the four values is the total number of qualified sites (> cov) |
| SiteBasePairingsQualified | Comma-separated positive integers | Site-level counts: number of sites in the feature at which each genome-variant base pairing is found, restricted to sites that satisfy the minimum level of coverage and editing (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| ReadBasePairingsQualified | Comma-separated positive integers | Read-level counts of each genome-variant base pairing in the feature, restricted to sites that satisfy the minimum level of coverage and editing (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |

> [!note]
> The number of **CoveredSites** can be higher than the sum of **SiteBasePairings** because of the presence of ambiguous bases (e.g. N)
> The sum of **QualifiedBases** can differ from the sum of the AA, CC, GG and TT values of **SiteBasePairingsQualified**, because a site that is 100% edited does not fall into any of these four categories.

An example of the feature output format is shown below, with some alterations to make the text line up in columns.

@@ -275,10 +276,11 @@ This hierarchical information is provided in the same manner in the aggregate fi
| ParentType | String | Type of the parent of the feature under which the aggregation was done |
| AggregateType | String | Type of the features that are aggregated |
| AggregationMode | `all_isoforms`, `longest_isoform`, `chimaera`, `feature` or `all-sites` | Way in which the aggregation was performed |
| CoveredSites | Positive integer | Number of sites in the aggregated features that satisfy the minimum level of coverage |
| GenomeBases | Comma-separated positive integers | Frequencies of the bases in the aggregated features in the reference genome (order: A, C, G, T) |
| SiteBasePairings | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| ReadBasePairings | Comma-separated positive integers | Frequencies of genome-variant base pairings in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| TotalSites | Positive integer | Number of sites in the aggregated features |
| ObservedBases | Comma-separated positive integers | Counts of each reference-genome base (order: A, C, G, T) at the observed sites of the aggregated features. The sum of the four values is the total number of observed sites (those reported by the editing tool, e.g. Reditools3) |
| QualifiedBases | Comma-separated positive integers | Counts of each reference-genome base (order: A, C, G, T) at the sites of the aggregated features that satisfy the minimum level of coverage and editing. The sum of the four values is the total number of qualified sites (> cov) |
| SiteBasePairingsQualified | Comma-separated positive integers | Site-level counts: number of sites in the aggregated features at which each genome-variant base pairing is found, restricted to sites that satisfy the minimum level of coverage and editing (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |
| ReadBasePairingsQualified | Comma-separated positive integers | Read-level counts of each genome-variant base pairing in the aggregated features, restricted to sites that satisfy the minimum level of coverage and editing (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) |

In the output of Pluviometer, **aggregation** is the sum of counts from several features of the same type at some feature level. For instance, exons can be aggregated at transcript level, gene level, chromosome level, and genome level.
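
As a minimal sketch of what this summation means for the comma-separated count columns (for example **ObservedBases**), the snippet below adds such vectors element-wise. The helper name `sum_count_vectors` and the example values are hypothetical and only illustrate the idea; this is not Pluviometer's actual implementation.

```python
def sum_count_vectors(vectors):
    """Element-wise sum of comma-separated count strings such as '3,0,1,2'."""
    parsed = [[int(x) for x in v.split(",")] for v in vectors]
    if not parsed:
        return ""
    if len({len(p) for p in parsed}) != 1:
        raise ValueError("all count vectors must have the same length")
    # zip(*parsed) groups the i-th count of every feature together
    return ",".join(str(sum(column)) for column in zip(*parsed))


# Hypothetical ObservedBases values (order: A, C, G, T) for three exons
exon_observed_bases = ["10,4,6,8", "2,0,1,3", "5,5,5,5"]

# Aggregating the three exons at transcript level
print(sum_count_vectors(exon_observed_bases))  # prints 17,9,12,16
```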

@@ -344,6 +346,21 @@ $$
AG\ editing\ level = \sum_{i=0}^{n} \dfrac{AG_i}{AA_i + AC_i + AG_i + AT_i}
$$


## Drip

### espf (edited sites proportion in feature):

denom_espf = df[f'{genome_base}_count'] # X_QualifiedBases (e.g. C_count)
df[espf_col] = df[f'{bp}_sites'] / denom_espf # XY_SiteBasePairingsQualified / X_QualifiedBases

### espr (edited sites proportion in reads):

df[total_reads_col] = XA_reads + XC_reads + XG_reads + XT_reads # all reads at X positions
df[espr_col] = df[f'{bp}_reads'] / df[total_reads_col] # XY_reads / sum(X*_reads)

Drip retains a line only if at least one metric value is neither NA nor zero (i.e., at least one edit has been detected somewhere). Lines containing only NA values, only 0.0 values, or a mix of both are removed by default.
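
The two ratios and the row filter can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code of `drip.py`: the column suffixes `_count`, `_sites` and `_reads` follow the snippets above, while the `_espf`/`_espr` output names, the helper functions and the example values are hypothetical. Zero denominators are turned into NA here so that no division by zero occurs.

```python
import pandas as pd


def add_espf_espr(df, genome_base, variant_base):
    """Add espf and espr columns for one genome-variant base pairing (e.g. C -> T)."""
    bp = f"{genome_base}{variant_base}"
    out = df.copy()

    # espf: sites showing the pairing / qualified sites whose reference base is genome_base
    qualified = out[f"{genome_base}_count"]
    out[f"{bp}_espf"] = out[f"{bp}_sites"] / qualified.where(qualified > 0)

    # espr: reads showing the pairing / all reads at genome_base positions
    total_reads = sum(out[f"{genome_base}{b}_reads"] for b in "ACGT")
    out[f"{bp}_espr"] = out[f"{bp}_reads"] / total_reads.where(total_reads > 0)
    return out


def drop_uninformative_rows(df, metric_cols):
    """Keep rows where at least one metric is neither NA nor zero."""
    informative = df[metric_cols].fillna(0).ne(0).any(axis=1)
    return df[informative]


# Hypothetical counts for three features
counts = pd.DataFrame({
    "C_count":  [4, 0, 5],
    "CT_sites": [1, 0, 0],
    "CA_reads": [0, 0, 0],
    "CC_reads": [30, 0, 50],
    "CG_reads": [0, 0, 0],
    "CT_reads": [10, 0, 0],
})

metrics = add_espf_espr(counts, "C", "T")
kept = drop_uninformative_rows(metrics, ["CT_espf", "CT_espr"])
print(kept[["CT_espf", "CT_espr"]])
```

In this example only the first feature is retained: the second has only NA metrics (no qualified C site) and the third has only 0.0 values.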

</details>


@@ -355,3 +372,7 @@ Jacques Dainat (@Juke34)
## Contributing

Contributions from the community are welcome! See the [Contributing guidelines](https://github.com/Juke34/rain/blob/main/CONTRIBUTING.md).

## TODO

Update pluviometer to write `NA` instead of `.` for Start, End and Strand, so that these columns can be read as `Int64` in drip and barometer, e.g. `dtype={"SeqID": str, "Start": "Int64", "End": "Int64", "Strand": str}`.
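
A small sketch of why `NA` helps, assuming the table is read with `pandas.read_csv`: the nullable `Int64` dtype accepts missing values written as `NA`, whereas a literal `.` cannot be converted to an integer. The inline CSV content is made up for the illustration.

```python
import io

import pandas as pd

# Hypothetical pluviometer-like rows; the second row has no coordinates
csv_text = (
    "SeqID,Start,End,Strand\n"
    "chr1,100,200,1\n"
    "chr1,NA,NA,NA\n"
)

dtypes = {"SeqID": str, "Start": "Int64", "End": "Int64", "Strand": str}
table = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Start and End stay integer columns; the missing coordinates are <NA>
print(table.dtypes)
print(table)
```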
12 changes: 2 additions & 10 deletions bin/README
@@ -20,20 +20,12 @@ python -m pluviometer --sites SITES --gff GFF [OPTIONS]
python pluviometer_wrapper.py --sites SITES --gff GFF [OPTIONS]
```

### drip_features.py
### drip.py
Post-processing tool for pluviometer feature output. Analyzes RNA editing from feature TSV files, calculating editing metrics (espf and espr) for all 16 genome-variant base pair combinations across multiple samples. Combines data into unified matrix format.

**Usage:**
```bash
./drip_features.py OUTPUT_PREFIX FILE1:SAMPLE1 FILE2:SAMPLE2 [...]
```

### drip_aggregates.py
Post-processing tool for pluviometer aggregate output. Similar to drip_features.py but operates on aggregate-level data, calculating editing metrics for aggregated genomic regions across samples.

**Usage:**
```bash
./drip_aggregates.py OUTPUT_PREFIX FILE1:SAMPLE1 FILE2:SAMPLE2 [...]
./drip.py OUTPUT_PREFIX FILE1:GROUP1:SAMPLE1:REPLICATE1 FILE2:GROUP1:SAMPLE2:REPLICATE1 [...]
```
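
For illustration, one way to split a `FILE:GROUP:SAMPLE:REPLICATE` argument into its four fields is sketched below. The `DripInput` type and `parse_input_spec` function are hypothetical names and do not describe how `drip.py` parses its arguments; splitting from the right is a design choice of this sketch so that a colon inside the file path is preserved.

```python
from typing import NamedTuple


class DripInput(NamedTuple):
    path: str
    group: str
    sample: str
    replicate: str


def parse_input_spec(spec):
    """Split 'FILE:GROUP:SAMPLE:REPLICATE' into a DripInput."""
    # Split from the right so that only the last three colons act as separators
    parts = spec.rsplit(":", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected FILE:GROUP:SAMPLE:REPLICATE, got {spec!r}"
        )
    return DripInput(*parts)


print(parse_input_spec("features.tsv:control:sample1:rep1"))
```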

### restore_sequences.py