bgsignature is a package used to compute signatures.
The most basic type of computation is counting the different k-mers (e.g. with k = 3 or 5). The counts can be computed for a set of mutations, for a set of regions, or for the set of mutations that fall within certain regions.
bgsignature consists of 3 tools:
- count: count different k-mers
- frequency: divide the counts by the total counts
- normalize: divide the counts by counts obtained separately and normalize the results.
Advanced features include:
- ability to group the counts (e.g. group mutations by sample)
- normalize the counts by the context taken from a regions file
- collapse (add together) reverse complementary sequences
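To make the ideas of counting and collapsing concrete, here is a minimal pure-Python sketch. It does not use bgsignature and is not its implementation: the function names (count_kmers, collapse, reverse_complement) are hypothetical, and it counts k-mers in a plain sequence rather than around mutations.

```python
from collections import Counter

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(sequence, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def collapse(counts):
    """Add together the counts of reverse complementary k-mers.

    Both members of each pair are kept and share the summed value.
    """
    collapsed = {}
    for kmer, n in counts.items():
        rc = reverse_complement(kmer)
        # A k-mer that is its own reverse complement is not counted twice.
        total = n if rc == kmer else n + counts.get(rc, 0)
        collapsed[kmer] = total
        collapsed[rc] = total
    return collapsed


counts = count_kmers("ACGTACGA", k=3)
collapsed = collapse(counts)
```

In this toy sequence, ACG appears twice and its reverse complement CGT once, so after collapsing both keys hold the value 3.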
This project is a Python package and can be installed with pip. Download the source code, change into the project directory and execute:
pip install .
The 3 tools can be called using:
- bgsignature count
- bgsignature frequency
- bgsignature normalize
Some examples:
Getting help:
bgsignature -h
bgsignature frequency -h
Counting triplets in mutations that fall in certain regions using hg38:
bgsignature count -m my/muts/file -r my/regions/file -g hg38 -o my/output.json --cores 4
Alternatively, the command line tools have equivalents in Python:
from bgsignature import count, relative_frequency, normalize
These functions accept similar parameters, except for the output. The returned object can be used as a dictionary.
If you already have your files loaded in Python, you can directly use the count function in the corresponding module. E.g.:
from bgsignature.count import mutation
mutation.count(mutations, 'hg38', 3)
In addition, you can also use the "low-level" functions that do the count (count_all and count_group), which are much simpler and do not perform any kind of parallelization. E.g.:
from bgsignature.count import mutation
mutation.count_all(mutations, 'hg38', 3)
# or to group mutations by sample
mutation.count_group(mutations, 'hg38', 3, 'SAMPLE')
The returned object can be normalized to 1 using the sum1() method, or divided by some normalization counts using the normalize() method.
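The arithmetic behind these two operations can be sketched with plain dictionaries. This is only an illustration under assumptions: sum_to_one and divide_by are hypothetical helpers, not the sum1() and normalize() methods themselves, and the numbers are made up.

```python
def sum_to_one(counts):
    """Scale the values so that they add up to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def divide_by(counts, normalization_counts):
    """Divide each count by the matching normalization count.

    Keys with a missing or zero normalization count are skipped.
    """
    return {
        key: value / normalization_counts[key]
        for key, value in counts.items()
        if normalization_counts.get(key)
    }


# Made-up counts: observed k-mers and the counts used for normalization.
observed = {"ACG": 30, "CCG": 10, "TTA": 20}
background = {"ACG": 300, "CCG": 50, "TTA": 400}

frequencies = sum_to_one(observed)
ratios = divide_by(observed, background)
normalized = sum_to_one(ratios)
```

Note how CCG has the smallest raw count but the largest normalized value, because it is rare in the normalization counts.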
There are some behavioural characteristics that must be taken into account:
- bgsignature filters out mutations whose reference nucleotide (as provided in the file) does not match the corresponding one in the reference genome.
- when using the collapse option (enabled by default), bgsignature does not remove one of the collapsed sequences but keeps both. This means that you need to manually remove the ones you are not interested in.
- when using the bgsignature.count.mutation.count or bgsignature.count.region.count function with a number of cores for parallelization, the chunk parameter must be chosen carefully, as it can have a huge impact on performance.
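Removing one member of each collapsed pair could look like the sketch below. It rests on assumptions that should be checked against your own output: keys are taken to be strings such as "ACT>G" (context, then the alternate base) or plain k-mers such as "ACT", and the choice to keep contexts with a pyrimidine (C or T) in the central position is just one common convention. The helper names are hypothetical.

```python
def reference_base(key):
    """Return the central base of the context part of a key.

    Assumed key formats: "ACT>G" for mutations, "ACT" for regions.
    """
    context = key.split(">")[0]
    return context[len(context) // 2]


def keep_pyrimidine(counts):
    """Keep only the keys whose central reference base is C or T."""
    return {
        key: value
        for key, value in counts.items()
        if reference_base(key) in "CT"
    }


# Made-up collapsed counts: each pair of reverse complementary keys
# holds the same summed value.
collapsed = {"ACT>G": 7, "AGT>C": 7, "TTA>C": 2, "TAA>G": 2}
filtered = keep_pyrimidine(collapsed)
```

Here AGT>C and TAA>G are dropped because their central bases (G and A) are purines, leaving one key per pair.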
The mutations file is a tab-separated file (optionally compressed in gz, bgz or xz format) with a header and at least these columns: CHROMOSOME, POSITION, REF, ALT. In addition, SAMPLE, CANCER_TYPE and SIGNATURE are optional columns that can be used for grouping the signature.
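If you want to load such a file yourself before calling the Python functions, a minimal sketch with the standard library follows. The read_mutations helper is hypothetical, the data is an in-memory example, and the exact structure that bgsignature expects for already-loaded mutations is not specified here, so treat the row format as an assumption.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ("CHROMOSOME", "POSITION", "REF", "ALT")


def read_mutations(handle):
    """Yield one dict per mutation from a tab-separated text handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    for row in reader:
        row["POSITION"] = int(row["POSITION"])
        yield row


# In-memory stand-in for a small mutations file with the optional SAMPLE column.
example = io.StringIO(
    "CHROMOSOME\tPOSITION\tREF\tALT\tSAMPLE\n"
    "1\t1000\tC\tT\ts1\n"
    "1\t2000\tG\tA\ts2\n"
    "2\t1500\tT\tC\ts1\n"
)

# Group the mutations by the optional SAMPLE column.
by_sample = defaultdict(list)
for mutation in read_mutations(example):
    by_sample[mutation.get("SAMPLE")].append(mutation)
```

For a compressed file, the handle could come from gzip.open(path, "rt") or lzma.open(path, "rt") instead of io.StringIO.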
The regions file is a tab-separated file (optionally compressed in gz, bgz or xz format) with a header and at least these columns: CHROMOSOME, START, END, ELEMENT. In addition, SYMBOL and SEGMENT are optional columns that can be used for grouping the signature.
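The idea of restricting mutations to those that fall within regions can be sketched as follows. This is not how bgsignature does it internally: the elements_at helper is hypothetical, the linear scan is only suitable for small inputs, and treating both START and END as inclusive is an assumption about the coordinate convention.

```python
import csv
import io

# In-memory stand-in for a small regions file.
regions_file = io.StringIO(
    "CHROMOSOME\tSTART\tEND\tELEMENT\n"
    "1\t900\t1100\tgeneA\n"
    "2\t1400\t1450\tgeneB\n"
)

regions = [
    (row["CHROMOSOME"], int(row["START"]), int(row["END"]), row["ELEMENT"])
    for row in csv.DictReader(regions_file, delimiter="\t")
]


def elements_at(chromosome, position):
    """Return the elements whose region contains the given position.

    Assumes START and END are both inclusive.
    """
    return [
        element
        for chrom, start, end, element in regions
        if chrom == chromosome and start <= position <= end
    ]


# Made-up mutations as (chromosome, position) pairs.
mutations = [("1", 1000), ("1", 2000), ("2", 1500)]
inside = [m for m in mutations if elements_at(*m)]
```

Only the mutation at position 1000 on chromosome 1 falls inside a region (geneA); the other two lie outside every listed interval.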
If you are having issues, please let us know. You can contact us at: bbglab@irbbarcelona.org