Skip to content

A python package for analyzing variant calls from high throughput single cell genome sequencing experiments

License

Notifications You must be signed in to change notification settings

harrispopgen/cellspec

Repository files navigation

cellspec

Tests Documentation

A python package for analyzing variant calls from high throughput single cell genome sequencing experiments. Provides a convenient scanpy style API for loading joint calling vcf files into anndata objects, and performing downstream processing and analysis tasks, including:

  • coverage analysis and filtering
  • Annotation ancestral trinucleotide sequence context of SNPs
  • Computing trinucleotide mutation spectra
  • Visualizing mutation spectra

in development:

  • sequencing error / artifact correction
  • Mutation signature fitting and de novo signature discovery
  • Phylogenetic analysis
    • distance based
    • maximum liklihood
    • Bayseian
  • eQTL analysis (using genome-transcriptome coassay data)

Installation

Install the latest development version:

git clone https://github.com/harrispopgen/cellspec.git
cd cellspec
pip install -e .

Getting started

As an homage to the semi permeable capsule technology that spurred the need for this package, I encourage the following convention when importing cellspec:

import cellspec as spc

cellspec uses the {class}~anndata.AnnData class to store joint calling data.

:width: 500px

From the scanpy docs:

At the most basic level, an {class}~anndata.AnnData object adata stores a data matrix adata.X, annotation of observations adata.obs and variables adata.var as pd.DataFrame and unstructured annotation adata.uns as dict. Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively. {class}~anndata.AnnData objects can be sliced like dataframes, for example, adata_subset = adata[:, list_of_gene_names].

In cellspec, observations are cells (or samples), and variables are bi-allelic sites. Genotype calls from the vcf file are stored in adata.X, and depth information in adata.layers. Total read depth at each site in each observation is stored in adata.layers["DP"], and alternate allele read depth is stored in adata.layers["AD"].

To load a vcf into anndata:

adata = spc.pp.load_vcf(filename)

This initial step can take a somewhat long time, especially for datasets with a lot of alleles. As such, it a good idea to save your data in .h5ad format for more convenient loading in the future:

adata.write_h5ad(filename)

Please refer to the documentation and tutorials for more instruction, and the API documentation for information on specific functionality.

About

A python package for analyzing variant calls from high throughput single cell genome sequencing experiments

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors 2

  •  
  •  

Languages