A collection of Python modules and data analysis pipelines, all related to DNA sequence design.

e-south/dnadesign

dnadesign

dnadesign is a collection of modular bioinformatics pipelines and helper packages for biological sequence design.

Contents

  1. Directory layout
  2. Documentation
  3. Available tools

Directory layout

dnadesign/
├── README.md           # High-level project documentation
├── pyproject.toml
├── uv.lock
└── src/
    └── dnadesign/
        ├── permuter/    # in silico deep mutational scanning
        ├── infer/       # model-agnostic inference (Evo2 adapter)
        ├── densegen/    # string-packing nucleic acid assembly
        ├── opal/        # active-learning engine
        └── ...

Documentation

  1. Installation
  2. Quickstart marimo notebooks
  3. Maintaining dependencies
  4. CUDA/GPU install notes (BU SCC)
  5. Marimo reference

Available tools

  1. usr (Universal Sequence Record)

    Utility commands for inspecting the datasets and Parquet files used across the dnadesign project.

  2. densegen

    DNA sequence design pipeline built on the integer linear programming framework from the dense-arrays package.

  3. infer

    Model-agnostic wrapper for DNA/protein language models (e.g., Evo2).

  4. opal

    An EVOLVEpro-style active-learning tool for DNA/protein sequence design campaigns.

  5. cluster

    A Parquet/CSV-first tool for Leiden clustering, UMAP visualisation, and related analyses.

  6. billboard

    Quantifies the regulatory diversity of dense-array DNA libraries generated by densegen.

  7. libshuffle

    Iteratively subsamples sequence libraries from the sibling sequences directory and computes diversity metrics using the billboard pipeline as its engine.

  8. nmf

    Applies Non-Negative Matrix Factorization (NMF) to a library of sequences generated by densegen to uncover higher-order transcription factor binding site combinations.

  9. latdna

    Pipeline for latent space analysis of DNA sequences.

  10. cruncher

    Pipeline that parses transcription factor (TF) position weight matrices (MEME, JASPAR, etc.) via plug-in parsers, then runs a discrete categorical Gibbs optimiser (or another plug-in optimiser) to discover short DNA sequences that score highly for one or more TFs.

  11. tfkdanalysis

    Pipeline for analyzing transcription factor knockdown (TFKD) effects using data from PPTP-seq (promoter responses to TF perturbation sequencing), a high-throughput approach described in Han et al. (2023).

  12. aligner

    Wrapper around Biopython's PairwiseAligner for computing Needleman–Wunsch global alignment scores between nucleotide sequences.

  13. permuter

    Pipeline for biological sequence permutation and subsequent evaluation workflows.

  14. archived

    Contains legacy projects and prototypes.
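
The cruncher entry above describes scoring short DNA sequences against TF position weight matrices. As a rough illustration of that scoring step only (not the Gibbs optimiser, and not cruncher's actual API), the sketch below slides a toy matrix along a sequence and reports the best log-odds window. The function name `best_pwm_hit`, the uniform background of 0.25, and the toy matrix are all assumptions made for this example.

```python
import math


def best_pwm_hit(seq, pwm, background=0.25):
    """Return (best log-odds score, start index) over all windows of seq.

    pwm is a list of dicts mapping base -> probability, one dict per motif
    position. Probabilities must be non-zero (no pseudocounts are applied).
    """
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    best_score, best_start = -math.inf, -1
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Sum of per-position log2 odds against a uniform background.
        score = sum(
            math.log2(column[base] / background)
            for column, base in zip(pwm, window)
        )
        if score > best_score:  # strict '>' keeps the leftmost best window
            best_score, best_start = score, start
    return best_score, best_start


# Hypothetical 3-bp motif with consensus ATG (illustrative values only).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
score, start = best_pwm_hit("CCATGCC", toy_pwm)
```

An optimiser would call a scorer like this repeatedly while proposing sequence changes. Real matrices parsed from MEME or JASPAR files would typically need pseudocounts before taking logarithms.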

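The aligner entry above refers to Needleman–Wunsch global alignment scores. For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of the score computation with a linear gap penalty. It does not use Biopython and does not reflect the aligner module's interface. The function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score of a and b under a linear gap penalty."""
    # prev[j] is the best score for aligning the prefix of a processed
    # so far against b[:j]; the first row is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, base_a in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, base_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if base_a == base_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = needleman_wunsch_score("ACGT", "ACGT")
one_deletion = needleman_wunsch_score("ACGT", "ACT")
```

Only two rows of the dynamic-programming table are kept, which is enough when the score is needed but the alignment itself is not.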

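The directory layout describes permuter as in silico deep mutational scanning. One simple form of such a scan enumerates every single-base substitution of a reference sequence, which can be sketched as follows. The generator name and its tuple layout are hypothetical and are not taken from permuter itself.

```python
def single_substitution_variants(seq, alphabet="ACGT"):
    """Yield (position, ref, alt, variant) for every single-base substitution."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                # Replace exactly one base, leaving the rest untouched.
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


variants = list(single_substitution_variants("ACG"))
```

A sequence of length n over a four-letter alphabet yields 3n variants, each of which could then be passed to a scoring or evaluation step.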