26 changes: 26 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,26 @@
name: documentation

on: [push, pull_request, workflow_dispatch]

permissions:
  contents: write

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - name: Install dependencies
        run: |
          pip install sphinx sphinx_rtd_theme sphinxcontrib-jquery myst_parser
      - name: Sphinx build
        run: |
          sphinx-build docs _build
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: _build/
          force_orphan: true
1 change: 1 addition & 0 deletions docs/.gitignore
@@ -0,0 +1 @@
_build
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_static/img/scarap_logo.png
3 changes: 3 additions & 0 deletions docs/_static/js/custom.js
@@ -0,0 +1,3 @@
$(document).ready(function () {
    $('a.external').attr('target', '_blank');
});
6 changes: 6 additions & 0 deletions docs/citation.rst
@@ -0,0 +1,6 @@
Citation
--------

If you've found SCARAP useful in your own work, please cite the following manuscript: ::

    Wittouck et al. (2025) "SCARAP: scalable cross-species comparative genomics of prokaryotes", Bioinformatics, Volume 41, Issue 1, January 2025, btae735, https://doi.org/10.1093/bioinformatics/btae735
40 changes: 40 additions & 0 deletions docs/conf.py
@@ -0,0 +1,40 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'SCARAP'
copyright = '2024, Stijn Wittouck'
author = 'Stijn Wittouck'
release = '0.4.0'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    'sphinx.ext.napoleon',
    'sphinx.ext.autosummary',
]

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# Path setup
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
html_js_files = [
    'js/custom.js',
]
6 changes: 6 additions & 0 deletions docs/feedback.rst
@@ -0,0 +1,6 @@
Feedback
--------

All feedback and suggestions are very welcome at
stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file
`issues <https://github.com/SWittouck/scarap/issues>`__.
110 changes: 110 additions & 0 deletions docs/getting-started.rst
@@ -0,0 +1,110 @@
Getting started
===============

Obtaining data
--------------

SCARAP works mainly with faa files: amino acid sequences of all (predicted) genes in a genome assembly. You can obtain faa files in at least three ways:

* You can run a gene prediction tool like `Prodigal <https://github.com/hyattpd/Prodigal>`_ on genome assemblies of your favorite strains, or a complete annotation pipeline such as `Prokka <https://github.com/tseemann/prokka>`_ or `Bakta <https://github.com/oschwengers/bakta>`_.
* You can search your favorite taxon on `NCBI genome <https://www.ncbi.nlm.nih.gov/datasets/genome/>`_ and manually download assemblies in the following way: click on an assembly, click "Download", select "Protein (FASTA)" as file type and click "Download" again.
* Given a list of assembly accession numbers (i.e. starting with GCA/GCF), you can use `ncbi-genome-download <https://github.com/kblin/ncbi-genome-download/>`_ to download the corresponding faa files.

Given a list of accessions in a file called ``accessions.txt``, you can use ncbi-genome-download to download faa files as follows: ::

    ncbi-genome-download -P \
      --assembly-accessions accessions.txt \
      --section genbank \
      --formats protein-fasta \
      bacteria
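
Before starting a large download, it can help to check that every line in ``accessions.txt`` looks like an assembly accession. The sketch below is only an illustration and is not part of SCARAP or ncbi-genome-download: the helper name ``split_accessions`` is hypothetical, and the pattern assumes versioned accessions of the usual shape (``GCA_`` or ``GCF_``, nine digits, a dot and a version number).

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot and a version number.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def split_accessions(lines):
    """Split lines into (valid, invalid) accessions, skipping blank lines."""
    valid, invalid = [], []
    for line in lines:
        accession = line.strip()
        if not accession:
            continue
        if ACCESSION_PATTERN.match(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid


if __name__ == "__main__":
    # Illustrative strings; in practice, pass the lines of accessions.txt.
    example = ["GCA_000005845.2", "GCF_000009045.1", "", "not-an-accession"]
    valid, invalid = split_accessions(example)
    print("valid:", valid)
    print("invalid:", invalid)
```

To check a real list, the lines of the file can be passed directly, e.g. ``split_accessions(open("accessions.txt"))``.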

Inferring a pangenome
---------------------

If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder ``faas``, you can infer the pangenome using 16 threads by running: ::

    scarap pan ./faas ./pan -t 16

The pangenome will be stored in ``pan/pangenome.tsv``, in a "long format": a table with the columns gene, genome and orthogroup.
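
A long-format table like this is easy to summarize with a few lines of Python. The sketch below is an illustration under stated assumptions, not part of SCARAP: it assumes a tab-separated file without a header row and with exactly three columns in the order gene, genome, orthogroup, and the function names are hypothetical. It lists the orthogroups that occur in every genome, which is a simpler criterion than the one used by the ``core`` module.

```python
import csv
from collections import defaultdict


def genomes_per_orthogroup(rows):
    """Map each orthogroup to the set of genomes carrying at least one of its genes.

    Each row is assumed to be (gene, genome, orthogroup), without a header row.
    """
    genomes = defaultdict(set)
    for gene, genome, orthogroup in rows:
        genomes[orthogroup].add(genome)
    return dict(genomes)


def orthogroups_in_all_genomes(rows):
    """Return the sorted orthogroups that are present in every genome."""
    per_orthogroup = genomes_per_orthogroup(rows)
    all_genomes = set().union(*per_orthogroup.values())
    return sorted(og for og, gs in per_orthogroup.items() if gs == all_genomes)


def read_pangenome(path):
    """Read a tab-separated pangenome table into a list of tuples."""
    with open(path, newline="") as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter="\t") if row]


if __name__ == "__main__":
    # Tiny made-up table; replace with read_pangenome("pan/pangenome.tsv").
    rows = [
        ("g1_0001", "genome1", "og1"),
        ("g2_0001", "genome2", "og1"),
        ("g1_0002", "genome1", "og2"),
    ]
    print(orthogroups_in_all_genomes(rows))
```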

Inferring a core genome
-----------------------

You can also infer the core genome of a set of genomes directly, without inferring the full pangenome first. This is faster, and the core genome is sometimes all you need (e.g. when you are planning to infer a phylogeny).

You can infer the core genome, given a set of faa files in a folder ``faas``, in the following way: ::

    scarap core ./faas ./core -t 16

The core genome will be stored in ``core/genes.tsv``.

Subsampling a set of genomes
----------------------------

If you have a (large) dataset of genomes that you wish to subsample in a
representative way, you can do this using the ``sample`` module. You
will need to precompute the pangenome or core genome to do this; SCARAP
calculates average amino acid identity (AAI) or core amino acid identity
(cAAI) values in the subsampling process, and it uses the single-copy
orthogroups from a pan- or core genome to do this.

For example, if you want to sample 100 genomes given a set of faa files
in a folder ``faas``:

::

    scarap core ./faas ./core -t 16
    scarap sample ./faas ./core/genes.tsv ./representatives -m 100 -t 16


The representative genomes will be stored in
``representatives/seeds.txt``.

Important remark: by default, the per-gene amino acid identity values
are estimated from alignment scores per column by MMseqs (`alignment
mode
1 <https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity>`__).
For AAI values > 90%, these estimations are on average smaller than the
exact values. It is possible to calculate exact AAI values by adding the
``--exact`` option to the sample module, but this will be slower.

You can also sample genomes based on average nucleotide identity (ANI)
or core nucleotide identity (cANI) values. In that case, you need to
supply nucleotide sequences of predicted genes, e.g. in a folder
``ffns``:

::

    scarap core ./faas ./core -t 16
    scarap sample ./ffns ./core/genes.tsv ./representatives -m 100 -t 16

Building a “supermatrix” for a set of genomes
----------------------------------------------

You can build a concatenated alignment of core genes (“supermatrix”) for
a set of genomes using the ``concat`` module.

Let’s say you want to build a supermatrix of 100 core genes for a set of
genomes, with faa files given in a folder ``faas``:

::

    scarap core ./faas ./core -m 100 -t 16
    scarap concat ./faas ./core/genes.tsv ./supermatrix -t 16


The amino acid supermatrix will be saved in
``supermatrix/supermatrix_aas.fasta``.
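
Before passing the supermatrix to a phylogeny tool, a quick sanity check is to confirm that all sequences have the same length, as expected for an alignment. The sketch below is an illustration, not part of SCARAP: it assumes a standard FASTA file with one record per genome, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def alignment_length(records):
    """Return the shared sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"expected one alignment length, found {sorted(lengths)}")
    return lengths.pop()


if __name__ == "__main__":
    # Made-up records; in practice, pass open("supermatrix/supermatrix_aas.fasta").
    example = [">genome1", "MK-LV", ">genome2", "MKALV"]
    records = read_fasta(example)
    print(len(records), "sequences of length", alignment_length(records))
```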

If you want to produce a nucleotide-level supermatrix, this can be
achieved by giving a folder with ffn files (nucleotide sequences of
predicted genes) as an additional argument:

::

    scarap concat ./faas ./core/genes.tsv ./supermatrix -n ./ffns -t 16


The nucleotide-level supermatrix will be saved in
``supermatrix/supermatrix_nucs.fasta``.
33 changes: 33 additions & 0 deletions docs/index.rst
@@ -0,0 +1,33 @@
.. SCARAP documentation master file, created by
   sphinx-quickstart on Fri Oct 18 12:50:18 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

SCARAP: pangenome inference and comparative genomics of prokaryotes
====================================================================

.. image:: _static/img/scarap_logo.png
   :align: center
   :width: 200
   :alt: SCARAP logo

SCARAP is a toolkit with modules for various tasks related to comparative genomics of prokaryotes. SCARAP has been designed to be fast and scalable. Its main feature is pangenome inference, but it also has modules for direct core genome inference (without inferring the full pangenome), subsampling representatives from a (large) set of genomes and constructing a concatenated core gene alignment ("supermatrix") that can later be used for phylogeny inference. SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level.


.. toctree::
   :maxdepth: 2
   :caption: Contents:

   install
   getting-started
   scarap
   feedback
   citation
   license

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
39 changes: 39 additions & 0 deletions docs/install.rst
@@ -0,0 +1,39 @@
Installing SCARAP
=================

The easiest way to get started is to install SCARAP using conda.

Conda
-----

First, create and activate a new conda environment: ::

    conda create -n scarap python=3.11
    conda activate scarap

Then install from the recipe on bioconda: ::

    conda install bioconda::scarap


Pip
---

First make sure that MAFFT and MMseqs2 are properly installed. Then install SCARAP with pip: ::

    pip install scarap


Manual install
--------------

You can also install SCARAP manually by cloning it and installing the following dependencies:

* `Python3 <https://www.python.org/>`_ version \>= 3.6.7 and < 3.13
* `numpy <https://numpy.org/>`_ version \>= 1.16.5
* `scipy <https://www.scipy.org/>`_ version \>= 1.4.1
* `pandas <https://pandas.pydata.org/>`_ version \>= 1.5.3
* `biopython <https://biopython.org/>`_ version \>= 1.67
* `ete3 <http://etetoolkit.org/>`_ version \>= 3.1.1
* `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_ version \>= 7.407
* `MMseqs2 <https://github.com/soedinglab/MMseqs2>`_ release 11, 12 or 13
5 changes: 5 additions & 0 deletions docs/license.rst
@@ -0,0 +1,5 @@
License
-------

SCARAP is free software, licensed under
`GPLv3 <https://github.com/SWittouck/scarap/blob/master/LICENSE>`__.
35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
22 changes: 22 additions & 0 deletions docs/modules.rst
@@ -0,0 +1,22 @@
Modules
-------

SCARAP is able to perform a number of specific tasks related to
prokaryotic comparative genomics (see also ``scarap -h``).

The most useful modules of SCARAP are probably the following:

- ``pan``: infer a pangenome from a set of faa files
- ``core``: infer a core genome from a set of faa files
- ``sample``: sample a subset of representative genomes

Modules for other useful tasks are also available:

- ``build``: build a profile database for a core/pangenome
- ``search``: search query genes in a profile database
- ``checkgenomes``: assess the genomes in a core genome; returns a table with the columns ``genome``, ``completeness`` and ``contamination``
- ``checkgroups``: assess the orthogroups in a core genome
- ``filter``: filter the genomes/orthogroups in a pangenome
- ``concat``: construct a concatenated core orthogroup alignment from a
  core genome
- ``fetch``: fetch sequences and store in fasta per orthogroup
3 changes: 3 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,3 @@
sphinx
sphinx-rtd-theme
sphinxcontrib-jquery