SWittouck · TheOafidian · Oct 18, 2024 · Oct 18, 2024 · Oct 18, 2024 · Oct 3, 2025
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -0,0 +1,26 @@
+name: documentation
+
+on: [push, pull_request, workflow_dispatch]
+
+permissions:
+  contents: write
+
+jobs:
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+      - name: Install dependencies
+        run: |
+          pip install sphinx sphinx_rtd_theme sphinxcontrib-jquery myst_parser
+      - name: Sphinx build
+        run: |
+          sphinx-build docs _build
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        with:
+          publish_branch: gh-pages
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: _build/
+          force_orphan: true
diff --git a/docs/.gitignore b/docs/.gitignore
@@ -0,0 +1 @@
+_build
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/_static/img/scarap_logo.png b/docs/_static/img/scarap_logo.png
diff --git a/docs/_static/js/custom.js b/docs/_static/js/custom.js
@@ -0,0 +1,3 @@
+$(document).ready(function () {
+    $('a.external').attr('target', '_blank');
+});
diff --git a/docs/citation.rst b/docs/citation.rst
@@ -0,0 +1,6 @@
+Citation
+--------
+
+If you've found SCARAP useful in your own work, please cite the following manuscript: ::
+
+    Wittouck et al. (2025) "SCARAP: scalable cross-species comparative genomics of prokaryotes", Bioinformatics, Volume 41, Issue 1, January 2025, btae735, https://doi.org/10.1093/bioinformatics/btae735
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,40 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = 'SCARAP'
+copyright = '2024, Stijn Wittouck'
+author = 'Stijn Wittouck'
+release = '0.4.0'
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = [
+  # Sphinx built-in extensions; each one is listed exactly once
+  'sphinx.ext.autodoc',
+  'sphinx.ext.viewcode',
+  'sphinx.ext.napoleon',
+  'sphinx.ext.autosummary',
+]
+
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# Path setup
+import os
+import sys
+sys.path.insert(0, os.path.abspath('..'))
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = 'sphinx_rtd_theme'
+html_static_path = ['_static']
+html_js_files = [
+    'js/custom.js',
+]
diff --git a/docs/feedback.rst b/docs/feedback.rst
@@ -0,0 +1,6 @@
+Feedback
+--------
+
+All feedback and suggestions are very welcome at
+stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file
+`issues <https://github.com/SWittouck/scarap/issues>`__.
diff --git a/docs/getting-started.rst b/docs/getting-started.rst
@@ -0,0 +1,110 @@
+Getting started
+===============
+
+Obtaining data
+--------------
+
+SCARAP works mainly with faa files: amino acid sequences of all (predicted) genes in a genome assembly. You can obtain faa files in at least three ways:
+
+* You can run a gene prediction tool like `Prodigal <https://github.com/hyattpd/Prodigal>`_ on genome assemblies of your favorite strains, or a complete annotation pipeline such as `Prokka <https://github.com/tseemann/prokka>`_ or `Bakta <https://github.com/oschwengers/bakta>`_.
+* You can search your favorite taxon on `NCBI genome <https://www.ncbi.nlm.nih.gov/datasets/genome/>`_ and manually download assemblies in the following way: click on an assembly, click "Download", select "Protein (FASTA)" as file type and click "Download" again.
+* Given a list of assembly accession numbers (i.e. starting with GCA/GCF), you can use `ncbi-genome-download <https://github.com/kblin/ncbi-genome-download/>`_ to download the corresponding faa files.
+
+Given a list of accessions in a file called ``accessions.txt``, you can use ncbi-genome-download to download faa files as follows: ::
+
+    ncbi-genome-download -P \
+    --assembly-accessions accessions.txt \
+    --section genbank \
+    --formats protein-fasta \
+    bacteria
+
+Inferring a pangenome
+---------------------
+
+If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder ``faas``, you can infer the pangenome using 16 threads by running: ::
+
+    scarap pan ./faas ./pan -t 16
+
+The pangenome will be stored in ``pan/pangenome.tsv``.
+The pangenome is stored in a "long format": a table with the columns gene, genome and orthogroup.
+
+Inferring a core genome
+-----------------------
+
+If you want to infer the core genome of a set of genomes directly, without inferring the full pangenome first, you can also do this with SCARAP. You might want to do this because it is faster and because you sometimes don't need more than the core genome (e.g. when you are planning to infer a phylogeny).
+
+You can infer the core genome, given a set of faa files in a folder ``faas``, in the following way: ::
+
+    scarap core ./faas ./core -t 16
+
+The core genome will be stored in ``core/genes.tsv``.
+
+Subsampling a set of genomes
+----------------------------
+
+If you have a (large) dataset of genomes that you wish to subsample in a
+representative way, you can do this using the ``sample`` module. You
+will need to precompute the pangenome or core genome to do this; SCARAP
+calculates average amino acid identity (AAI) or core amino acid identity
+(cAAI) values in the subsampling process, and it uses the single-copy
+orthogroups from a pan- or core genome to do this.
+
+For example, if you want to sample 100 genomes given a set of faa files
+in a folder ``faas``:
+
+::
+
+     scarap core ./faas ./core -t 16
+     scarap sample ./faas ./core/genes.tsv ./representatives -m 100 -t 16
+
+
+The representative genomes will be stored in
+``representatives/seeds.txt``.
+
+Important remark: by default, the per-gene amino acid identity values
+are estimated from alignment scores per column by MMseqs (`alignment
+mode
+1 <https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity>`__).
+For AAI values > 90%, these estimations are on average smaller than the
+exact values. It is possible to calculate exact AAI values by adding the
+``--exact`` option to the sample module, but this will be slower.
+
+You can also sample genomes based on average nucleotide identity (ANI)
+or core nucleotide identity (cANI) values. In that case, you need to
+supply nucleotide sequences of predicted genes, e.g. in a folder
+``ffns``:
+
+::
+
+     scarap core ./faas ./core -t 16
+     scarap sample ./ffns ./core/genes.tsv ./representatives -m 100 -t 16
+
+Building a “supermatrix” for a set of genomes
+----------------------------------------------
+
+You can build a concatenated alignment of core genes (“supermatrix”) for
+a set of genomes using the ``concat`` module.
+
+Let’s say you want to build a supermatrix of 100 core genes for a set of
+genomes, with faa files given in a folder ``faas``:
+
+::
+
+     scarap core ./faas ./core -m 100 -t 16
+     scarap concat ./faas ./core/genes.tsv ./supermatrix -t 16
+
+
+The amino acid supermatrix will be saved in
+``supermatrix/supermatrix_aas.fasta``.
+
+If you want to produce a nucleotide-level supermatrix, this can be
+achieved by giving a folder with ffn files (nucleotide sequences of
+predicted genes) as an additional argument:
+
+::
+
+     scarap concat ./faas ./core/genes.tsv ./supermatrix -n ./ffns -t 16
+
+
+The nucleotide-level supermatrix will be saved in
+``supermatrix/supermatrix_nucs.fasta``.
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,33 @@
+.. SCARAP documentation master file, created by
+   sphinx-quickstart on Fri Oct 18 12:50:18 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+SCARAP: pangenome inference and comparative genomics of prokaryotes
+====================================================================
+
+.. image:: _static/img/scarap_logo.png
+   :align: center
+   :width: 200
+   :alt: SCARAP logo
+
+SCARAP is a toolkit with modules for various tasks related to comparative genomics of prokaryotes. SCARAP has been designed to be fast and scalable. Its main feature is pangenome inference, but it also has modules for direct core genome inference (without inferring the full pangenome), subsampling representatives from a (large) set of genomes and constructing a concatenated core gene alignment ("supermatrix") that can later be used for phylogeny inference. SCARAP has been designed for prokaryotes but should work for eukaryotic genomes as well. It can handle large genome datasets on a range of taxonomic levels; it has been tested on datasets with prokaryotic genomes from the species to the order level.
+
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   install
+   getting-started
+   scarap
+   feedback
+   citation
+   license
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/docs/install.rst b/docs/install.rst
@@ -0,0 +1,39 @@
+Installing SCARAP
+=================
+
+The easiest way to get started is to install SCARAP using conda.
+
+Conda
+-----
+
+First, create and activate a new conda environment: ::
+
+    conda create -n scarap python=3.11
+    conda activate scarap
+
+Then install from the recipe on bioconda: ::
+
+    conda install bioconda::scarap
+
+
+Pip
+---
+
+First make sure that MAFFT and MMseqs2 are properly installed. Then install SCARAP with pip: ::
+
+    pip install scarap
+
+
+Manual install
+--------------
+
+You can also install SCARAP manually by cloning it and installing the following dependencies:
+
+* `Python3 <https://www.python.org/>`_ version \>= 3.6.7 and < 3.13
+    * `numpy <https://numpy.org/>`_ version \>= 1.16.5
+    * `scipy <https://www.scipy.org/>`_ version \>= 1.4.1
+    * `pandas <https://pandas.pydata.org/>`_ version \>= 1.5.3
+    * `biopython <https://biopython.org/>`_ version \>= 1.67
+    * `ete3 <http://etetoolkit.org/>`_ version \>= 3.1.1
+* `MAFFT <https://mafft.cbrc.jp/alignment/software/>`_ version \>= 7.407
+* `MMseqs2 <https://github.com/soedinglab/MMseqs2>`_ release 11, 12 or 13
diff --git a/docs/license.rst b/docs/license.rst
@@ -0,0 +1,5 @@
+License
+-------
+
+SCARAP is free software, licensed under
+`GPLv3 <https://github.com/SWittouck/scarap/blob/master/LICENSE>`__.
diff --git a/docs/make.bat b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
diff --git a/docs/modules.rst b/docs/modules.rst
@@ -0,0 +1,22 @@
+Modules
+-------
+
+SCARAP is able to perform a number of specific tasks related to
+prokaryotic comparative genomics (see also ``scarap -h``).
+
+The most useful modules of SCARAP are probably the following:
+
+-  ``pan``: infer a pangenome from a set of faa files
+-  ``core``: infer a core genome from a set of faa files
+-  ``sample``: sample a subset of representative genomes
+
+Modules for other useful tasks are also available:
+
+-  ``build``: build a profile database for a core/pangenome
+-  ``search``: search query genes in a profile database
+-  ``checkgenomes``: assess the genomes in a core genome. Returns a table with ``genome``, ``completeness`` and ``contamination``.
+-  ``checkgroups``: assess the orthogroups in a core genome
+-  ``filter``: filter the genomes/orthogroups in a pangenome
+-  ``concat``: construct a concatenated core orthogroup alignment from a
+   core genome
+-  ``fetch``: fetch sequences and store in fasta per orthogroup
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -0,0 +1,3 @@
+sphinx
+sphinx-rtd-theme
+sphinxcontrib-jquery