GeneCAD is an in silico genome annotation pipeline for plants built on top of the DNA foundation model PlantCAD2 (Zhai et al. 2025).
GeneCAD can be installed one of two ways:
- Using a pre-built Docker image
- Using a virtual environment via uv
We also provide instructions for executing it one of two ways:
- Using an ephemeral cloud cluster via SkyPilot
- Using an existing SLURM/HPC cluster
See the sections below for more details.
This example demonstrates how to run the GeneCAD inference pipeline for a single chromosome via Docker using the image published at ghcr.io/plantcad/genecad:
# Clone the GeneCAD repository
git clone --single-branch https://github.com/plantcad/genecad && cd genecad
# Optionally, checkout a release tag for greater reproducibility
# git checkout v0.0.11
# Pull the image
docker pull ghcr.io/plantcad/genecad:latest
# Run inference on Arabidopsis chromosome 4
docker run --rm --gpus all \
-v $(pwd):/workspace -w /workspace \
ghcr.io/plantcad/genecad:latest \
bash examples/scripts/run_inference.sh
# Run inference on Arabidopsis chromosome 4 using large base PlantCAD2 model
docker run --rm --gpus all \
-v $(pwd):/workspace -w /workspace \
-e HEAD_MODEL_PATH=plantcad/GeneCAD-l8-d768-PC2-Large \
-e BASE_MODEL_PATH=kuleshov-group/PlantCAD2-Large-l48-d1536 \
ghcr.io/plantcad/genecad:latest \
bash examples/scripts/run_inference.sh
See examples/scripts/run_inference.sh for more information on what this does. Note that the Docker container only contains the environment, not source code. Source code is mounted directly so that any changes in the cloned GeneCAD repository take effect immediately.
For convenience, see genecad_inference_pipeline.log for logs from a complete run of this example so that you know what to expect.
Installing GeneCAD requires Linux, CUDA, and uv. Use these steps to create a compatible environment:
# Install uv (https://docs.astral.sh/uv/getting-started/installation/)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment (run in project root)
uv venv # Initializes Python 3.12 environment
# Install dependencies
# The mamba/causal-conv1d dependencies can take ~3-30 minutes
# to build from source, depending on CPUs available
uv sync --extra torch --extra mamba
PyTorch is included as an optional dependency that can be left out if you wish to install it yourself. This option is provided because the managed torch version in this project is pinned to 2.7.1 built with CUDA 12.8, which is restrictive but necessary to ensure compatibility with the frozen mamba (2.2.4) and causal-conv1d (1.5.0.post8) dependencies. Newer CUDA and PyTorch versions may work, but are not tested or officially supported.
Note also that mamba and causal-conv1d are currently forced to build from source (without build isolation). This can be slow, but it is more reliable than using pre-built wheels: those wheels, at least for mamba 2.2.4, often install correctly and then fail later with inscrutable errors, probably because they are specific only to major CUDA versions. Either way, building from source is safer, and you can always cache the resulting wheels if you wish. To do so, add the following to this project's pyproject.toml:
[tool.uv.sources]
mamba-ssm = { path = "path/to/mamba_ssm-2.2.4-*.whl" }
causal-conv1d = { path = "path/to/causal_conv1d-1.5.0.post8-*.whl" }
# OR upload the wheels somewhere and use a remote URL:
mamba-ssm = { url = "https://.../mamba_ssm-2.2.4-*.whl" }
causal-conv1d = { url = "https://.../causal_conv1d-1.5.0.post8-*.whl" }
If you do not already have a system with an NVIDIA GPU, or are interested in convenient ways to get one (or several), we recommend using SkyPilot. This provides an interface for optimizing costs and resource utilization across many cloud providers.
Here are steps you can take from any Mac, Linux or Windows local workstation after installing uv:
uv pip install "skypilot[lambda]>=0.10.3"
# Clear local SkyPilot state, if any
sky api stop; [ -d ~/.sky ] && rm -rf ~/.sky
# Deploy single-node cluster
sky launch --num-nodes 1 --yes --no-setup \
--cluster genecad examples/configs/cluster.sky.yaml
# Run example inference pipeline and evaluation
sky exec --cluster genecad examples/configs/task.sky.yaml
# SSH to the node
ssh genecad
# Terminate the node
sky down genecad
Higher-end GPUs are typically more cost-efficient for production runs but wasteful during development, so we recommend SkyPilot for conveniently switching between cheap sources of both. See the Throughput section below for context on the time and cost required to run GeneCAD on different GPUs. SkyPilot makes shopping around for these much more convenient, e.g.:
SkyPilot GPU rates
> sky show-gpus L4:1 # low-end development GPU
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE REGION
L4 1.0 RunPod 1x_L4_SECURE - 4 24GB $ 0.440 - CA
L4 1.0 GCP g2-standard-4 24GB 4 16GB $ 0.705 $ 0.282 us-east4
L4 1.0 AWS g6.xlarge 22GB 4 16GB $ 0.805 $ 0.232 us-east-2
L4 1.0 AWS g6.2xlarge 22GB 8 32GB $ 0.978 $ 0.323 us-east-2
L4 1.0 AWS g6.4xlarge 22GB 16 64GB $ 1.323 $ 0.475 us-east-2
L4 1.0 AWS gr6.4xlarge 22GB 16 128GB $ 1.539 $ 0.423 us-east-2
L4 1.0 AWS g6.8xlarge 22GB 32 128GB $ 2.014 $ 0.673 us-east-2
L4 1.0 AWS gr6.8xlarge 22GB 32 256GB $ 2.446 $ 0.558 us-east-2
L4 1.0 AWS g6.16xlarge 22GB 64 256GB $ 3.397 $ 1.200 us-east-1
> sky show-gpus H100:1 # high-end production GPU
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE REGION
H100 1.0 Hyperbolic 1x-H100-75-722 80GB 75 722GB $ 1.290 - default
H100 1.0 Vast 1x-H100_SXM-32-65536 80GB 32 64GB $ 1.870 $ 0.000 , US, NA
H100 1.0 Vast 1x-H100_NVL-32-65536 94GB 32 64GB $ 2.250 $ 0.000 Bulgaria, BG, EU
H100 1.0 RunPod 1x_H100_SECURE - 16 80GB $ 2.390 - CA
H100 1.0 Lambda gpu_1x_h100_pcie 80GB 26 200GB $ 2.490 - europe-central-1
H100 1.0 Lambda gpu_1x_h100_sxm5 80GB 26 225GB $ 3.290 - europe-central-1
H100 1.0 Cudo epyc-genoa-h100-nvl-pcie_1x2v4gb 94GB 2 4GB $ 2.499 - au-melbourne-1
H100 1.0 Cudo epyc-genoa-h100-nvl-pcie_1x4v8gb 94GB 4 8GB $ 2.529 - au-melbourne-1
H100 1.0 Cudo epyc-genoa-h100-nvl-pcie_1x12v24gb 94GB 12 24GB $ 2.646 - au-melbourne-1
H100 1.0 Cudo epyc-genoa-h100-nvl-pcie_1x24v48gb 94GB 24 48GB $ 2.823 - au-melbourne-1
H100 1.0 nebius gpu-h100-sxm_1gpu-16vcpu-200gb 80GB 16 200GB $ 2.950 $ 1.250 eu-north1
H100 1.0 GCP a3-highgpu-1g 80GB 26 234GB $ 5.383 $ 2.525 us-central1
H100 1.0 Paperspace H100 - 15 80GB $ 5.950 - East Coast (NY2)
H100 1.0 DO gpu-h100x1-80gb 80GB 20 240GB $ 6.740 - tor1
H100 1.0 AWS p5.4xlarge 80GB 16 256GB $ 6.880 - us-east-1
H100 1.0 Azure Standard_NC40ads_H100_v5 - 40 320GB $ 6.980 $ 6.980 eastus
If you wish to use multiple nodes with SkyPilot, you can construct a Task like this, using environment variables for distributed execution:
workdir: .
run: |
export WORLD_SIZE=$SKYPILOT_NUM_NODES
export RANK=$SKYPILOT_NODE_RANK
... # Execute GeneCAD pipeline as before
The RANK and WORLD_SIZE environment variables are not used by any part of the pipeline other than the actual prediction step, which writes Xarray datasets in the format predictions.{rank}.zarr (see Outputs for details). GeneCAD does not currently support reading these files in a SkyPilot cluster. GeneCAD was built on an HPC/SLURM cluster with shared storage (see Using SLURM for details), so you would need to copy all of these datasets to a networked filesystem (depending on your cloud provider) or gather them on one node before continuing the pipeline. Feel free to open an issue if you want to use a distributed cluster with no shared storage, and we can likely help make that work.
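As a rough sketch of that gather step, assuming each rank wrote its store under a per-node output directory (all paths below are illustrative; substitute rsync or scp for cp when the nodes are remote):

```shell
# Hypothetical gather: copy each rank's predictions.{rank}.zarr store
# from per-node output directories into one shared directory
SHARED=shared/pipeline/predictions.zarr   # illustrative destination
mkdir -p "$SHARED"
for dir in node0/output node1/output; do  # illustrative per-node paths
  cp -r "$dir"/predictions.*.zarr "$SHARED"/
done
ls "$SHARED"
```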
To use SLURM, we recommend following the uv instructions above to initialize a virtual environment. You will need to source the appropriate TCL/Lmod modules for your cluster, e.g. CUDA 12.8 and Python 3.12. Other combinations will likely work as well, as long as they are newer than CUDA 12.4, Python 3.11 and PyTorch 2.5.1 (our earliest tested versions).
See the Docker section above for context on the basics of running an example pipeline. The rest of these instructions focus on distributing work in a more realistic setting. For example, it is often simplest to begin by running the GeneCAD pipeline in an interactive SLURM allocation like this:
# Allocate interactive session:
salloc --partition=gpu-queue --nodes=4 --ntasks=4
# Within interactive session:
cd /path/to/GeneCAD
INPUT_FILE=/path/to/input.fasta \
OUTPUT_DIR=/path/to/output \
SPECIES_ID=species_id \
CHR_ID=chromosome_id \
LAUNCHER="srun python" \
make -f pipelines/prediction all
This will distribute the computation of model predictions across nodes (4 nodes in this example). For offline batch scheduling with sbatch, we suggest following this pattern:
# Create SLURM wrapper script
cat << 'EOF' > genecad.slurm
#!/bin/bash
export RANK=$SLURM_PROCID
export WORLD_SIZE=$SLURM_NNODES
export LAUNCHER="srun python"
exec "$@"
EOF
# Schedule processing for a single FASTA file across 4 nodes
sbatch \
--partition=gpu-queue \
--nodes=4 \
--ntasks=4 \
genecad.slurm \
INPUT_FILE=/path/to/input.fasta \
OUTPUT_DIR=/path/to/output \
SPECIES_ID=species_id \
CHR_ID=chromosome_id \
make -f pipelines/prediction all
The GeneCAD pipeline relies only on the RANK and WORLD_SIZE settings to distribute the computation of model predictions. All pre- and post-processing steps around that require only CPUs and are executed on a single host. Depending on your cluster topology, you may wish to break up the pipeline like this instead:
export INPUT_FILE=/path/to/input.fasta
export OUTPUT_DIR=/path/to/output
export SPECIES_ID=species_id
export CHR_ID=chromosome_id
# CPU-only preprocessing (FASTA -> Xarray)
sbatch -p cpu-queue -N 1 -n 1 genecad.slurm make -f pipelines/prediction sequences
# Distributed prediction generation on GPU nodes
sbatch -p gpu-queue -N 10 -n 10 genecad.slurm make -f pipelines/prediction predictions
# CPU-only post-processing (Xarray -> GFF)
sbatch -p cpu-queue -N 1 -n 1 genecad.slurm make -f pipelines/prediction annotations
Note that currently only one GPU per node is supported. This means that you cannot invoke make predictions with --ntasks-per-node greater than 1.
The inference pipeline requires these inputs at a minimum:
INPUT_FILE=/path/to/input.fasta \
OUTPUT_DIR=/path/to/output \
SPECIES_ID=species_id \
CHR_ID=chromosome_id \
make -f pipelines/prediction all
Note that the SPECIES_ID variable is used for more informative logging and is added to attributes in the resulting Xarray datasets (along with CHR_ID). It can be any descriptive string, e.g. athaliana or arabidopsis_thaliana. The CHR_ID variable must be a string matching the name of the target chromosome in the input FASTA file; it too will be included in logging and attributes.
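If you are unsure which CHR_ID values are valid, you can list the sequence names directly from the FASTA headers. A quick self-contained demo (the file and ID here are made up):

```shell
# Create a throwaway FASTA and list its sequence IDs; the first
# whitespace-delimited token after '>' is the name CHR_ID must match
cat > demo.fasta << 'EOF'
>4 Arabidopsis thaliana chromosome 4
ACGTACGT
EOF
grep '^>' demo.fasta | sed 's/^>//; s/ .*//'   # prints: 4
rm demo.fasta
```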
See the pipelines/prediction Makefile for more details on other configurable parameters.
The pipeline consists of 3 main steps:
- Extract sequences from the input FASTA file
- Generate token and feature logits from the GeneCAD classifier
- Generate gene/transcript/feature annotations from the token and feature logits (as GFF)
Each of these has an associated make target, i.e. make sequences, make predictions, make annotations (respectively).
The shell commands run by make can be seen with the --dry-run flag (or its alias -n). This tells make to print all the commands that would be executed, which is helpful for debugging or for clarity on what the pipeline actually does. It can also be useful for executing individual steps manually.
Additionally, this is very useful for determining which targets already exist. For example, if you run make predictions instead of make all, the pipeline will only run up to the target for generating token logits from the GeneCAD classifier. Running make -n annotations (or make -n all, which is an alias) would then show that only the GFF export needs to run rather than the whole pipeline.
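This behavior is easy to see with a toy Makefile (demo.mk and out.txt are made-up names, not part of GeneCAD):

```shell
# Toy Makefile: 'all' depends on out.txt, which is built by one command
printf 'all: out.txt\nout.txt:\n\techo hello > out.txt\n' > demo.mk
make -n -f demo.mk   # prints the command that would build out.txt
touch out.txt
make -n -f demo.mk   # now reports nothing to be done; the target exists
rm -f demo.mk out.txt
```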
The inference pipeline will generate results (in $OUTPUT_DIR) with the following structure:
tree -L 5 -I '[0-9]*|[.]' $OUTPUT_DIR
├── pipeline
│ ├── intervals.zarr
│ │ ├── intervals
│ │ │ ├── decoding
│ │ │ ├── entity_index
│ │ │ ├── entity_name
│ │ │ ├── interval
│ │ │ ├── start
│ │ │ ├── stop
│ │ │ └── strand
│ │ └── sequences
│ │ ├── feature
│ │ ├── feature_logits
│ │ ├── feature_predictions
│ │ ├── sequence
│ │ ├── strand
│ │ └── token
│ ├── predictions__raw__feat_len_2__gene_len_30.gff
│ ├── predictions__raw__feat_len_2__gene_len_30__has_req_feats.gff
│ ├── predictions__raw__feat_len_2.gff
│ ├── predictions__raw.gff
│ ├── predictions.zarr
│ │ └── predictions.0.zarr
│ │ ├── negative
│ │ │ ├── feature
│ │ │ ├── feature_logits
│ │ │ ├── feature_predictions
│ │ │ ├── sequence
│ │ │ ├── token
│ │ │ ├── token_logits
│ │ │ └── token_predictions
│ │ └── positive
│ │ ├── feature
│ │ ├── feature_logits
│ │ ├── feature_predictions
│ │ ├── sequence
│ │ ├── token
│ │ ├── token_logits
│ │ └── token_predictions
│ └── sequences.zarr
│ └── athaliana
└── predictions.gff
The predictions.gff file contains the final annotations, while intermediate results in pipeline are retained for debugging or further analysis. If needed, tokenized sequences are available in sequences.zarr and logits from the underlying GeneCAD model are present in predictions.zarr.
This table shows observed GeneCAD inference throughput on different devices and clouds. Of note:
- Datacenter-class GPUs (A100s/H100s) are 3-6x more cost-efficient than older, highly available development GPUs like NVIDIA L4's on Google Cloud
- GeneCAD-Large is ~3x more expensive to run than GeneCAD-Small
| Provider | Instance | Model | Throughput (bp/s) | Cost ($/hr) | Cost per Mbp ($) |
|---|---|---|---|---|---|
| GCE | g2-standard-32 / nvidia-l4 | GeneCAD-Small | 4,063 | 1.7344 | 0.1186 |
| Lambda | gpu_1x_a100_sxm4 | GeneCAD-Small | 17,687 | 1.2900 | 0.0202 |
| Lambda | gpu_1x_h100_pcie | GeneCAD-Small | 22,020 | 2.4900 | 0.0314 |
| Lambda | gpu_2x_h100_sxm5 | GeneCAD-Small | 34,290 | 6.3800 | 0.0517 |
| Lambda | gpu_2x_h100_sxm5 | GeneCAD-Large | 11,223 | 6.3800 | 0.1579 |
These estimates can be used to extrapolate costs and GPU hours necessary for a few reference genomes. The estimated costs below all assume the lowest Cost per Mbp ($) observed above on Lambda A100 40G SXM4 GPUs (17,687 bp/s, $0.0202 / Mbp) with the GeneCAD-Small model. Any cost can be multiplied by ~3 to approximate the cost of using the GeneCAD-Large model instead.
| Genome | Length (Mbp) | GPU Hours | Cost ($) |
|---|---|---|---|
| Arabidopsis thaliana (thale cress) | 120 | 2.12 | 2.42 |
| Zea mays (corn) | 2,182 | 34.27 | 44.04 |
| Hordeum vulgare (barley) | 4,224 | 66.34 | 85.32 |
| Triticum aestivum (wheat) | 14,577 | 227.00 | 294.46 |
For example, this table shows that it costs about $44 to run GeneCAD-Small on the Zea mays genome to produce a predicted annotation (GFF) file.
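That arithmetic can be reproduced directly from the rate above; minor differences from the table come from rounding the per-Mbp rate:

```shell
# Extrapolate GPU hours and cost for Zea mays (~2,182 Mbp) at
# 17,687 bp/s and ~$0.0202 per Mbp (GeneCAD-Small, Lambda A100)
awk 'BEGIN {
  mbp = 2182; bp_per_s = 17687; usd_per_mbp = 0.0202
  printf "GPU hours: %.2f\n", mbp * 1e6 / (bp_per_s * 3600)
  printf "Cost: $%.2f\n", mbp * usd_per_mbp
}'
# GPU hours: 34.27
# Cost: $44.08
```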
After generating initial predictions, you can refine them using the ReelProtein pipeline. This post-processing step merges fragmented gene predictions, generates protein embeddings using ProtT5, scores sequences using XGBoost models, and produces a filtered GFF file.
The pipeline handles model downloading automatically via Hugging Face.
uv run python scripts/refine.py \
--gff /path/to/predictions.gff \
--genome /path/to/genome.fa \
--out /path/to/refined_output.gff
A simple way to evaluate predicted annotations is with the gffcompare tool. It can either be built from source or installed via conda. See examples/scripts/run_evaluation.sh for an example of how to use it, and the Dockerfile for an example installation from source.
While gffcompare offers a useful starting point for evaluating in silico annotations, it cannot assess regions independently of predicted UTRs. UTRs are known to be very difficult to predict accurately from DNA alone, and their predictions are of limited utility in many applications. As a result, we offer a separate evaluation tool for assessing whole-transcript annotations while excluding UTRs. This tool produces metrics aggregated to several levels (much like gffcompare), as well as a transcript_cds score that measures exact prediction of intron and exon structure over translated regions only. Here is some example usage:
python scripts/gff.py evaluate \
--reference /path/to/reference.gff \
--input /path/to/predictions.gff \
--output /path/to/results
cat /path/to/results/gffeval.stats.tsv
# level precision recall f1
# transcript_cds 77.34 47.39 58.77
# transcript_intron 68.70 42.09 52.20
# transcript 0.00 0.00 0.00
# exon_cds_longest_transcript_only 83.06 60.16 69.78
# exon_longest_transcript_only 60.05 40.73 48.54
# intron_cds_longest_transcript_only 87.58 66.24 75.43
# intron_longest_transcript_only 89.21 62.00 73.16
# exon_cds 94.04 47.10 62.77
# exon 65.14 28.72 39.86
# intron_cds 97.29 51.15 67.05
# intron 96.32 45.36 61.68
For each metric level, three values are reported: precision, recall, and F1 score. The GFF specified by --reference contains all true features, while the GFF specified by --input provides the predicted features. Below is a definition of a match, or true positive, for each level.
| Level | Description |
|---|---|
| transcript_cds | Instances where a predicted transcript's CDS exactly matches the true transcript's CDS. Discrepancies in the UTRs are allowed. |
| transcript_intron | Instances where a predicted transcript's intron boundaries exactly match the true transcript's intron boundaries. Discrepancies in the transcription start and stop are allowed, as long as they do not alter splice sites. |
| transcript | Instances where a predicted transcript exactly matches the true transcript, including CDS and UTRs. |
| exon_cds_longest_transcript_only | Instances where a predicted exon's boundaries exactly match the true exon's boundaries, taking into account only the exons from the longest true transcript and excluding UTR exons. If an exon is split between UTR and CDS sequence, only the CDS portion is considered. |
| exon_longest_transcript_only | Instances where a predicted exon's boundaries exactly match a true exon's boundaries, taking into account only the exons from the longest true transcript. |
| intron_cds_longest_transcript_only | Instances where a predicted intron's boundaries exactly match the true intron's boundaries, taking into account only the introns from the longest true transcript and excluding introns that border UTR sequence. |
| intron_longest_transcript_only | Instances where a predicted intron's boundaries exactly match a true intron's boundaries, taking into account only the introns from the longest true transcript. |
| exon_cds | Instances where a predicted exon's boundaries exactly match the true exon's boundaries, excluding UTR exons. If an exon is split between UTR and CDS sequence, only the CDS portion is considered. Exons from alternative transcripts are considered. |
| exon | Instances where a predicted exon's boundaries exactly match a true exon's boundaries. Exons from alternative transcripts are considered. |
| intron_cds | Instances where a predicted intron's boundaries exactly match the true intron's boundaries, excluding introns that border UTR sequence. Introns from alternative transcripts are considered. |
| intron | Instances where a predicted intron's boundaries exactly match a true intron's boundaries. Introns from alternative transcripts are considered. |
Note: for the first three transcript-level metrics, a true positive result is counted whenever the predicted transcript matches at least one of the true transcripts for a gene with alternative splicing.
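For reference, the reported F1 is the harmonic mean of precision and recall; e.g. the transcript_cds row above can be reproduced with:

```shell
# F1 = 2*P*R/(P+R); values from the transcript_cds row in the example
awk 'BEGIN { p = 77.34; r = 47.39; printf "%.2f\n", 2 * p * r / (p + r) }'
# prints 58.77
```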
Use this to initialize a local development environment, with or without a GPU:
uv sync --dev
pre-commit install
# Check types
pyrefly check --summarize-errors
# Run tests that don't require GPUs
pytest -vrs
HF checkpoints should be uploaded to https://huggingface.co/plantcad. Currently, the Lightning checkpoints are used as-is, so there is little to this other than a single upload:
# cd /path/to/checkpoints/small-base
hf upload plantcad/GeneCAD-l8-d768-PC2-Small model.ckpt --repo-type model
# cd /path/to/checkpoints/large-base
hf upload plantcad/GeneCAD-l8-d768-PC2-Large model.ckpt --repo-type model
In the future, this may include a conversion to safetensors/gguf format first.
To build the Docker image, launch a Lambda VM via SkyPilot (which installs Docker), and then build the image from there.
# On Lambda, you need to add default user to docker group
sudo usermod -aG docker ubuntu && newgrp docker
# Build the image
docker build --progress=plain --no-cache -t genecad:v1.0.1 .
# Test the build
docker run --rm --gpus all -v $(pwd):/workspace -w /workspace genecad:v1.0.1 \
bash examples/scripts/run_inference.sh && \
bash examples/scripts/run_evaluation.sh
Publish to GitHub Container Registry:
# Tag and push to GitHub Container Registry
IMAGE=ghcr.io/plantcad/genecad
docker tag genecad:v1.0.1 $IMAGE:v1.0.1
docker tag genecad:v1.0.1 $IMAGE:latest
# Requires a GitHub personal access token with "write:packages"
# stored in GHCR_TOKEN
echo $GHCR_TOKEN | docker login ghcr.io -u <github-username> --password-stdin
docker push $IMAGE:v1.0.1
docker push $IMAGE:latest
This code demonstrates how to reproduce published GeneCAD results for a particular species, i.e. Juglans regia (Walnut) chromosome 1:
# Build environment via uv and source it
source .venv/bin/activate
export INPUT_DIR=data
export OUTPUT_DIR=results
mkdir -p $INPUT_DIR $OUTPUT_DIR
# Download FASTA and GFF files
hf download plantcad/genecad-dev data/fasta/Juglans_regia.Walnut_2.0.dna.toplevel_chr1.fa --repo-type dataset --local-dir .
hf download plantcad/genecad-dev data/gff/Juglans_regia.Walnut_2.0.60_chr1.gff3 --repo-type dataset --local-dir .
# Run inference pipeline in container
docker run --rm --gpus all \
-v $(pwd):/workspace -w /workspace \
-e OUTPUT_DIR \
-e INPUT_FILE=data/fasta/Juglans_regia.Walnut_2.0.dna.toplevel_chr1.fa \
-e SPECIES_ID=jregia -e CHR_ID=1 \
ghcr.io/plantcad/genecad:latest \
make -f pipelines/prediction all # add -n for dry run first to make sure paths are correct
# Run gffcompare (inside or outside container)
docker run --rm \
-v $(pwd):/workspace -w /workspace \
ghcr.io/plantcad/genecad:latest gffcompare \
-r data/gff/Juglans_regia.Walnut_2.0.60_chr1.gff3 \
-C -o $OUTPUT_DIR/gffcompare $OUTPUT_DIR/predictions.gff
cat $OUTPUT_DIR/gffcompare.stats
# #= Summary for dataset: results/predictions.gff
# # Query mRNAs : 2253 in 2253 loci (1786 multi-exon transcripts)
# # (0 multi-transcript loci, ~1.0 transcripts per locus)
# # Reference mRNAs : 3472 in 2589 loci (2896 multi-exon)
# # Super-loci w/ reference transcripts: 1991
# #-----------------| Sensitivity | Precision |
# Base level: 79.7 | 76.7 |
# Exon level: 60.8 | 75.0 |
# Intron level: 76.1 | 92.1 |
# Intron chain level: 44.6 | 72.4 |
# Transcript level: 43.9 | 67.7 |
# Locus level: 58.5 | 67.7 |
# Matching intron chains: 1293
# Matching transcripts: 1525
# Matching loci: 1514
# Missed exons: 2844/15331 ( 18.6%)
# Novel exons: 827/12229 ( 6.8%)
# Missed introns: 2139/12078 ( 17.7%)
# Novel introns: 537/9976 ( 5.4%)
# Missed loci: 594/2589 ( 22.9%)
# Novel loci: 241/2253 ( 10.7%)
# Total union super-loci across all input datasets: 2232
# 2253 out of 2253 consensus transcripts written in results/gffcompare.combined.gtf (0 discarded as redundant)