A lightweight Nextflow pipeline showing how to transform a CSV-based expression matrix into differential expression calls and actionable hits.
flowchart TD
A["Counts CSV (gene × sample)"] --> B["Normalize: log2(CPM+1)"]
A2["Metadata CSV (sample_id, condition)"] --> B
B --> C["Differential Expression (Welch t-test)"]
C --> D["Actionable List CSV (gene_id + annotations)"]
D --> E["Intersect significant genes with actionable list"]
E --> F["Generate plots (PCA, heatmap, volcano, MA)"]
F --> G["Write outputs"]
subgraph Output
G1["normalized_counts.csv"]
G2["differential_expression.csv"]
G3["actionable_hits.csv"]
G4["summary.json and plots/"]
end
G --> G1
G --> G2
G --> G3
G --> G4
Input:
- Counts CSV (genes × samples, first column = gene_id)
- Metadata CSV (columns sample_id and condition, with exactly two conditions)
- Optional actionable list CSV (gene_id + any extra annotations)
- Optional annotations table (gene_id, gene_symbol) to remap IDs to symbols
What the pipeline does:
- Normalises counts to log2(CPM+1).
- Runs differential expression (Welch t-test) between the two conditions.
- Intersects significant genes with your actionable list.
- Writes summary stats and generates PCA, heatmap, volcano, and MA plots.
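
The first two steps above can be sketched in plain Python. This is a minimal illustration, not the code shipped in bin/: the function names log2_cpm and welch_t and the toy counts are assumptions made for the example. It computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom but leaves out the p-value, which needs the t-distribution CDF (for instance via scipy.stats.ttest_ind with equal_var=False).

```python
import math
from statistics import mean, variance


def log2_cpm(counts):
    """Map {gene_id: [raw counts per sample]} to log2(CPM + 1) values.

    Assumes every sample has a non-zero total count.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [math.log2(row[j] / lib_sizes[j] * 1e6 + 1) for j in range(n_samples)]
        for gene, row in counts.items()
    }


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Assumes at least two values per group and a non-zero variance in at
    least one group.
    """
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df


# Toy data (made up): two Normal samples followed by two Tumor samples.
counts = {"GENE_A": [10, 12, 40, 44], "GENE_B": [90, 88, 60, 56]}
normalized = log2_cpm(counts)

normal, tumor = normalized["GENE_A"][:2], normalized["GENE_A"][2:]
log2fc = mean(tumor) - mean(normal)  # fold-change in the normalized space
t_stat, df = welch_t(tumor, normal)
print(f"GENE_A log2FC={log2fc:.2f} t={t_stat:.2f} df={df:.2f}")
```

Because the values are already on a log2 scale, the difference of group means serves directly as the log2 fold-change.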
Output:
- preprocessed/normalized_counts.csv
- differential_expression.csv
- actionable_hits.csv
- summary.json
- plots/ (pca_samples.png, heatmap_top_genes.png, volcano.png, ma_plot.png)
What Nextflow does here:
Nextflow just glues the steps together and manages files. It takes your inputs, runs the Python scripts in the right order (normalize → DE test → actionable filter → plots), passes the correct files between them, manages work directories, and lets you re-run the whole workflow with different inputs or profiles without manually chaining commands.
- main.nf – pipeline definition.
- nextflow.config – defaults and optional conda profile.
- bin/ – helper scripts executed by processes.
- data/raw/ – example gene counts + sample metadata.
- data/reference/ – toy actionable gene list.
- results/ – created on run; holds outputs.
- results/plots/ – PCA, heatmap, volcano, and MA plots.
- Install Nextflow (https://www.nextflow.io/) and, optionally, conda or mamba.
- Run with the bundled toy data:
    nextflow run main.nf -profile local
  Or create an isolated environment:
    nextflow run main.nf -profile conda
- Outputs land in results/:
  - preprocessed/normalized_counts.csv
  - differential_expression.csv
  - actionable_hits.csv
  - summary.json
  - plots/: pca_samples.png, heatmap_top_genes.png, volcano.png, ma_plot.png
Pass your own files via params:
nextflow run main.nf \
--counts path/to/gene_counts.csv \
--metadata path/to/sample_metadata.csv \
--actionable path/to/actionable_genes.csv \
--annotations path/to/gene_annotations.tsv \
--outdir my_results
Expected formats:
- Counts: gene_id column (Ensembl or symbols) plus one column per sample.
- Metadata: sample_id and condition columns, with exactly two conditions (e.g., Tumor/Normal).
- Optional actionable list: gene_id plus any annotation columns you like.
- Optional annotations: table with gene_id and gene_symbol (common headers auto-detected).
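
To check your own files against these formats before a run, a small validator along the following lines may help. It is a sketch under assumptions: load_counts and load_conditions are hypothetical helpers rather than part of the pipeline, the inline CSV text is made up, and the header auto-detection mentioned above is not reproduced (exact column names are required here).

```python
import csv
import io


def load_counts(handle):
    """Read a counts CSV; return (sample names, {gene_id: [counts]})."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "gene_id":
        raise ValueError("first column of the counts CSV must be gene_id")
    samples = header[1:]
    counts = {row[0]: [float(x) for x in row[1:]] for row in reader if row}
    return samples, counts


def load_conditions(handle, samples):
    """Read a metadata CSV; return ({sample_id: condition}, sorted conditions)."""
    condition_of = {r["sample_id"]: r["condition"] for r in csv.DictReader(handle)}
    missing = [s for s in samples if s not in condition_of]
    if missing:
        raise ValueError(f"samples missing from metadata: {missing}")
    conditions = sorted({condition_of[s] for s in samples})
    if len(conditions) != 2:
        raise ValueError(f"expected exactly two conditions, found {conditions}")
    return condition_of, conditions


# Made-up inline CSVs standing in for real files.
counts_csv = "gene_id,S1,S2,S3,S4\nGENE_A,10,12,40,44\nGENE_B,90,88,60,56\n"
metadata_csv = "sample_id,condition\nS1,Normal\nS2,Normal\nS3,Tumor\nS4,Tumor\n"

samples, counts = load_counts(io.StringIO(counts_csv))
condition_of, conditions = load_conditions(io.StringIO(metadata_csv), samples)
print(samples, conditions)
```

With real files, pass open file handles (opened with newline="") instead of io.StringIO objects.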
- Differential expression uses log2 fold-change in the normalized space; adjust thresholds in bin/actionable_report.py or bin/plot_reports.py if desired.
- Plots use raw p-value <= 0.05 and |log2FC| >= 1 to color up/down (red/blue) and label the top 5 hits by p-value.
- The example data are minimal and meant for workflow illustration, not biological interpretation.
- Omit results*/, work/, and .nextflow* when committing or packaging.
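
The plot thresholds noted above (raw p-value <= 0.05 and |log2FC| >= 1) amount to a simple classification rule, sketched below. The field names log2fc and p_value, the helper names classify and top_hits, and the toy rows are assumptions for illustration; the scripts in bin/ may be organized differently. The sketch also assumes that the labeled top hits are drawn from the significant genes only, and it reuses the same rule for the actionable intersection.

```python
def classify(log2fc, p_value, p_cut=0.05, lfc_cut=1.0):
    """Return "up", "down", or "ns" (not significant) for one gene."""
    if p_value <= p_cut and log2fc >= lfc_cut:
        return "up"
    if p_value <= p_cut and log2fc <= -lfc_cut:
        return "down"
    return "ns"


def top_hits(results, n=5):
    """Return the n significant rows with the smallest p-values."""
    significant = [r for r in results if classify(r["log2fc"], r["p_value"]) != "ns"]
    return sorted(significant, key=lambda r: r["p_value"])[:n]


# Colors as described for the plots: up = red, down = blue.
COLORS = {"up": "red", "down": "blue"}

# Made-up differential expression rows.
results = [
    {"gene_id": "GENE_A", "log2fc": 1.9, "p_value": 0.01},
    {"gene_id": "GENE_B", "log2fc": -1.2, "p_value": 0.03},
    {"gene_id": "GENE_C", "log2fc": 0.4, "p_value": 0.001},  # small effect
    {"gene_id": "GENE_D", "log2fc": 2.5, "p_value": 0.20},   # not significant
]
actionable = {"GENE_B", "GENE_C"}  # toy actionable list

labels = [r["gene_id"] for r in top_hits(results)]
actionable_hits = [
    r["gene_id"]
    for r in results
    if classify(r["log2fc"], r["p_value"]) != "ns" and r["gene_id"] in actionable
]
print(labels, actionable_hits)
```

Raising lfc_cut or lowering p_cut makes the rule stricter and shortens both lists.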