Open
39 commits
da37737
Vectorize report graphics
austintwang Apr 11, 2025
ac1112a
Additional hit distribution plots
austintwang Apr 12, 2025
2085399
Utility for collapsing overlapping hits
austintwang Apr 13, 2025
c456094
Documentation for `collapse-hits`
austintwang Apr 13, 2025
5bc44d6
Cache jit outputs
austintwang Apr 13, 2025
3618533
Hit intersection utility
austintwang Apr 14, 2025
0241df3
Additional motif trimming options
austintwang Apr 14, 2025
58ebbd8
Switch to fraction threshold in `collapse-hits`
austintwang Apr 29, 2025
8a99815
Confusion matrix visualization
austintwang May 1, 2025
2cb2bf2
Adjust plot padding
austintwang May 1, 2025
61182f7
Adjust confusion matrix calculation
austintwang May 1, 2025
1947e80
Refactor visualization and eval code
austintwang May 5, 2025
ff02820
Update gitignore
austintwang Aug 30, 2025
670e1fe
Code documentation
austintwang Aug 31, 2025
e9cd90f
Methods diagram
austintwang Aug 31, 2025
747908c
Figure width
austintwang Aug 31, 2025
92a67bf
Figure width
austintwang Aug 31, 2025
76cb5ab
Figure width
austintwang Aug 31, 2025
b9a17c4
Figure width
austintwang Aug 31, 2025
516cfc7
Figure width
austintwang Aug 31, 2025
9f60341
Figure width
austintwang Aug 31, 2025
fb58fac
TF-MoDISco capitalization
austintwang Sep 1, 2025
03fb6aa
Update license
austintwang Sep 1, 2025
abf970f
API docs
austintwang Sep 1, 2025
4c2f2fa
API docs
austintwang Sep 1, 2025
b4998f9
Fix missing dependencies
austintwang Sep 1, 2025
1736b28
Module-based entry point
austintwang Sep 1, 2025
6e065fd
Move defaults to functions
austintwang Sep 1, 2025
5a1489c
Link to API docs
austintwang Sep 1, 2025
36ed9ea
Define `__all__`
austintwang Sep 1, 2025
6f13030
README tweaks
austintwang Sep 1, 2025
fb962d4
Make dimension names consistent
austintwang Sep 2, 2025
2c9154f
Ensure that hits_unique is well-defined
austintwang Sep 9, 2025
821bcb2
Remove unneeded guard
austintwang Sep 9, 2025
55cb36a
Improve motif name sorting logic
austintwang Sep 11, 2025
56f21f4
Properly encode report URLs
austintwang Sep 11, 2025
ed38c9d
Histogram edge case fallback
austintwang Sep 14, 2025
9df34e8
README typo
austintwang Sep 14, 2025
d5429df
Install torch from pip
austintwang Sep 15, 2025
33 changes: 33 additions & 0 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
name: Generate API Documentation

on:
  push:
    branches: [ dev ]
  workflow_dispatch:

jobs:
  deploy-docs:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.x'

    - name: Install dependencies
      run: |
        pip install pdoc
        pip install -e .

    - name: Generate documentation
      run: |
        pdoc -d numpy --output-dir ./docs finemo

    - name: Deploy to GitHub Pages
      uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./docs
8 changes: 4 additions & 4 deletions .gitignore
@@ -1,7 +1,7 @@
.conda
.DS_Store
*.egg-info
__pycache__
/.*
!/.github
/notebooks
/notebooks/old
/scratch.txt
/scratch.txt
/scratch
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Austin Wang
Copyright (c) 2025 Austin Wang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
120 changes: 98 additions & 22 deletions README.md
@@ -1,6 +1,38 @@
# finemo_gpu
# Fi-NeMo: Finding Neural Network Motifs

**Fi-NeMo** (**Fi**nding **Ne**ural network **Mo**tifs) is a GPU-accelerated hit caller for identifying occurrences of TFMoDISCo motifs within contribution scores generated by machine learning models.
**Fi-NeMo** (**Fi**nding **Ne**ural Network **Mo**tifs) is a GPU-accelerated motif instance calling tool for identifying transcription factor binding sites from neural network contribution scores.

## Overview

Fi-NeMo implements a competitive optimization approach using proximal gradient descent to identify motif instances by solving a sparse linear reconstruction problem. Unlike traditional sequence-based methods, Fi-NeMo leverages context-aware importance scores from deep neural networks to comprehensively map transcription factor binding sites, enabling the identification of both high-confidence canonical motifs and low-prevalence cofactor motifs that are often missed by conventional approaches.

The algorithm represents contribution scores as weighted combinations of motif contribution weight matrices (CWMs) at specific genomic positions. This competitive assignment process more closely reflects the biological reality of transcription factors competing for binding sites, resulting in superior sensitivity and specificity compared to sequence-only methods.

### Features

- **GPU-accelerated optimization**: Fast processing of large contribution score datasets using PyTorch
- **Competitive motif assignment**: Biologically-motivated algorithm that resolves similar motifs
- **Context-aware analysis**: Leverages neural network importance scores for improved sensitivity and specificity
- **Comprehensive evaluation**: Built-in tools for assessing and visualizing motif discovery quality and hit calling performance
- **Multiple input formats**: Support for bigWig, HDF5, and TF-MoDISco output formats

## Method

Fi-NeMo solves motif instance calling as an optimization problem that reconstructs contribution score tracks as sparse linear combinations of motif CWMs, formulated as an L1-regularized linear model. This competitive assignment encourages overlapping motif instances to be resolved in a meaningful way, with stronger matches receiving higher coefficients while weaker or redundant matches are suppressed.
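
As a rough illustration of the formulation above, the following sketch solves a toy L1-regularized least-squares problem with proximal gradient descent (ISTA) in NumPy. It is not Fi-NeMo's implementation: the dense matrix `A` (standing in for motif CWMs placed at candidate positions), the function names, and the fixed step size are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def ista(A, y, lam, n_steps=500):
    """Minimize 0.5 * ||y - A x||^2 + lam * ||x||_1 by proximal gradient descent."""
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        grad = A.T @ (A @ x - y)  # gradient of the smooth least-squares term
        x = soft_threshold(x - step * grad, step * lam)
    return x


# Toy problem: 20 hypothetical candidate "hits", only two of which are active.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[[3, 11]] = [2.0, -1.5]
y = A @ x_true

x_hat = ista(A, y, lam=0.5)
```

Raising `lam` drives more coefficients to exactly zero, which is the sense in which an L1 regularization weight controls the sparsity of the solution.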

<div align="center">
<img src="/assets/methods.svg" width="400">
</div>

## References

Fi-NeMo is described in:
> Tseng, Ramalingam, Wang, Schreiber, et al. "Decoding predictive motif lexicons and syntax from deep learning models of transcription factor binding profiles." (manuscript in preparation)

Related tools:
- [TF-MoDISco](https://github.com/jmschrei/tfmodisco-lite): *De novo* motif discovery from importance scores
- [BPNet](https://github.com/kundajelab/bpnet-refactor): Deep learning models for TF binding prediction
- [ChromBPNet](https://github.com/kundajelab/chrombpnet): Deep learning models for chromatin accessibility prediction

## Installation

@@ -19,7 +51,7 @@ cd finemo_gpu

#### Create a Conda Environment with Dependencies

This step is optional but recommended
This step is optional but recommended for conda users.

```sh
conda env create -f environment.yml -n $ENV_NAME
@@ -58,9 +90,19 @@ Recommended:

- Peak region coordinates in uncompressed [ENCODE NarrowPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) format.

## Usage
## API Documentation

For Fi-NeMo's Python API documentation, see: https://www.austintwang.com/finemo_gpu/finemo.html

Fi-NeMo includes a command-line utility named `finemo`. Here, we describe basic usage for each subcommand. For all options, run `finemo <subcommand> -h`.
## Command-Line Usage

Fi-NeMo provides a command-line utility named `finemo` for motif instance calling and analysis. The typical workflow involves three main steps:

1. **Preprocessing**: Transform input contributions and sequences into a unified format
2. **Hit Calling**: Identify motif instances using the Fi-NeMo algorithm
3. **Reporting and Analysis**: Generate visualizations and perform post-processing

For detailed options for any subcommand, run `finemo <subcommand> -h`.
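
The three steps can be sketched as a small Python helper that assembles the command lines without running them. Only flags that appear in the usage strings in this README are used. All file names are hypothetical placeholders, and passing the `hits.tsv` path to `-H` is an assumption; check `finemo <subcommand> -h` before relying on it.

```python
import shlex


def build_workflow(sequences, attributions, modisco_h5, out_dir):
    """Return the preprocessing, hit-calling, and report commands as argument lists."""
    regions = f"{out_dir}/regions.npz"      # hypothetical output of preprocessing
    hits_dir = f"{out_dir}/hits"            # hypothetical hit-calling output directory
    report_dir = f"{out_dir}/report"        # hypothetical report output directory
    return [
        ["finemo", "extract-regions-modisco-fmt",
         "-s", sequences, "-a", attributions, "-o", regions],
        ["finemo", "call-hits", "-r", regions, "-m", modisco_h5, "-o", hits_dir],
        # Assumption: -H takes the hits.tsv file written by call-hits.
        ["finemo", "report", "-r", regions, "-H", f"{hits_dir}/hits.tsv",
         "-o", report_dir, "-m", modisco_h5],
    ]


commands = build_workflow("sequences.npz", "attributions.npz", "modisco.h5", "out")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.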

### Preprocessing

@@ -126,14 +168,14 @@ Usage: `finemo extract-regions-modisco-fmt -s <sequences> -a <attributions> -o <

#### `finemo call-hits`

Identify hits in input regions using TFMoDISCo CWM's.
Identify motif instances in input regions using the Fi-NeMo competitive optimization algorithm. This is the core functionality that leverages TF-MoDISco CWMs to find motif occurrences in contribution score data.

Usage: `finemo call-hits -r <regions> -m <modisco_h5> -o <out_dir> [-p <peaks>] [-t <cwm_trim_threshold>] [-l <global_lambda>] [-b <batch_size>] [-J]`

- `-r/--regions`: A `.npz` file of input sequences, contributions, and coordinates. Created with a `finemo extract-regions-*` command.
- `-m/--modisco-h5`: A tfmodisco-lite output H5 file of motif patterns.
- `-o/--out-dir`: The path to the output directory.
- `-t/--cwm-trim-threshold`: The threshold to determine motif start and end positions within the full CWMs. Default is 0.3.
- `-t/--cwm-trim-threshold`: The threshold to determine motif start and end positions within the full CWMs. Default is 0.3. If you need finer control over motif trimming, check out the `-T/--cwm-trim-thresholds` and `-R/--cwm-trim-coords` options.
- `-l/--global-lambda`: The L1 regularization weight determining the sparsity of hits. Default is 0.7.
- `-b/--batch-size`: The batch size used for optimization. Default is 2000.
- `-J/--compile`: Enable JIT compilation for faster execution. This option may not work on older GPUs.
@@ -156,7 +198,7 @@ Usage: `finemo call-hits -r <regions> -m <modisco_h5> -o <out_dir> [-p <peaks>]
- `peak_name`: The name of the peak region containing the hit, taken from the `name` field of the input peak data. `NA` if peak coordinates are not provided.
- `peak_id`: The numerical index of the peak region containing the hit.

`hits_unique.tsv`: A deduplicated list of hits in the same format as `hits.tsv`. In cases where peak regions overlap, `hits.tsv` may list multiple instances of a hit, each linked to a different peak. `hits_unique.tsv` arbitrarily selects one instance per duplicated hit. This file is generated only if peak coordinates are provided.
`hits_unique.tsv`: A deduplicated list of hits in the same format as `hits.tsv`. In cases where peak regions overlap, `hits.tsv` may list multiple instances of a hit, each linked to a different peak. `hits_unique.tsv` arbitrarily selects one instance per duplicated hit. **This file is empty if peak coordinates are not provided.**
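
A minimal sketch of this kind of deduplication, assuming each hit is a dictionary and that a hit's identity is given by the hypothetical key fields `chr`, `start`, `end`, `motif_name`, and `strand` (the key Fi-NeMo actually uses is not shown here). It keeps the first instance it encounters, which is one way of making an arbitrary choice.

```python
def deduplicate_hits(hits, key_fields=("chr", "start", "end", "motif_name", "strand")):
    """Keep one record per distinct hit, ignoring which peak it was linked to."""
    seen = set()
    unique = []
    for hit in hits:
        key = tuple(hit[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# The same hit reported under two overlapping peaks, plus one distinct hit.
hits = [
    {"chr": "chr1", "start": 100, "end": 110, "motif_name": "A", "strand": "+", "peak_id": 0},
    {"chr": "chr1", "start": 100, "end": 110, "motif_name": "A", "strand": "+", "peak_id": 1},
    {"chr": "chr1", "start": 300, "end": 312, "motif_name": "B", "strand": "-", "peak_id": 1},
]
unique_hits = deduplicate_hits(hits)
```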

`hits.bed`: A coordinate-sorted BED file of unique hits. It includes:

@@ -195,20 +237,37 @@ Usage: `finemo call-hits -r <regions> -m <modisco_h5> -o <out_dir> [-p <peaks>]

`params.json`: The parameters used for hit calling.

#### Additional notes
#### Parameter Guidelines

**Sensitivity Control (`-l/--global-lambda`)**
- Controls sparsity and sensitivity of hit calling
- Higher values (e.g., 0.8-0.9) → fewer, higher-confidence hits
- Lower values (e.g., 0.5-0.6) → more sensitive, may include weaker hits
- Default of 0.7 works well for chromatin accessibility data
- ChIP-seq data may benefit from lower values (0.6)

**Motif Trimming (`-t/--cwm-trim-threshold`)**
- Determines where motif boundaries are set within full CWMs
- Lower values → more conservative trimming, longer motifs
- Higher values → more aggressive trimming, shorter core motifs
- Affects resolution of closely-spaced motif instances

**Performance Optimization (`-b/--batch-size`, `-J`)**
- Set batch size to utilize available GPU memory efficiently
- Reduce batch size if you encounter out-of-memory errors
- Enable JIT compilation (`-J`) for faster execution on newer GPUs
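
One way to act on the out-of-memory advice is to retry with a smaller batch size. The wrapper below is a hypothetical sketch, not a Fi-NeMo feature: `run_batch` stands in for whatever call performs hit calling at a given batch size, and `oom_error` defaults to the built-in `MemoryError` so the example runs without a GPU. With PyTorch, the CUDA out-of-memory exception class would be passed instead.

```python
def run_with_backoff(run_batch, batch_size, min_batch_size=1, oom_error=MemoryError):
    """Call run_batch(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return run_batch(batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


def fake_run(batch_size):
    """Stand-in workload that 'runs out of memory' above 600 regions per batch."""
    if batch_size > 600:
        raise MemoryError("simulated out-of-memory")
    return f"processed with batch size {batch_size}"


result, used_batch_size = run_with_backoff(fake_run, batch_size=2000)
```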

- The `-l/--global-lambda` parameter controls the sensitivity of the hit-calling algorithm, with higher values resulting in fewer but more confident hits. This parameter represents the minimum cosine similarity between a query contribution score window and a CWM to be considered a hit. The default value of 0.7 typically works well for chromatin accessibility data. ChIP-Seq data may require a lower value (e.g. 0.6).
- The `-t/--cwm-trim-threshold` parameter sets the maximum relative contribution score in trimmed-out CWM flanks. If you find that motif flanks are being trimmed too aggressively, consider lowering this value. However, a too-low value may result in closely-spaced motif instances being missed.
- Set `-b/--batch-size` to fill a significant fraction of your GPU memory. **If you encounter GPU out-of-memory errors, try lowering this value.**
- Legacy TFMoDISCo H5 files can be updated to the newer TFMoDISCo-lite format with the `modisco convert` command found in the [tfmodisco-lite](https://github.com/jmschrei/tfmodisco-lite/tree/main) package.
- The hit-calling thresholding procedure is scale-invariant. That is, whether a position is assigned a hit depends on the shapes of the motif CWM and the contribution scores, not the absolute magnitude of the scores. If you wish to prioritize hits based on the magnitude of the contribution scores, set a per-motif rank threshold on the `hit_coefficient_global` field in the `hits.tsv` file, which captures both the absolute importance and the closeness of match.
#### Important Notes

### Output reporting
- **Scale Invariance**: Hit calling depends on motif and contribution score shapes, not absolute magnitudes. Use `hit_coefficient_global` or `hit_importance` for importance-based thresholding.
- **Legacy Format Support**: Convert older TF-MoDISco files using `modisco convert` from [tfmodisco-lite](https://github.com/jmschrei/tfmodisco-lite).
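
Scale invariance can be illustrated with cosine similarity, a measure of shape agreement that ignores magnitude. The sketch below uses random arrays in place of a real CWM and contribution score window; the array shapes and the helper name are assumptions for this example, and it does not reproduce the optimizer's actual thresholding.

```python
import numpy as np


def cosine_similarity(window, cwm):
    """Cosine of the angle between a flattened score window and a flattened CWM."""
    w = window.ravel()
    c = cwm.ravel()
    return float(w @ c / (np.linalg.norm(w) * np.linalg.norm(c)))


rng = np.random.default_rng(0)
cwm = rng.normal(size=(4, 10))                   # hypothetical 4 x 10 motif
window = cwm + 0.3 * rng.normal(size=(4, 10))    # noisy instance of the motif

sim = cosine_similarity(window, cwm)
sim_scaled = cosine_similarity(10.0 * window, cwm)  # same shape, 10x magnitude
```

Here `sim` and `sim_scaled` agree up to floating-point error, so a shape-based criterion treats both windows identically. Ranking by a magnitude-aware field such as `hit_coefficient_global` is what separates them.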

### Output reporting and post-processing

#### `finemo report`

Generate an HTML report (`report.html`) visualizing TF-MoDISCo seqlet recall and hit distributions.
If `-n/--no-recall` is not set, the regions used for hit calling must exactly match those used during the TF-MoDISCo motif discovery process.
Generate an HTML report (`report.html`) visualizing TF-MoDISco seqlet recall and hit distributions.
If `-n/--no-recall` is not set, the regions used for hit calling must exactly match those used during the TF-MoDISco motif discovery process.
This command does not utilize the GPU.

Usage: `finemo report -r <regions> -H <hits> -o <out_dir> [-m <modisco_h5>] [-W <modisco_region_width>] [-n]`
@@ -220,12 +279,29 @@ Usage: `finemo report -r <regions> -H <hits> -o <out_dir> [-m <modisco_h5>] [-W
- `-W/--modisco-region-width`: The width of the region around each peak summit used by tfmodisco-lite. Default is 400.
- `-n/--no-recall`: Do not compute motif recall metrics. Default is False.

#### Additional outputs
Additional report outputs:

- `motif_report.tsv`: Statistics on the distribution of hits per motif. The columns and values correspond to those in the HTML report's table.
- `motif_occurrences.tsv`: The number of hits of each motif in each input region. Also includes the total number of hits per region.
- `CWMs`: A directory containing visualizations of motif CWMs, as well as corresponding tables with numerical CWM values.
- `seqlets.tsv`: TF-MoDISco seqlet coordinates for each motif in each region. Only generated if `-m/--modisco-h5` is provided.

#### `finemo collapse-hits`

Identify the best hits by motif similarity within groups of overlapping hits. Adds a 0/1 `is_primary` column to the `hits.tsv` file, indicating whether a hit is the best hit in its group. This command does not utilize the GPU.

Usage: `finemo collapse-hits -i <hits> -o <out_path> [-O <overlap_frac>]`

- `-i/--hits`: The path to the input hits file. This should be the `hits.tsv` or `hits_unique.tsv` file generated by the `finemo call-hits` command.
- `-o/--out-path`: The path to the output file. This will be a copy of the input file with an additional `is_primary` column.
- `-O/--overlap-frac`: The minimum fraction overlap required for two hits to be considered overlapping. Precisely, given two hits of lengths `x` and `y`, the minimum number of overlapping bases is `overlap_frac * (x + y) / 2`. Default is 0.2.
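
The overlap threshold formula can be worked through in a few lines. This sketch assumes half-open `[start, end)` coordinates and treats the threshold as inclusive (`>=`); both are assumptions, and the function names are hypothetical.

```python
def min_overlap_bases(len_x, len_y, overlap_frac=0.2):
    """Minimum shared bases for two hits of the given lengths to count as overlapping."""
    return overlap_frac * (len_x + len_y) / 2


def hits_overlap(start_a, end_a, start_b, end_b, overlap_frac=0.2):
    """True if two half-open intervals share at least the required number of bases."""
    shared = min(end_a, end_b) - max(start_a, start_b)
    required = min_overlap_bases(end_a - start_a, end_b - start_b, overlap_frac)
    return shared >= required


# Hits of lengths 10 and 20 need 0.2 * (10 + 20) / 2 = 3 shared bases.
overlapping = hits_overlap(100, 110, 105, 125)      # 5 shared bases
not_overlapping = hits_overlap(100, 110, 108, 128)  # 2 shared bases
```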

`motif_report.tsv`: Statistics on the distribution of hits per motif. The columns and values correspond to those in the HTML report's table.
#### `finemo intersect-hits`

`motif_occurrences.tsv`: The number of hits of each motif in each input region. Also includes the total number of hits per region.
Find the intersection of hits across multiple runs. This command does not utilize the GPU.

`CWMs`: A directory containing visualizations of motif CWMs, as well as corresponding tables with numerical CWM values.
Usage: `finemo intersect-hits -i <hits> -o <out_path> [-r]`

`seqlets.tsv`: tf-modisco seqlet coordinates for each motif in each region. Only generated if `-m/--modisco-h5` is provided.
- `-i/--hits`: The paths to one or more input hits files. Each should be a `hits.tsv` or `hits_unique.tsv` file generated by the `finemo call-hits` command.
- `-o/--out-path`: The path to the output file. Recurring columns are suffixed with the positional index of the input file (e.g. `hit_importance_1`), except for the file at index 0.
- `-r/--relaxed`: By default, the intersection assumes consistent input region definitions (name and coordinates) and motif trimming across runs. The relaxed intersection criterion instead uses only motif names and untrimmed hit coordinates, so it is not suitable when hit genomic coordinates are unknown. Default is False.
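
The column-suffixing rule for `-o/--out-path` can be sketched as follows. The helper name and the choice of key columns are hypothetical; which columns Fi-NeMo treats as join keys is not stated here.

```python
def suffix_columns(columns, file_index, key_columns):
    """Rename non-key columns for the input file at position file_index."""
    if file_index == 0:
        return list(columns)  # the first input keeps its column names
    return [
        col if col in key_columns else f"{col}_{file_index}"
        for col in columns
    ]


columns = ["chr", "start", "end", "motif_name", "hit_importance"]
keys = {"chr", "start", "end", "motif_name"}  # assumed join keys

first = suffix_columns(columns, 0, keys)
second = suffix_columns(columns, 1, keys)
```

Under these assumptions, `second` ends with `hit_importance_1` while `first` is unchanged.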
1 change: 1 addition & 0 deletions assets/methods.svg
11 changes: 6 additions & 5 deletions environment.yml
@@ -1,13 +1,10 @@
channels:
- pytorch
- nvidia
- conda-forge
- bioconda
- nodefaults
dependencies:
- pytorch=2.5.1
- pytorch-cuda=12.4
- python=3.11
- numba=0.61.2
- numpy=2.2.0
- scipy=1.14.1
- polars=1.17.1
@@ -17,4 +14,8 @@ dependencies:
- tqdm=4.67.1
- jinja2=3.1.4
- pybigwig=0.3.23
- pyfaidx=0.8.1.3
- pyfaidx=0.8.1.3
- jaxtyping=0.3.2
- pip=25.2
- pip:
  - torch==2.5.1
11 changes: 8 additions & 3 deletions pyproject.toml
@@ -6,25 +6,27 @@ build-backend = "setuptools.build_meta"
name = "finemo"
description = "Identification of regulatory elements from neural network contribution scores for DNA."
keywords = ["deep learning", "genomics"]
version = "0.30"
version = "0.40"
readme = "README.md"
license = {file = "LICENSE"}
authors = [
{name = "Austin Wang", email = "austin.wang1357@gmail.com"},
{name = "Anshul Kundaje"}
{name = "Anshul Kundaje", email = "akundaje@stanford.edu"}
]
dependencies = [
"numpy",
"scipy",
"torch",
"numba",
"polars>=1.0",
"matplotlib",
"h5py",
"hdf5plugin",
"tqdm",
"pyBigWig",
"pyfaidx",
"jinja2"
"jinja2",
"jaxtyping"
]

[project.scripts]
@@ -33,3 +35,6 @@ finemo = "finemo.main:cli"
[project.urls]
Homepage = "https://github.com/austintwang/finemo_gpu"
Repository = "https://github.com/austintwang/finemo_gpu.git"

[tool.ruff]
ignore = ["F722"]
2 changes: 1 addition & 1 deletion setup.py
@@ -3,4 +3,4 @@
# Empty setup.py for compatibility with pip<21.1
# See pyproject.toml for package configuration

setup()
setup()
80 changes: 80 additions & 0 deletions src/finemo/__init__.py
@@ -0,0 +1,80 @@
"""Fi-NeMo: Finding Neural Network Motifs.

A GPU-accelerated motif instance calling tool for identifying transcription factor
binding sites from neural network contribution scores.

Fi-NeMo implements a competitive optimization approach using proximal gradient descent
to identify motif instances by solving a sparse linear reconstruction problem. The
algorithm represents contribution scores as weighted combinations of motif contribution
weight matrices (CWMs) at specific genomic positions.

Key Features
------------
- GPU-accelerated hit calling using PyTorch
- Support for multiple input formats (bigWig, HDF5, TF-MoDISco)
- Competitive motif instance assignment
- Comprehensive evaluation and visualization tools
- Post-processing utilities for hit refinement

Modules
-------
- hitcaller : Core Fi-NeMo algorithm implementation
- data_io : Data input/output utilities
- main : Command-line interface
- evaluation : Performance assessment tools
- visualization : Plotting and report generation
- postprocessing : Hit refinement and analysis

Examples
--------
Basic hit calling workflow:

>>> import numpy as np
>>> import finemo
>>> from finemo import data_io, hitcaller
>>>
>>> # Load preprocessed data
>>> sequences, contribs, peaks_df, has_peaks = data_io.load_regions_npz('regions.npz')
>>> cwms, trim_masks = data_io.load_motif_cwms('motifs.h5')
>>>
>>> # Call hits
>>> hits_df, qc_df = hitcaller.fit_contribs(
... cwms=cwms,
... contribs=contribs,
... sequences=sequences,
... cwm_trim_mask=trim_masks,
... use_hypothetical=False,
... lambdas=np.array([0.7] * len(cwms)),
... step_size_max=3.0,
... step_size_min=0.08,
... sqrt_transform=False,
... convergence_tol=0.0005,
... max_steps=10000,
... batch_size=1000,
... step_adjust=0.7,
... post_filter=True,
... device=None,
... compile_optimizer=False
... )

See Also
--------
TF-MoDISco : https://github.com/jmschrei/tfmodisco-lite
BPNet : https://github.com/kundajelab/bpnet-refactor
ChromBPNet: https://github.com/kundajelab/chrombpnet
"""

from . import data_io
from . import hitcaller
from . import evaluation
from . import visualization
from . import postprocessing
from . import main

__all__ = [
"data_io",
"hitcaller",
"evaluation",
"visualization",
"postprocessing",
"main",
]
8 changes: 8 additions & 0 deletions src/finemo/__main__.py
@@ -0,0 +1,8 @@
"""
Entry point for running finemo's CLI as a module via 'python -m finemo'.
"""

from .main import cli

if __name__ == "__main__":
    cli()