4 changes: 4 additions & 0 deletions README.md
@@ -120,6 +120,10 @@ To load your own data into the Sequence Tube Map, see the guide to [Adding Your

Previously we provided a Docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach, either using the [online version](#online-version) or a full installation of the [local version](#local-version). However, if you would like to Dockerize the Sequence Tube Map, the repository includes a `Dockerfile`.

## Using tabix-based index files

For more information about using this faster alternative, see [README.tabix.md](README.tabix.md).

## Contributing

For information on how to develop on the Sequence Tube Map codebase, please see the [Development Guide](doc/development.md).
120 changes: 120 additions & 0 deletions README.tabix.md
@@ -0,0 +1,120 @@
## Tabix-based index files

Three files are used, each one bgzipped and indexed with tabix (which adds a `.tbi` file next to it):

1. `nodes.tsv.gz` contains the sequence of each node.
2. `pos.bed.gz` contains the position (as node intervals) of regions on each haplotype.
3. `haps.gaf.gz` contains the path followed by each haplotype (split in pieces).

Briefly, these three index files can be queried quickly to extract a subgraph covering a region of interest: the `pos.bed.gz` index first tells us which nodes are covered, then the `nodes.tsv.gz` index gives us the sequence of these nodes, and finally the `haps.gaf.gz` index gives us the haplotype pieces that traverse those nodes, which we stitch together.
This approach is implemented in the [`chunkix.py`](scripts/chunkix.py) script, which can produce either a GFA file or the files used by the sequenceTubeMap.
The sequenceTubeMap uses this script internally when given tabix-based index files.
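
The three-step lookup described above can be illustrated with a small, self-contained toy. This is only a sketch of the idea, not the `chunkix.py` implementation: the in-memory tables `POS`, `NODES` and `HAPS`, the function names, and the simplification that the nodes of a region form one contiguous ID range are all assumptions made for illustration. The real index files are tabix-indexed text files whose column layout is not reproduced here.

```python
# Toy stand-ins for the three indexes (hypothetical data and layout).
# POS: haplotype name -> list of (start, end, first_node, last_node)
POS = {"hapA": [(0, 100, 1, 3), (100, 200, 4, 6)]}
# NODES: node ID -> node sequence
NODES = {1: "ACG", 2: "T", 3: "GGA", 4: "C", 5: "TTA", 6: "G"}
# HAPS: list of (haplotype name, node IDs of one path piece)
HAPS = [
    ("hapA", [1, 2, 3]),
    ("hapA", [4, 5, 6]),
    ("hapB", [1, 3, 4]),
    ("hapB", [6]),
]


def nodes_covering(hap, start, end):
    """Step 1: node ID range of the intervals overlapping [start, end)."""
    hits = [(lo, hi) for s, e, lo, hi in POS.get(hap, []) if s < end and e > start]
    if not hits:
        return None
    return min(lo for lo, _ in hits), max(hi for _, hi in hits)


def extract_subgraph(hap, start, end):
    """Return (node sequences, haplotype pieces) for a region of one haplotype."""
    node_range = nodes_covering(hap, start, end)
    if node_range is None:
        return {}, []
    lo, hi = node_range
    # Step 2: fetch the sequence of every node in the range.
    seqs = {n: s for n, s in NODES.items() if lo <= n <= hi}
    # Step 3: keep the haplotype pieces that touch the range, trimmed to it.
    pieces = []
    for name, path in HAPS:
        kept = [n for n in path if lo <= n <= hi]
        if kept:
            pieces.append((name, kept))
    return seqs, pieces


if __name__ == "__main__":
    seqs, pieces = extract_subgraph("hapA", 20, 60)
    print(seqs)
    print(pieces)
```

In the real setting each step would be a tabix query against the corresponding file rather than a scan of an in-memory table, which is what keeps the extraction fast on a large pangenome.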

## Using tabix-based index files in the sequenceTubeMap

The version on this `tabix` branch can use those index files, for example when mounted files are provided:

- the `pos.bed.gz` index in the *graph* field
- the `nodes.tsv.gz` index in the *node* field
- the `haps.gaf.gz` index in the *haplotype* field

---

![](images/mount.tabix.index.png)

---

Once the index files are mounted, one can query any region on any haplotype in the form *HAPNAME_CONTIG:START-END*.
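
As an illustration of the region format, here is a minimal parsing sketch. It assumes only that the coordinates follow the last `:` and are separated by `-`; the function name `parse_region` is hypothetical and this is not the parser used by the sequenceTubeMap. The example region is the one set for the "HPRC Minigraph-Cactus v1.1" data set in `docker/config.json`.

```python
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'PATH:START-END' into (path name, start, end).

    The path name is everything before the last ':', so names that
    contain '#' or '_' are kept intact.
    """
    path, sep, coords = region.rpartition(":")
    if not sep or not path:
        raise ValueError(f"missing ':' or path name in region: {region!r}")
    start_s, sep, end_s = coords.partition("-")
    if not sep:
        raise ValueError(f"missing '-' in coordinates: {region!r}")
    start, end = int(start_s), int(end_s)
    if start < 0 or end < start:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    return path, start, end


if __name__ == "__main__":
    print(parse_region("GRCh38#0#chr17:7674450-7675333"))
```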

Other tracks, for example reads or annotations in bgzipped/indexed GAF files, can be added as *reads* in the menu.

---

![](images/mount.tabix.index.annot.png)

---

Of note, you can set a color for each track using the existing palettes or by picking a specific color.

---

![](images/mount.tabix.index.annot.color.png)

---

## Using the Docker container

A Docker image with this new sequenceTubeMap version and all the necessary dependencies is available at `quay.io/jmonlong/sequencetubemap:tabix_dev`.

To use it, run:

```sh
docker run -it -p 3210:3000 -v `pwd`:/data quay.io/jmonlong/sequencetubemap:tabix_dev
```

The `-p 3210:3000` option publishes the container's port 3000 on port 3210 of the host.
In practice, pick any unused host port.
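
If you are unsure which port is free, the following sketch asks the operating system for an unused one and prints the matching `docker run` command. It is a convenience example only: the helper names `find_free_port` and `docker_command` are hypothetical, and the port could in principle be taken by another process between the check and the start of the container.

```python
import os
import socket

IMAGE = "quay.io/jmonlong/sequencetubemap:tabix_dev"


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def docker_command(host_port: int, data_dir: str) -> list:
    """Build the docker run command, publishing container port 3000."""
    return [
        "docker", "run", "-it",
        "-p", f"{host_port}:3000",
        "-v", f"{data_dir}:/data",
        IMAGE,
    ]


if __name__ == "__main__":
    port = find_free_port()
    print(" ".join(docker_command(port, os.getcwd())))
    print(f"Then open: http://localhost:{port}/")
```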

Then open: http://localhost:3210/

Note: for mounted files, this assumes that all files (pangenomes, reads, annotations) are in the current working directory or one of its subdirectories, because that directory is what gets mounted as `/data` in the container.
To test with the prepared files, first download them all (see below).
Then either use them as *custom* data, adding the tracks with the *Configure Tracks* button, or select the prepared data set "HPRC Minigraph-Cactus v1.1".
The files for this data set are defined in the [config.json file](docker/config.json) used to build the Docker image.

## Available tabix-based index files for the Minigraph-Cactus v1.1 pangenome

Index files and some annotations have been deposited at https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/

To download it all:

```sh
# pangenome index files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz.tbi

# annotation files
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz.tbi
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz
wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz.tbi
```
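
After downloading, it can be useful to confirm that every `.gz` file has its `.tbi` index next to it, since both are needed for tabix queries. The sketch below is a simple check under that assumption; the function name `missing_index_files` is hypothetical, and the check does not verify that the files are complete or valid.

```python
import sys
from pathlib import Path


def missing_index_files(data_dir):
    """Return the names of the .tbi files missing for the .gz files in data_dir."""
    missing = []
    for gz in sorted(Path(data_dir).glob("*.gz")):
        tbi = gz.with_name(gz.name + ".tbi")
        if not tbi.exists():
            missing.append(tbi.name)
    return missing


if __name__ == "__main__":
    # Check the directory given as first argument, or the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_index_files(target):
        print(f"missing index: {name}")
```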

## Building tabix-based index files from a GFA

### Optional. Make a GFA from a GBZ file

In some cases, you will want to use exactly the same pangenome space as a specific GBZ file, for example to visualize reads or annotations on that pangenome.
The GFA provided in the HPRC repo might not match exactly because some nodes may have been split when making the GBZ file.
You can convert a GBZ to a GFA, without translating the nodes back to those of the original GFA, with:

```sh
vg convert --no-translation -f -t 4 hprc-v1.1-mc-grch38.gbz | gzip > hprc-v1.1-mc-grch38.gfa.gz
```

### Run the `pgtabix.py` python script

The `pgtabix.py` script can be found in the [`scripts` directory](scripts).
It's also present in the `/build/sequenceTubeMap/scripts` directory of the Docker container `quay.io/jmonlong/sequencetubemap:tabix_dev`.

```sh
python3 pgtabix.py -g hprc-v1.1-mc-grch38.gfa.gz -o output.prefix
```

Building the index files for the Minigraph-Cactus v1.1 pangenome takes about 1.5 to 2 hours.
The run time should scale linearly with the number of haplotypes.

## Making your own annotation files

To make your own annotation files, we have developed a pipeline to project annotation files at the haplotype level (e.g. BED, GFF) onto a pangenome (e.g. GBZ).
Once the projected GAF files are sorted, bgzipped and indexed, they can be queried fast, for example by sequenceTubeMap.

The pipeline is described in the [manuscript](https://jmonlong.github.io/manu-vggafannot/), and the scripts and documentation were deposited in [the GitHub repository](https://github.com/jmonlong/manu-vggafannot?tab=readme-ov-file).
In particular, examples of how annotation files were projected for this manuscript are described in [this section](https://github.com/jmonlong/manu-vggafannot/tree/main/analysis/annotate).
31 changes: 23 additions & 8 deletions docker/Dockerfile
@@ -6,22 +6,37 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true

ENV TZ=America/Los_Angeles


# install basic apt dependencies
# note: most vg apt dependencies are installed by "make get-deps" below
RUN apt-get -qq update && apt-get -qq install -y \
git \
wget \
less \
npm \
nano

git \
wget \
less \
npm \
nano \
make \
g++ \
gcc \
zlib1g-dev \
libbz2-dev \
liblzma-dev \
python3 \
build-essential

# install tabix/bgzip
RUN wget --quiet --no-check-certificate https://github.com/samtools/htslib/releases/download/1.21/htslib-1.21.tar.bz2 && \
tar -xjvf htslib-1.21.tar.bz2 && \
cd htslib-1.21 && \
./configure && \
make && make install

# install node
RUN npm cache clean -f

RUN npm install -g n && n stable

# download vg binary
RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.59.0/vg \
RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.64.1/vg \
&& mv vg /bin/vg && chmod +x /bin/vg

WORKDIR /build
29 changes: 25 additions & 4 deletions docker/config.json
@@ -13,6 +13,22 @@
"simplify": false,
"removeSequences": false
},
{
"name": "HPRC Minigraph-Cactus v1.1",
"tracks": [
{"trackFile": "/data/hprc.pos.bed.gz", "trackType": "graph", "trackColorSettings": {"mainPalette": "ygreys", "auxPalette": "greys"}},
{"trackFile": "/data/hprc.nodes.tsv.gz", "trackType": "node"},
{"trackFile": "/data/hprc.haps.gaf.gz", "trackType": "haplotype"},
{"trackFile": "/data/gene_exon.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "reds", "auxPalette": "reds"}},
{"trackFile": "/data/rm.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "blues", "auxPalette": "blues"}},
{"trackFile": "/data/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz",
"trackType": "read", "trackColorSettings": {"mainPalette": "plainColors", "auxPalette": "plainColors"}}
],
"region": "GRCh38#0#chr17:7674450-7675333",
"dataType": "built-in",
"simplify": false,
"removeSequences": false
},
{
"name": "Lancet example",
"tracks": [
@@ -48,6 +64,7 @@
}
],
"vgPath": [""],
"chunkixPath": ["/data", "scripts"],
"dataPath": "/data",
"internalDataPath": "exampleData/internal/",
"tempDirPath": "temp",
@@ -57,27 +74,31 @@
"defaultGraphColorPalette" : {
"mainPalette": "#000000",
"auxPalette": "greys",
"colorReadsByMappingQuality": false
"colorReadsByMappingQuality": false,
"alphaReadsByMappingQuality": false
},

"defaultHaplotypeColorPalette" : {
"mainPalette": "plainColors",
"auxPalette": "lightColors",
"colorReadsByMappingQuality": false
"colorReadsByMappingQuality": false,
"alphaReadsByMappingQuality": false
},

"defaultReadColorPalette" : {
"mainPalette": "blues",
"auxPalette": "reds",
"colorReadsByMappingQuality": false
"colorReadsByMappingQuality": false,
"alphaReadsByMappingQuality": false
},

"defaultTrackProps" : {
"trackType": "graph",
"trackColorSettings": {
"mainPalette": "#000000",
"auxPalette": "greys",
"colorReadsByMappingQuality": false
"colorReadsByMappingQuality": false,
"alphaReadsByMappingQuality": false
}
},

Binary file added images/mount.tabix.index.annot.color.png
Binary file added images/mount.tabix.index.annot.png
Binary file added images/mount.tabix.index.png