Local reference datasets

Kjetil Klepper edited this page Jan 25, 2021 · 52 revisions

Background

Galaxy tools can access shared reference datasets that are described in Data Tables. A data table can be specific to a single tool or be used by several tools. For instance, the data tables "all_fasta" and "gene_annotations" contain, respectively, full genome sequences in FASTA format and feature annotations in GFF/GTF format, and they are both used by many different tools. On the other hand, the table "bowtie2_indexes" contains index-files that are specific to the Bowtie2 tool. The contents of a data table are described in location files (or ".loc"-files), which usually provide the identifier, name and genome build for each dataset in addition to the full path pointing to the location where the dataset can be found. Usually, a data table is made up of multiple location files that are merged together into a single table. If you have access to the Admin page in Galaxy, you can click on "Data Tables" in the left panel to see an overview of all registered data tables and the corresponding loc-files they are made up of. If you click on the name of a table in the left-most column you can also see the full contents of the table.

On UseGalaxy.no, we have two repositories with reference datasets that are shared via CVMFS. The first, located in the directory "/cvmfs/data.galaxyproject.org/", is prepared and maintained by the greater Galaxy community. The second, located in "/cvmfs/data.usegalaxy.no/", is a local repository maintained by Elixir Norway to host reference datasets that we have prepared ourselves.

Adding new datasets to the local repository

There are two ways to add new reference datasets to our local CVMFS repository:

  1. Manually adding datasets and updating configuration files
  2. Using a Data Manager to add reference datasets

Manually adding datasets and updating configuration files

Preparations

Before you start, you need to have the reference dataset itself and you need to know which data table it should be added to.

Prepare the reference dataset in advance so that it is easily accessible. This could simply mean finding the URL for a reference dataset online or optionally downloading it. For tool-specific indexes, you will probably have to install the tool somewhere yourself and run a specific command in order to create the index file(s).

If you don't know the name (and format) of the data table, you can find it from the wrapper of a tool that uses the table. Select the tool in Galaxy's tool panel, then click the "Options" button in the top-right corner of the tool execution form and select "See in Tool Shed". This will take you to the tool's repository in the tool shed it was installed from. Click the "Repository Actions" button in the top-right corner and select "Browse repository tip files". This will allow you to see the individual files included with the tool wrapper. Tools that use reference datasets should have a file called tool_data_table_conf.xml.sample. Click on this to view the file. The name of the data table can be found here. (Note that some tools may use more than one table). The wrapper should also include a sample ".loc"-file in the "tool-data" subdirectory.
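If you have downloaded the wrapper's tool_data_table_conf.xml.sample file, the table names and their column lists can also be pulled out on the command line. The following is a minimal sketch, not part of the Galaxy tooling: the function name "list_data_tables" is made up for this example, and it assumes the usual layout where the "<table ...>" tag and the "<columns>" element each sit on a single line.

```shell
# Print "<table name>: <column list>" for every data table defined in a
# tool_data_table_conf.xml(.sample) file.
# Assumption: <table ...> and <columns>...</columns> are each on one line.
list_data_tables() {
    conf_file="$1"
    awk '
        /<table / {
            # Extract the value of the name="..." attribute.
            if (match($0, /name="[^"]+"/)) {
                name = substr($0, RSTART + 6, RLENGTH - 7)
            }
        }
        /<columns>/ {
            cols = $0
            sub(/.*<columns>/, "", cols)
            sub(/<\/columns>.*/, "", cols)
            print name ": " cols
        }
    ' "$conf_file"
}

# Example call:
#   list_data_tables tool_data_table_conf.xml.sample
```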

Adding the reference dataset

  1. Log in as "sysadmin" to the VM containing the local reference CVMFS (Howto?). The name of the VM is data.usegalaxy.no for the production server and data.test.usegalaxy.no for the test server.

  2. Become root and make the CVMFS repository writable with the following commands:

sudo -s
cvmfs_server transaction

  3. Create a new directory (if necessary) and add the data to the CVMFS in the right location.

cd /cvmfs/data.usegalaxy.no/byhand/

(Note: On the test server, the second-level directory is named "data.test.usegalaxy.no" instead).

Reference datasets under the "byhand" directory are usually organized first by genome build and then by data table. E.g. HISAT2 indexes for hg38 are located under "hg38/hisat2_index/". Whole genome sequences in FASTA format are usually placed in a directory called "seq", e.g. "hg38/seq/". If you have more than one reference dataset for the same genome and each dataset contains multiple files, you can add a third directory level to keep them separate. For instance, we have two RNA STAR reference datasets for SeaLice that were generated with different parameter settings, so these were placed in "lsal01/star_index/lsalatl2s_o75" and "lsal01/star_index/lsalatl2s_o99", respectively.

Create a new directory for the genome build and data table (and dataset), if needed, and download or copy the reference dataset file(s) into this directory. Make sure the files and directories are readable by all! Note that the data in the local CVMFS will not be version-controlled by Git, but we will make weekly backups of the repository.
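As an illustration of the directory layout and the "readable by all" requirement, here is a small sketch. The helper name "install_reference_files" is hypothetical, and the base directory is passed as an argument so the same function can be tried out on a scratch directory before using it on "/cvmfs/data.usegalaxy.no/byhand".

```shell
# Copy files into <base_dir>/<build>/<table>[/<dataset>] and make the
# result world-readable.
install_reference_files() {
    base_dir="$1"; rel_dir="$2"; shift 2
    target="$base_dir/$rel_dir"
    mkdir -p "$target" || return 1
    cp -- "$@" "$target"/ || return 1
    # a+r : readable by everyone.
    # X   : add execute (traverse) permission on directories, and on files
    #       that are already executable, but not on plain data files.
    # Applied to the whole genome build directory (first path component).
    chmod -R a+rX "$base_dir/${rel_dir%%/*}"
}

# Example call (the source file name is a placeholder):
#   install_reference_files /cvmfs/data.usegalaxy.no/byhand hg38/seq hg38.fa
```

Afterwards, "find <directory> ! -perm -444" lists anything that is still not readable by all; it should print nothing.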

  4. Next, you must add a new entry for the reference dataset in the location file for the data table.

cd /cvmfs/data.usegalaxy.no/byhand/location/

Find the correct loc-file for the data table in this directory, and open it in an available text editor of your choice to add a new entry. The format (column definitions) of the loc-file is usually described in the file itself. This normally includes a unique identifier for the dataset (often called value), the genome build (dbkey), a name (which will be displayed to the user) and the path pointing to the dataset, as shown in the example loc-file below.
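Instead of typing the entry in an editor, it can be appended with "printf", which guarantees real TAB characters between the columns. This is only a sketch for the four-column layout (value, dbkey, name, path) used by "all_fasta"; the function name "add_fasta_entry" is made up for this example, and other tables may need a different number or order of columns.

```shell
# Append one entry to a four-column loc-file (value, dbkey, name, path).
add_fasta_entry() {
    loc_file="$1"; value="$2"; dbkey="$3"; name="$4"; path="$5"
    # Refuse to add a second entry with the same unique identifier (column 1).
    if [ -f "$loc_file" ] && cut -f 1 "$loc_file" | grep -qxF -- "$value"; then
        echo "entry '$value' already exists in $loc_file" >&2
        return 1
    fi
    # \t in the printf format is written to the file as a literal TAB.
    printf '%s\t%s\t%s\t%s\n' "$value" "$dbkey" "$name" "$path" >> "$loc_file"
}

# Example call (identifier, name and path are placeholders):
#   add_fasta_entry all_fasta.loc hg38 hg38 "Human (Homo sapiens): hg38" \
#       /cvmfs/data.usegalaxy.no/byhand/hg38/seq/hg38.fa
```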

# all_fasta.loc.sample
# This file has the format (white space characters are TAB characters):
# <unique_build_id>  <dbkey>  <display_name>  <file_path>
#
apiMel3      apiMel3   Honeybee (Apis mellifera): apiMel3      /path/to/genome/apiMel3/apiMel3.fa
hg19canon    hg19      Human (Homo sapiens): hg19 Canonical    /path/to/genome/hg19/hg19canon.fa
hg19full     hg19      Human (Homo sapiens): hg19 Full         /path/to/genome/hg19/hg19full.fa

The path should always be absolute (starting with "/"), but different tables have different ways of defining the path. If the reference dataset only contains a single file, the path should point directly to this file. If the reference dataset contains multiple files, the path would normally point to the parent directory of the files. However, some tools use other alternatives. For instance, reference datasets for Bowtie2 and HISAT2 consist of multiple files that have the same name but with different suffixes. For these tools, the path should be the full absolute path up to and including the common name-prefix (so if you run the command "ls /path_with_name_prefix*" you should see all the reference files listed).
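The three path styles (single file, parent directory, common name prefix) can be checked with one small test before the entry goes into the loc-file. The sketch below is an illustration only, and "loc_path_exists" is a hypothetical helper name.

```shell
# Succeed if a loc-file path resolves to something on disk: a file, a
# directory, or a name prefix shared by at least one file (the style used
# for Bowtie2 and HISAT2 indexes).
loc_path_exists() {
    path="$1"
    [ -e "$path" ] && return 0
    # Prefix case: expand "<path>*" and test the first match. If nothing
    # matches, the pattern is left unexpanded and the -e test fails.
    set -- "$path"*
    [ -e "$1" ]
}

# Example call (the path is a placeholder):
#   loc_path_exists /cvmfs/data.usegalaxy.no/byhand/hg38/hisat2_index/hg38 \
#       || echo "path not found"
```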

Note that the columns in the loc-file must be separated with TABs. It is a good idea to check afterwards that the file is correctly formatted, for instance by using the command "cut -f N <file>" to extract the columns one by one (start with N=1 and increase it).
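The column-by-column check with "cut" can be complemented by counting the TAB-separated fields on every line. A minimal sketch, assuming that comment lines start with "#" and that you know how many columns the table expects; the function name "check_loc_columns" is made up for this example.

```shell
# Report every data line in a loc-file that does not have the expected
# number of TAB-separated columns. Exit status is non-zero if any are found.
check_loc_columns() {
    loc_file="$1"; expected="$2"
    awk -F '\t' -v n="$expected" '
        /^#/ || /^[ \t]*$/ { next }    # skip comments and blank lines
        NF != n {
            printf "line %d: %d column(s), expected %d\n", NR, NF, n
            bad = 1
        }
        END { exit bad + 0 }
    ' "$loc_file"
}

# Example call:
#   check_loc_columns all_fasta.loc 4
```

A line where spaces were typed instead of TABs shows up as having a single column.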

If you are adding a reference dataset for a specific tool but are unsure of which loc-file the tool uses, you can examine the "tool_data_table_conf.xml.sample" file that comes with the tool wrapper (see "Preparations" above).

If you know the name of the loc-file but the file does not exist in the directory, you must create it yourself. This can be a bit tricky, since the tools will expect the loc-file to be formatted in a certain way (which can vary from table to table). It can be a good idea to look around to see if there is another loc-file for the same table somewhere that you can use as a template. If you SSH into the "usegalaxy.no" VM, you can search for the loc-file in the following two locations:

  • /cvmfs/data.galaxyproject.org/byhand/location/
  • /cvmfs/data.galaxyproject.org/managed/location/

If you don't find it here, but you know the name of a tool that uses this table, you can search for the tool in a tool shed and "Browse [its] repository tip files" as explained above. The wrapper should contain a "tool-data" directory with a sample loc-file for the data table.

  5. If the "/cvmfs/data.usegalaxy.no/byhand/location/" directory did not contain a loc-file for the data table and you had to create a new one in the step above, you must also add the table definition for the table to the file "/cvmfs/data.usegalaxy.no/byhand/location/tool_data_table_conf.xml".

A table definition consists of an outer "<table>" element with a "name" attribute specifying the name of the table. This element also has two child-elements. First, a "<columns>" element which should contain a comma-separated list of the names of the columns in the table, and also a "<file>" element with a "path" attribute that points to the location of the corresponding loc-file within the "/cvmfs/data.usegalaxy.no/byhand/location/" directory.

For example, here is the table definition for the local "all_fasta" data table.

    <table name="all_fasta" comment_char="#">
        <columns>value, dbkey, name, path</columns>
        <file path="/cvmfs/data.usegalaxy.no/byhand/location/all_fasta.loc" />
    </table>

Again, the tools will expect the table to be defined in a certain way with certain column names, so you should check if you can find an existing table definition for the table and use that. You can SSH into the "usegalaxy.no" VM and then look at the two files "/cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml" and "/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml" to see if either of them contains a definition of the table in question. Alternatively, find a sample "tool_data_table_conf.xml" file bundled with a tool wrapper that uses this table in a tool shed and copy/paste the table definition from there. Just remember to change the "path" attribute of the "<file>" element to point to the local loc-file under the "/cvmfs/data.usegalaxy.no/byhand/location/" directory when you add the table definition to "/cvmfs/data.usegalaxy.no/byhand/location/tool_data_table_conf.xml".
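Searching the existing configuration files for a table definition can be done with "grep" and "sed". The sketch below prints the first definition it finds; "find_table_definition" is a hypothetical helper name, and it assumes that the opening tag is written as '<table name="..."' with the "name" attribute first, which is the common layout but not guaranteed.

```shell
# Print the first <table name="..."> ... </table> block found for a given
# table name in a list of tool_data_table_conf.xml files.
find_table_definition() {
    table="$1"; shift
    for conf in "$@"; do
        [ -r "$conf" ] || continue
        if grep -q "<table name=\"$table\"" "$conf"; then
            echo "<!-- from $conf -->"
            # Print from the opening tag up to the next closing </table>.
            sed -n "/<table name=\"$table\"/,/<\/table>/p" "$conf"
            return 0
        fi
    done
    return 1
}

# Example call, run on the "usegalaxy.no" VM:
#   find_table_definition bowtie2_indexes \
#       /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml \
#       /cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml
```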

  6. Finally, you must publish the changes to the CVMFS repository and make it read-only again.

cd
cvmfs_server publish

(The first "cd" command is just to take you back to your home directory, since you cannot publish the changes while you are inside the CVMFS directory).

Alternatively, you can discard the changes with the command: cvmfs_server abort.
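The transaction, publish and abort commands can be tied together so that a failed change is never published by accident. The wrapper below is a sketch only: "with_cvmfs_transaction" is a made-up name, it uses just the three "cvmfs_server" subcommands shown on this page, and it assumes it is run as root on the data VM. Note that "cvmfs_server abort" may ask for confirmation before discarding the changes.

```shell
# Run a command inside a CVMFS transaction: publish if the command
# succeeds, abort (discard the changes) if it fails.
with_cvmfs_transaction() {
    cvmfs_server transaction || return 1
    if "$@"; then
        # Leave the CVMFS directory before publishing.
        cd / && cvmfs_server publish
    else
        status=$?
        echo "command failed (exit $status); discarding changes" >&2
        cd / && cvmfs_server abort
        return "$status"
    fi
}

# Example call (the copied file is a placeholder):
#   with_cvmfs_transaction cp /tmp/hg38.len /cvmfs/data.usegalaxy.no/byhand/lengths/
```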

  7. You can now log out of the data VM and log into "usegalaxy.no" instead to check that the CVMFS is updated there. Note that it will take some time for the changes to be distributed across the nodes, so be patient if it does not happen right away.

  8. Once the CVMFS has been updated, go to the Admin page in Galaxy and select "Data Tables" from the panel on the left. Search for the data table on the page and click on it (there could be multiple entries, so just select one of them). Click the refresh button to update the table and verify that your new reference dataset now appears. A restart of the Galaxy server should not be necessary, unless you have added a new table definition to the "tool_data_table_conf.xml" file. Note that the table does not always update right away. Just keep trying or come back later and try again. If it still doesn't work, you can consider a restart of Galaxy (Howto?).

Adding new genome build keys

If you added a reference dataset for a genome build that Galaxy currently does not recognize, you must also add the genome build ("dbkey"). For this, you need a file containing the length of each chromosome/scaffold in the genome in a two-column TAB-separated format where the first column contains the chromosome names (make sure these are the same names as used by the other reference datasets for this genome!) and the second column is the corresponding length (in bp). You can use the "faSize" tool (available from UCSC or BioConda) to create such a file based on the genome FASTA.
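With "faSize", the per-sequence lengths are, to my knowledge, produced with its "-detailed" option ("faSize -detailed genome.fa > genome.len"). If the tool is not at hand, the same two-column file can be computed with plain awk. The following is a sketch of that alternative and not the method used to build the existing files; "fasta_lengths" is a hypothetical name, and it assumes an uncompressed FASTA file where the sequence name is the header text up to the first whitespace.

```shell
# Print "<sequence name><TAB><length in bp>" for each record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            # Sequence name: first word of the header line, without ">".
            name = substr($1, 2)
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")    # ignore stray whitespace and CR characters
            len += length($0)
        }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example call (file names are placeholders):
#   fasta_lengths hg38.fa > hg38.len
```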

Make the CVMFS writable, as described above, and place the file in this directory:

/cvmfs/data.usegalaxy.no/byhand/lengths/

The file should be named after the genome build (with optional ".len" suffix), e.g. "hg38.len". Next, open the file /cvmfs/data.usegalaxy.no/byhand/location/dbkeys.loc in a text editor and add a new line for your genome build.

Enter the unique ID ("dbkey") for the genome in the first column (e.g. "hg38"), a descriptive name in the second column (e.g. "Human Dec. 2013 (GRCh38/hg38)" ) and finally the full path to the chromosome lengths-file in the third column. Verify that the loc-file is correctly TAB-separated afterwards. Publish the results to CVMFS as explained above and refresh the "__dbkeys__" data table via the Admin page in Galaxy.
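The new line for "dbkeys.loc" can be written with "printf" after a quick sanity check of the lengths file. This is a minimal sketch under the three-column layout described above (dbkey, name, path to the lengths file); the function name "add_dbkey" is made up for this example.

```shell
# Append a genome build to dbkeys.loc after checking the lengths file.
add_dbkey() {
    dbkeys_loc="$1"; dbkey="$2"; name="$3"; len_file="$4"
    # The lengths file must exist and every line must have exactly two
    # TAB-separated columns, the second being a whole number.
    awk -F '\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1 } END { exit bad + 0 }' \
        "$len_file" || { echo "bad lengths file: $len_file" >&2; return 1; }
    printf '%s\t%s\t%s\n' "$dbkey" "$name" "$len_file" >> "$dbkeys_loc"
}

# Example call:
#   add_dbkey /cvmfs/data.usegalaxy.no/byhand/location/dbkeys.loc hg38 \
#       "Human Dec. 2013 (GRCh38/hg38)" \
#       /cvmfs/data.usegalaxy.no/byhand/lengths/hg38.len
```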


Using a Data Manager to add reference datasets

... coming ...
