Tutorial

Copyright and License

Lemon-Tree is free software, released under the terms of the GNU General Public License (GPL) v2 (link).

Lemon-Tree is free of charge for the academic world and non-profit organizations. For any commercial usage of this software, please contact us.

The software uses the following Java libraries, which are provided with the package:

Colt library for high performance computing (web).
XOM library for XML parsing (web).
Epsgraphics library to create figures (web).
Apache CLI library to parse command line arguments (web).
The BiNGO Java library to calculate GO enrichment (web).

Coding crew:

Tom Michoel
Eric Bonnet
Anagha Joshi
Steven Maere
Pau Erola

Requirements:

Java version 1.8 or higher installed.

Versioning and downloads

The name of the Lemon-Tree executable (the .jar file) consists of a prefix (lemontree) followed by a version number, e.g. v3.0. The version number changes with each release that fixes bugs and/or adds new functionality. In this document, we always use the prefix lemontree without the version number; replace this name with that of the latest available jar file. At the moment, the latest version is 3.1.1. Older binary versions can be downloaded from our Google Drive repository.

Installation

The Lemon-Tree software is a command line program; there is no graphical user interface (at this stage). The package is a zipped archive that you have to unzip in the directory of your choice.

Let's take an example. Say the archive was unpacked in /home/jdoe/progs/. You can call the program by the command:

java -jar /home/jdoe/progs/lemontree.jar

You can pass memory size parameters to the Java virtual machine, which may help if you have a large dataset. For example, this command lets the program use up to 10 gigabytes of heap memory:

java -Xmx10g -jar /home/jdoe/progs/lemontree.jar

An alternative way to launch the program is to set the CLASSPATH environment variable to include the path to the Lemon-Tree jar file and to all the Java libraries (located in the "lib/" directory of the package). You can then launch the program with the following command (do not replace lemontree with the current version number for this command):

java lemontree.modulenetwork.RunCli
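As a rough sketch of the CLASSPATH setup described above, the helper below joins the main jar and every jar found in "lib/" with colons. The function name build_classpath is hypothetical, and the sketch assumes a POSIX shell, a main jar named lemontree.jar and libraries shipped as .jar files; adjust it to your installation.

```shell
# Hypothetical helper: print a CLASSPATH for a Lemon-Tree install directory.
# $1 = directory where the archive was unpacked (holds lemontree.jar and lib/).
build_classpath() (
    cp="$1/lemontree.jar"
    for jar in "$1"/lib/*.jar; do
        [ -e "$jar" ] || continue    # skip the unexpanded pattern if lib/ has no jars
        cp="$cp:$jar"
    done
    printf '%s\n' "$cp"
)

# Possible usage, with the install directory of the example above:
#   CLASSPATH=$(build_classpath /home/jdoe/progs)
#   export CLASSPATH
#   java lemontree.modulenetwork.RunCli
```

The body runs in a subshell (parentheses instead of braces) so that the working variables do not leak into your session.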

Synopsis

The purpose of the Lemon-Tree software package is to create a module network from expression data. The end result is a set of gene clusters (co-expressed genes) and their associated "regulators". To achieve this goal, you follow the Lemon-Tree recipe, which consists of several steps. For each step, you use one or more specific parts of the software, which are called "tasks". There are three fundamental tasks:

Generate several cluster solutions ("ganesh" task).
Merge the different cluster solutions using the fuzzy clustering algorithm ("tight_clusters" task).
Assign regulators to each cluster, producing the module network ("regulators" task).

Tutorial

In this guided example, we will infer a module network from cancer related data. The data was taken from the TCGA project (The Cancer Genome Atlas). More specifically, we will use expression data for mRNA and microRNA from Glioblastoma samples.

Please note that we have created very small datasets to keep the computational time short. Thus, the results do not aim to be representative. Note also that if you repeat the procedure, you might get slightly different results due to the small size of the examples and to stochastic fluctuations.

Ganesh task

Function: clusters the genes (rows) of a matrix using a Gibbs sampling procedure.

The first step (clustering) is done on the mRNA expression data only. We selected genes with non-flat profiles, keeping those with a standard deviation above a certain value (determined by looking at the histogram). We usually use 0.5 as the cutoff, but this may depend on the dataset. In this example, we have an input matrix of 500 genes and 50 samples. The data is centered and scaled (by row) to have a mean of 0 and a standard deviation of 1. We can generate a cluster solution with the command:

java -jar lemontree.jar -task ganesh -data_file data/expr_matrix.txt -output_file cluster1
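The preparation of the input matrix described before the command (standard deviation filter, then row-wise centering and scaling) could be sketched as follows in shell and awk. This is only an illustration: the helper name filter_and_scale is hypothetical, and the sketch assumes a tab-delimited matrix with one header line, gene names in the first column, no missing values, and the sample (n-1) standard deviation.

```shell
# Hypothetical helper: keep genes with SD above a cutoff, then center and scale each row.
# $1 = tab-delimited matrix (header line, gene names in column 1), $2 = SD cutoff.
filter_and_scale() {
    awk -F'\t' -v OFS='\t' -v cutoff="$2" '
    NR == 1 { print; next }                  # copy the header line unchanged
    {
        n = NF - 1
        if (n < 2) next                      # an SD needs at least two samples
        sum = 0; sumsq = 0
        for (i = 2; i <= NF; i++) { sum += $i; sumsq += $i * $i }
        mean = sum / n
        var = (sumsq - n * mean * mean) / (n - 1)
        sd = (var > 0) ? sqrt(var) : 0
        if (sd <= cutoff) next               # drop flat profiles
        for (i = 2; i <= NF; i++) $i = sprintf("%.4f", ($i - mean) / sd)
        print
    }' "$1"
}

# Possible usage (file names are placeholders):
#   filter_and_scale raw_matrix.txt 0.5 > data/expr_matrix.txt
```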

Note that the default behavior of the program is to create the output file in the current directory. To create the output file in a different directory, first create it (for example "my_results") and then use this directory with the "-output_file" parameter like this:

-output_file my_results/cluster1

The clustering procedure should be repeated multiple times, using the same command and changing only the name of the output file. Here we have generated 5 runs, named cluster1 to cluster5. The example files are in the "results" directory.
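Repeating the run by hand is tedious, so a small loop can help. The sketch below is one possible way to do it in a POSIX shell; the function name run_ganesh and the LEMONTREE variable are hypothetical conveniences, not part of Lemon-Tree. Setting LEMONTREE=echo prints the commands instead of running them, which is handy for checking them first.

```shell
# Hypothetical wrapper: launch several independent "ganesh" runs, one after the other.
# $1 = number of runs, $2 = output directory (created if missing).
# LEMONTREE overrides the launcher; it defaults to the command used in this tutorial.
run_ganesh() (
    n_runs="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    i=1
    while [ "$i" -le "$n_runs" ]; do
        # Left unquoted on purpose so the launcher splits into separate words.
        ${LEMONTREE:-java -jar lemontree.jar} -task ganesh \
            -data_file data/expr_matrix.txt \
            -output_file "$out_dir/cluster$i"
        i=$((i + 1))
    done
)

# Possible usage: run_ganesh 5 results
```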

Tight clusters task

Here, we are going to generate a single, robust clustering solution from all the individual solutions generated at the previous step.

java -jar lemontree.jar -task tight_clusters -data_file data/expr_matrix.txt -cluster_file cluster_file_list -output_file tight_clusters.txt -node_clustering true

The "cluster_file" is a simple text file, listing the location of all the individual cluster files generated at the "ganesh" step.
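The list file can be written by hand, or generated. The sketch below assumes one path per line and ganesh outputs named cluster1, cluster2, and so on, as in this tutorial; the helper name make_cluster_list is hypothetical.

```shell
# Hypothetical helper: write the paths of all ganesh outputs, one per line.
# $1 = directory holding cluster1, cluster2, ...; $2 = list file to create.
make_cluster_list() (
    list_file="$2"
    set -- "$1"/cluster[0-9]*        # positional parameters now hold the matches
    if [ ! -e "$1" ]; then           # the pattern stays literal when nothing matches
        echo "no cluster files found" >&2
        exit 1
    fi
    printf '%s\n' "$@" > "$list_file"
)

# Possible usage: make_cluster_list results cluster_file_list
```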

By default, this task keeps only clusters that have at least 10 genes. This number can easily be changed by overriding the parameter "min_clust_size" with another value on the command line for this task (see the wiki page on default parameters).

Two methods are available to create the tight clusters: clustering nodes (-node_clustering true) or clustering edges (-node_clustering false).

Regulators task

In this task, we are going to assign sets of "regulators" to each of the modules using a probabilistic scoring, taking into account the profile of the candidate regulator.

The candidate regulators are divided into two types, depending on the nature of their profiles: continuous or discrete. The first type can be, for example, transcription factors selected from the expression matrix, or microRNAs for which we have expression profiles. The second can be, for example, clinical parameters (such as the grade of a disease, represented by discrete values). In all cases, we must have profiles that match the samples of the tight clusters defined previously (missing values are allowed).

For this tutorial, we have selected a set of 85 candidate regulators from the mRNA expression matrix, based on their GO annotation (corresponding to either transcription factors, signal transducers or kinase activity). We have also selected a set of 100 microRNAs, for which we have expression profiles measured by dedicated microarrays. Both datasets correspond to continuous types of data. We will assign regulators separately for the two types.

java -jar lemontree.jar -task regulators -data_file data/expr_matrix.txt -reg_file data/reg_list.txt -cluster_file tight_clusters.txt -output_file results/reg_tf

The "reg_file" option points to a simple text list of candidate regulators that are present in the expression matrix. If the regulators are discrete, it is mandatory to add a second column to the text file describing the type of the regulator ("c" for continuous or "d" for discrete).
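To illustrate the two-column layout, the sketch below appends a type code to a one-column list of regulator names. It assumes that the two columns are separated by a tab, which is not stated here and should be checked against your version; the helper name add_reg_type and the file names are hypothetical.

```shell
# Hypothetical helper: append a type column ("c" or "d") to a one-column regulator list.
# $1 = list with one regulator name per line, $2 = type code.
add_reg_type() {
    awk -v t="$2" -v OFS='\t' 'NF > 0 { print $1, t }' "$1"
}

# Possible usage (placeholder file names):
#   add_reg_type clinical_params.txt d > data/clinical_reg_list.txt
```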

Note that this command will create four different output files, using the "output_file" parameter as the prefix for all the files.

reg_tf.topreg.txt: Top 1% regulators assigned to the modules.
reg_tf.allreg.txt: All the regulators assigned.
reg_tf.randomreg.txt: Regulators assigned randomly to the modules.
reg_tf.xml.gz: XML file containing all the regulatory trees used for assigning the regulators.

The regulators text files all have the same format: three columns representing respectively the regulator name, the module number and the score value.
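Because of this simple layout, the files are easy to query with standard tools. The sketch below lists the highest-scoring regulators of one module. It assumes tab-delimited columns without a header line, that larger scores are the more interesting ones, and a sort that supports general numeric ordering (-g, available in GNU and BSD sort); the helper name top_regulators is hypothetical.

```shell
# Hypothetical helper: show the N highest-scoring regulators of one module.
# $1 = regulators file (name, module, score), $2 = module number, $3 = N.
top_regulators() (
    tab=$(printf '\t')
    awk -F'\t' -v m="$2" '$2 == m' "$1" |
        LC_ALL=C sort -t "$tab" -k3,3gr |    # column 3, numeric, descending
        head -n "$3"
)

# Possible usage: top_regulators results/reg_tf.topreg.txt 5 10
```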

Now, let's assign the microRNAs as regulators of the modules.

java -jar lemontree.jar -task regulators -data_file data/all.txt -reg_file data/mir_reg_list.txt -cluster_file tight_clusters.txt -output_file results/reg_mir

Note that the data file "all.txt" contains both the expression profiles for mRNAs and microRNAs.

Figures task

This task creates one figure per module. The figure represents the expression values color-coded with a gradient ranging from dark blue (low expression values) to bright yellow (high expression values). All the module genes are in the lower panel, while the top regulators for the different classes or types of regulators (if any) are displayed in the upper panel. A regulation tree is represented on top of the figure, with the different split points highlighted as vertical red lines. The name of each gene is displayed on the left of the figure.

java -jar lemontree.jar -task figures -top_regulators reg_files.txt -data_file data/all.txt -reg_file data/all_reg_list.txt -cluster_file tight_clusters.txt -tree_file results/reg_tf.xml.gz

Note that the "top_regulators" parameter is a simple text file listing the different top regulator files, for different types of regulators (such as transcription factors, microRNAs, etc.). The content of such a file could be:

results/reg_tf.topreg.txt
results/reg_mir.topreg.txt

All figures are generated in the EPS (Encapsulated PostScript) format. On Unix systems, you can convert EPS files to PDF with the command epstopdf. All files can be converted with this shell one-liner:

for f in *.eps; do epstopdf "$f"; done

Please note that all files are generated in the current directory. If you want to generate the files in a given directory, first create it, then move into this directory and launch the command above (with the correct path to files, of course).

GO annotation task

The goal of this task is to calculate the GO category enrichment for each module. We use the BiNGO library to calculate the statistics. We have to specify two GO files: one describing the GO codes associated with the genes ("gene_association.goa_human") and another describing the GO graph ("gene_ontology_ext.obo"). The GO graph file can be downloaded for various organisms from the GO website. The annotation file can be downloaded from the EBI FTP site (be sure to take the 'goa' file format, e.g. gene_association.goa_ref_human.157 for human; the other format is not currently supported). We also specify the set of genes that should be used as the reference for the calculation of the statistics, in this case the list of all the genes that are present on the microarray chip (file "all_gene_list"). The results are stored in the output file "go.txt".

java -jar lemontree.jar -task go_annotation -cluster_file tight_clusters.txt -go_annot_file gene_association.goa_human -go_ontology_file gene_ontology_ext.obo -go_ref_file all_gene_list -output_file go.txt

Revamp task

This task aims to maximize the Bayesian co-expression clustering score of an existing module network while preserving the initial number of clusters, as described in Erola, Bonnet and Michoel - Gene Regulatory Networks, 2019 (arxiv). A threshold can be specified to prevent genes from being reassigned when the score gain is below this threshold, which allows systematic tracking of the conservation and divergence of modules with respect to the initial partition. This task can be used to optimize an existing module network obtained with a different clustering algorithm, or to optimize an existing module network for a different data matrix. The "cluster_file" is a simple text clustering file, like the one obtained in the "tight_clusters" step, and "reassign_thr" is the score gain threshold that must be reached to move a gene from one cluster to another.

java -jar lemontree.jar -task revamp -data_file data/expr_matrix.txt -cluster_file cluster_file.txt -reassign_thr 0.0 -output_file revamped_clusters.txt -node_clustering true

There is a multi-tissue version of "revamp" that incorporates prior information on physiological tissue similarity, as described in Erola, Björkegren and Michoel - Bioinformatics, 2020. To run it, we need to specify a "column_file" with prior tissue similarities. This is a simple text file with a numeric value [0..1] for each sample, defining how much that sample influences the current clustering task. We suggest deriving the similarity coefficients using Pearson's correlation, but other distance measures could be used.

java -jar lemontree.jar -task revamp -data_file data/expr_matrix.txt -cluster_file cluster_file.txt -reassign_thr 0.0 -output_file revamped_clusters.txt -column_file similarity_file.txt -node_clustering true

Advanced usage

With large data sets, running time can become an issue. For the clustering task, it is fairly easy to run "ganesh" tasks in parallel on different computers or cores (on a cluster system, for example). The assignment of regulators can in principle also be parallelized, since the assignment is independent for each module. We have added two options to allow users to easily run regulator assignment in parallel.

The "experiments" task samples conditions for a given data set, and the results file can then be used to assign regulators in parallel with the "split_reg" task.

The "split_reg" task uses a special option named "range", which specifies the modules for which the assignment will be done. The parameter of the "range" option consists of two numbers separated by a colon character, giving the first and last module of the range. Thus, with different range options, users can run assignment tasks in parallel. It is also possible to do the assignment module by module, by using the same number on each side of the colon character (i.e. "-range 5:5" will perform the assignment for module 5 only).

The two examples below give the command line syntax to perform experiments sampling and then assign regulators for modules 5 to 10.

java -jar lemontree.jar -task experiments -data_file data/expr_matrix.txt -cluster_file tight_clusters.txt -output_file exp.xml.gz

java -jar lemontree.jar -task split_reg -data_file data/expr_matrix.txt -reg_file data/reg_list.txt -cluster_file tight_clusters.txt -tree_file exp.xml.gz -range 5:10 -output_file reg_5_10
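On a single multi-core machine, several "split_reg" jobs can be started in the background, each with its own range. The sketch below shows one way to cut a span of modules into chunks; the function name split_reg_parallel and the LEMONTREE variable are hypothetical, the file names are those of the example above, and the first and last module numbers are left to the caller. Setting LEMONTREE=echo prints the commands instead of running them.

```shell
# Hypothetical wrapper: run "split_reg" on consecutive chunks of modules in parallel.
# $1 = first module, $2 = last module, $3 = number of modules per job.
split_reg_parallel() (
    start="$1"
    last="$2"
    step="$3"
    while [ "$start" -le "$last" ]; do
        end=$((start + step - 1))
        if [ "$end" -gt "$last" ]; then end="$last"; fi
        # Launcher left unquoted on purpose so it splits into separate words.
        ${LEMONTREE:-java -jar lemontree.jar} -task split_reg \
            -data_file data/expr_matrix.txt -reg_file data/reg_list.txt \
            -cluster_file tight_clusters.txt -tree_file exp.xml.gz \
            -range "$start:$end" -output_file "reg_${start}_${end}" &
        start=$((end + 1))
    done
    wait    # block until every background job has finished
)

# Possible usage: split_reg_parallel 1 10 4   (ranges 1:4, 5:8 and 9:10)
```

All jobs start at once, so pick a chunk size that keeps the number of jobs close to the number of available cores.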

Building from source

To build the software from source, you can anonymously download the latest version (read-only) into the current directory with the following git command:

git clone https://github.com/erbon7/lemon-tree/ .

Then you can use the ant tool to build the jar file:

ant jar

If everything went well, you should see a file named lemontree_vx.x.x.jar in the current directory.

External programs

Many different algorithms can be used to cluster a matrix of gene expression data. For example, the R statistical software has literally hundreds of different packages to perform this procedure (see a list here).

The results of the clustering procedure can be used in Lemon-Tree as the input for the regulators assignment task. The input data just needs to be formatted correctly: a tab-delimited text file, with gene names (identification codes) in the first column and a cluster number in the second column.
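As an example of such a conversion, the sketch below turns a two-column CSV export (as might be written from R) into the tab-delimited layout described above. The input layout is an assumption: one header line, gene name in the first column, cluster number in the second, optional double quotes around values and no commas inside gene names. The helper name csv_to_cluster_file is hypothetical.

```shell
# Hypothetical helper: convert a "gene,cluster" CSV export to a tab-delimited cluster file.
# $1 = CSV file with one header line.
csv_to_cluster_file() {
    awk -F',' -v OFS='\t' 'NR > 1 { gsub(/"/, ""); print $1, $2 }' "$1"
}

# Possible usage (placeholder file names):
#   csv_to_cluster_file kmeans_clusters.csv > external_clusters.txt
```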

Note that it is also possible to use the results of the probabilistic clustering procedure of Lemon-Tree as input for other programs (same format as described in the previous paragraph).
