ExtendLemonTree
We strongly encourage developers to extend Lemon-Tree by building on its codebase. This page describes the basic data structures and gives some guidelines, assuming that the reader knows the fundamentals of programming in Java.
The complete Java documentation (Javadoc) is available for download from the home page. Most class properties and functions are commented, so browsing this documentation is a good way to become familiar with how the code is organized.
A good entry point for understanding how the code is organized and how it works is the RunCli class, which parses the command line, then creates objects and calls functions according to the options specified for the chosen task.
Each task is parsed within its own if block, like this entry for the "ganesh" task:
// ---------------------------------------------------------------
// ganesh task: 2-way clustering of genes using the gibbs sampler
// ---------------------------------------------------------------
if (task.equalsIgnoreCase("ganesh")) {
In the body of this block, we can see that the clustering procedure is done in 5 essential steps:
- creation of a ModuleNetwork object.
- read data matrix.
- initialization of clustering parameters.
- two-way probabilistic clustering.
- write results to output file.
// Create ModuleNetwork object
ModuleNetwork M = new ModuleNetwork();
// read the expression data matrix
M.readExpressionMatrix(data_file, gene_file);
// initialize the clustering parameters (normal-gamma priors)
M.setNormalGammaPriors(lambda, mu, alpha, beta);
// Gibbs sample different module sets with one tree per module
M.gibbsSamplerGenes(init_num_clust, num_runs, num_steps, burn_in, sample_steps, score_gain, use_bayesian_score);
// write results to text file
M.writeClusters(output_file);
The data structures useful to implement a novel clustering algorithm are located in the ModuleNetwork class:
- data: two-dimensional double array storing expression data values. Rows correspond to genes, columns to experiments or samples.
- numCond: integer indicating the number of samples or experiments.
- numRows: integer value indicating the number of genes or rows in the data matrix.
- geneSet: list of Gene objects, with the gene name (e.g. HUGO codes) encoded in the Gene.name property.
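To make the layout concrete, the sketch below mirrors these four fields in a small stand-in class and computes the mean expression of each gene. `ExpressionDataSketch` and `geneMeans` are hypothetical names used only for illustration. In Lemon-Tree itself the fields belong to `ModuleNetwork`, and `geneSet` holds `Gene` objects rather than plain strings.

```java
import java.util.List;

// Hypothetical stand-in that mirrors the ModuleNetwork fields described above.
class ExpressionDataSketch {
    double[][] data;       // rows = genes, columns = experiments or samples
    int numRows;           // number of genes (rows)
    int numCond;           // number of experiments or samples (columns)
    List<String> geneSet;  // gene names (assumption: Lemon-Tree stores Gene objects here)

    ExpressionDataSketch(double[][] data, List<String> geneSet) {
        if (data.length != geneSet.size()) {
            throw new IllegalArgumentException("one gene name is required per row");
        }
        this.data = data;
        this.numRows = data.length;
        this.numCond = data.length == 0 ? 0 : data[0].length;
        this.geneSet = geneSet;
    }

    // Mean expression of each gene (row) across all conditions (columns).
    double[] geneMeans() {
        double[] means = new double[numRows];
        for (int i = 0; i < numRows; i++) {
            double sum = 0.0;
            for (int j = 0; j < numCond; j++) {
                sum += data[i][j];
            }
            means[i] = numCond == 0 ? Double.NaN : sum / numCond;
        }
        return means;
    }
}
```

Row index `i` in `data` is assumed to correspond to entry `i` of `geneSet`, which is the convention the sketch relies on when pairing names with values.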
These structures can be used to implement a novel clustering algorithm in Lemon-Tree, following these steps:
- Create a novel class implementing the core algorithm.
- Create a novel function in the ModuleNetwork class, that can be called to run the new clustering algorithm.
- Create a new entry in the RunCli class, specifying the command line options necessary to call the new algorithm.
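As an illustration of the first step, here is a minimal sketch of a standalone class implementing a core algorithm. It uses a plain deterministic k-means over the rows of the data matrix purely as a placeholder; it is not the Gibbs sampler used by Lemon-Tree, and the class name `KMeansGeneClusterer` and its parameters are hypothetical.

```java
import java.util.Arrays;

// Hypothetical example class: deterministic k-means over the rows (genes) of the data matrix.
class KMeansGeneClusterer {
    private final double[][] data; // rows = genes, columns = conditions
    private final int k;           // number of clusters
    private final int maxIter;     // maximum number of iterations

    KMeansGeneClusterer(double[][] data, int k, int maxIter) {
        if (k < 1 || k > data.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of genes");
        }
        if (maxIter < 1) {
            throw new IllegalArgumentException("maxIter must be at least 1");
        }
        this.data = data;
        this.k = k;
        this.maxIter = maxIter;
    }

    // Returns one cluster index (0 .. k-1) per gene.
    int[] cluster() {
        int n = data.length;
        int m = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = data[c].clone(); // the first k genes serve as initial centroids
        }
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each gene to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = 0.0;
                    for (int j = 0; j < m; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assign[i] != best) {
                    assign[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: recompute each centroid as the mean of its genes.
            double[][] sums = new double[k][m];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < m; j++) {
                    sums[assign[i]][j] += data[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its previous centroid
                    for (int j = 0; j < m; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return assign;
    }
}
```

For the second and third steps, a wrapper function in `ModuleNetwork` would presumably pass `data` to such a class and convert the returned indices into `Module` objects, and a new `if (task.equalsIgnoreCase(...))` block in `RunCli` would call that wrapper. Both are left out here because they depend on the actual Lemon-Tree classes.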
Note that the output format for the final clustering solution is very simple: a tab-delimited file with a gene name and cluster number for all genes.
The ModuleNetwork.writeClusters() function can be called to write the clustering results. This function reads the moduleSet property, which is a list of Module objects (= clusters), each containing the cluster number (.number property) and the list of genes (.genes object) for this module.
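The output format can be sketched as follows. This is not the body of `writeClusters()`; it is a hypothetical helper (`ClusterFileSketch.formatClusters`) that only shows the layout of one tab-delimited line per gene, under the assumption that the gene name comes first and the cluster number second.

```java
import java.util.List;

// Hypothetical helper showing the tab-delimited clustering output layout.
class ClusterFileSketch {
    // Builds one "geneName<TAB>clusterNumber" line per gene.
    static String formatClusters(List<String> geneNames, int[] clusterNumbers) {
        if (geneNames.size() != clusterNumbers.length) {
            throw new IllegalArgumentException("one cluster number is required per gene");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < clusterNumbers.length; i++) {
            sb.append(geneNames.get(i))
              .append('\t')
              .append(clusterNumbers[i])
              .append('\n');
        }
        return sb.toString();
    }
}
```

The returned string can then be written to the output file with any standard writer. Whether cluster numbers start at 0 or 1 should be checked against the existing `writeClusters()` output before mixing files produced by different methods.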
The implementation of a new regulator assignment algorithm follows the same steps as described for the clustering algorithm:
- Create a novel class implementing the core algorithm.
- Create a novel function in the ModuleNetwork class, that can be called to run the new regulator assignment algorithm.
- Create a new entry in the RunCli class, specifying the command line options necessary to call the new algorithm.
The data structures described in the previous paragraph can be re-used for regulator assignment, especially the moduleSet structure containing the list of modules.
Developers can look at the code for the "regulators" task in the RunCli class, which implements the default algorithm for this step.
ModuleNetwork M = new ModuleNetwork();
M.setNormalGammaPriors(lambda, mu, alpha, beta);
M.readExpressionMatrix(data_file, null);
M.readClusters(cluster_file);
M.readRegulators(reg_file);
M.initStatisticsAndScore();
M.setDataMeanAndSDFromModuleset();
// cluster experiments using the gibbs sampler
M.gibbsSamplerExpts(num_runs, num_steps, burn_in, sample_steps, score_gain, use_bayesian_score);
// assign regulators
M.assignRegulatorsNoAcyclStoch(beta_reg, num_reg);
// write results as text file with all regulators, top 1%, random regulators and regulations trees as xml
M.printRegulators(output_file+".allreg.txt", true, false);
M.printRegulators(output_file+".topreg.txt", false, false);
The steps of the regulator assignment procedure can be described as:
- set initial parameters.
- load expression matrix.
- load clusters definition.
- load list of regulators.
- initialize statistics.
- assign regulators.
- print regulator results files.
A data structure from the ModuleNetwork class that will be useful for a novel regulator assignment algorithm is the regulatorSet property, a list of Gene objects representing the candidate regulators that will be used during the assignment procedure.
For each Module object, the list of assigned regulators is stored in the regulatorWeights object, a map structure associating each Gene object with a score value. This data structure can be used to store the regulators assigned by the new method and to create the final output file.
The output format after the regulator assignment is also a tab-delimited text file with 3 fields: regulator (gene) name, module number, and a regulator score (see the printRegulators function for more details).
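The sketch below shows how a regulator-to-score map and the three-column output could fit together. It is only an illustration: the scoring rule (absolute Pearson correlation between a candidate regulator and the mean profile of the module) is a placeholder and not the algorithm implemented in Lemon-Tree, regulators are keyed by name instead of by `Gene` object, and `RegulatorScoringSketch` and its functions are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: score candidate regulators for one module and format the result.
class RegulatorScoringSketch {
    // Pearson correlation of two equal-length profiles; returns 0 if either is constant.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return 0.0;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Scores every candidate regulator against the mean expression profile of one module.
    static Map<String, Double> scoreRegulators(double[][] moduleGenes,
                                               Map<String, double[]> candidates) {
        int m = moduleGenes[0].length;
        double[] mean = new double[m];
        for (double[] gene : moduleGenes) {
            for (int j = 0; j < m; j++) {
                mean[j] += gene[j] / moduleGenes.length;
            }
        }
        // LinkedHashMap keeps the candidates in their input order.
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : candidates.entrySet()) {
            weights.put(e.getKey(), Math.abs(pearson(e.getValue(), mean)));
        }
        return weights;
    }

    // Builds "regulator<TAB>moduleNumber<TAB>score" lines for one module.
    static String formatRegulators(int moduleNumber, Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            sb.append(e.getKey()).append('\t')
              .append(moduleNumber).append('\t')
              .append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

In an actual extension, the scores would be stored in each module's `regulatorWeights` map so that the existing `printRegulators` function can be reused for output.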
Finally, developers should incorporate a new entry in the RunCli class, handling the call to the new assignment algorithm, with the appropriate parameters.
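The dispatch pattern used for such an entry can be sketched in isolation as follows. The task name "myregulators" and the class `TaskDispatchSketch` are hypothetical, and the returned strings merely stand in for the calls that the real if blocks in `RunCli` make on a `ModuleNetwork` object.

```java
// Hypothetical sketch of the if-block dispatch on the task name.
class TaskDispatchSketch {
    // Returns a label for the handler selected by the task name (case-insensitive).
    static String dispatch(String task) {
        if (task.equalsIgnoreCase("ganesh")) {
            return "clustering";
        }
        if (task.equalsIgnoreCase("regulators")) {
            return "default regulator assignment";
        }
        // New entry: would read its own options, then call the new algorithm.
        if (task.equalsIgnoreCase("myregulators")) {
            return "new regulator assignment";
        }
        throw new IllegalArgumentException("Unknown task: " + task);
    }
}
```

Keeping the new entry as a separate block leaves the existing tasks untouched, so the default algorithm remains available for comparison.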