ClusterScan Tutorial
If you have not done so already, please follow our ClusterScan Installation guide and then come back here.
ClusterScan comes with an example dataset that is useful both for this tutorial and as a model for formatting your own data. If you have followed the installation guide, you can find a folder named tutorial in the ClusterScan-master folder. This folder contains the dataset, which is composed of three files:
| File | Description | Source |
|---|---|---|
| Homo_sapiens.GRCh38.85_genes.bed | BED formatted file storing all human protein coding genes. | obtained from the Ensembl GFF3 annotation (assembly version: GRCh38; Ensembl version: 85) filtering for protein coding genes (type: “gene”; biotype: “protein_coding”). |
| Homo_sapiens.GRCh38.85_Pfam.txt | two-column tab-delimited file storing the Pfam domain annotations for each protein coding gene. | obtained through the BioMart web server, with filters on "Gene type" ("protein_coding") and "Attributes" ("Gene stable ID" and "Pfam domain ID"). |
| Pfam-descriptors.txt | optional two-column, tab-delimited file storing all Pfam accession descriptions. | based on the Pfam domain database. |
You can take a look at these files in order to learn how to format your own input files.
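Before running ClusterScan on your own data, it can help to confirm that a two-column file such as the Pfam annotation file really is tab-delimited. The following is a minimal sketch and not part of ClusterScan: the function name `check_two_column_tsv` is hypothetical, the sample rows use invented IDs, and whether empty lines or header rows are acceptable to ClusterScan is not stated in this tutorial.

```python
import csv
import io


def check_two_column_tsv(handle):
    """Return the 1-based line numbers of rows that do not have exactly two tab-separated fields."""
    bad_rows = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row:
            # Empty lines are skipped here; this is an assumption of the sketch.
            continue
        if len(row) != 2:
            bad_rows.append(line_number)
    return bad_rows


# Toy data with invented IDs: the second row uses a space instead of a tab.
sample = io.StringIO("GENE_A\tPF00001\nGENE_B PF00002\nGENE_C\tPF00003\n")
print(check_two_column_tsv(sample))  # [2]
```

To check a real file, open it with `open(path, newline="")` and pass the handle to the function.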
If you have added ClusterScan to your PATH by modifying the bashrc file as described in the installation guide, you can run ClusterScan from any folder by typing:
clusterscan.py -h
If you have not, you must call python explicitly and give the path to the directory in which the main script resides:
python <path-to-clusterscan>/ClusterScan-master/clusterscan.py -h
You can run your first ClusterScan analysis, and check that the program is working properly, by entering your work directory and typing the code below in your terminal:
clusterscan.py clusterdist <path-to-clusterscan>/ClusterScan-master/tutorial/Homo_sapiens.GRCh38.85_genes.bed <path-to-clusterscan>/ClusterScan-master/tutorial/Homo_sapiens.GRCh38.85_Pfam.txt --info <path-to-clusterscan>/ClusterScan-master/tutorial/Pfam-descriptors.txt -o human_test -a human_01
This starts a new ClusterScan analysis and creates the human_test folder in your work directory.
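If you prefer to launch the same analysis from a script, the command above can be assembled as an argument list. This is only a sketch: the helper names `build_tutorial_command` and `run_tutorial` are hypothetical, `/path/to/ClusterScan-master/tutorial` stands in for your own installation path, and the parameter name `analysis_name` reflects the assumption that the value given to `-a` is used as the prefix of the output files. It also assumes `clusterscan.py` is on your PATH.

```python
import os
import subprocess


def build_tutorial_command(tutorial_dir, output_dir="human_test", analysis_name="human_01"):
    """Build the argument list for the tutorial run shown above."""
    return [
        "clusterscan.py", "clusterdist",
        os.path.join(tutorial_dir, "Homo_sapiens.GRCh38.85_genes.bed"),
        os.path.join(tutorial_dir, "Homo_sapiens.GRCh38.85_Pfam.txt"),
        "--info", os.path.join(tutorial_dir, "Pfam-descriptors.txt"),
        "-o", output_dir,
        "-a", analysis_name,
    ]


def run_tutorial(tutorial_dir):
    """Run the tutorial analysis; raises CalledProcessError if ClusterScan exits with an error."""
    subprocess.run(build_tutorial_command(tutorial_dir), check=True)


# Placeholder path: replace it with the location of your own tutorial folder.
command = build_tutorial_command("/path/to/ClusterScan-master/tutorial")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when the installation path contains spaces.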
If everything goes well, you should have the following files in the human_test folder:
| File | Description |
|---|---|
| human_01_clusters.tsv | stores the coordinates of all the clusters found. It contains a unique ID for each cluster; the category to which the cluster belongs; the chromosome/scaffold (chr) on which it resides; its start and end coordinates; the number of features within the cluster (n_features); and the number of features that overlap the cluster but belong to a different category (n_bystanders). |
| human_01_clusters.bed | a BED version of the previous file, for compatibility with other tools used in downstream analysis. Fields follow the BED format and are (from first to last column): chromosome/scaffold on which the cluster resides; its start/end coordinates, using a 0-based start and 1-based end coordinate system; the cluster ID; the number of features within the cluster; the strand; the category describing the cluster. |
| human_01_summary.tsv | contains a category-based summary of the cluster analysis: the category; the number of clusters found for each category (n_clusters); the total number of features found in clusters belonging to each category (n_ft); the number of bystanders (n_bs); and, for each category, the maximal and minimal number of features (max_ft and min_ft) and of bystanders (max_bs and min_bs) found in a single cluster. |
| human_01_features.tsv | a list of the features found to overlap clusters. It contains the chromosome/scaffold (chr) on which the feature resides; its start and end coordinates; the feature name; the strand; the ID of the cluster that it overlaps; the category to which the cluster (and the feature) belongs. |
| human_01_bystanders.tsv | a list of the bystanders found to overlap clusters. It contains the chromosome/scaffold (chr) on which the feature resides; its start and end coordinates; the feature name; the strand; the ID of the cluster that it overlaps; the category to which the cluster (but not the feature) belongs. |
| human_01_distribution.pdf | a histogram showing the distribution of features in the top 10 categories by number of features. |
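The column order described for human_01_clusters.bed, together with its 0-based start and 1-based end coordinates, can be illustrated with a small parser. This is a sketch under stated assumptions rather than ClusterScan code: `ClusterRecord`, `parse_cluster_bed_line` and `cluster_length` are hypothetical names, the example line is invented, and the feature count is assumed to be written as a plain integer.

```python
from typing import NamedTuple


class ClusterRecord(NamedTuple):
    chrom: str        # chromosome/scaffold on which the cluster resides
    start: int        # 0-based start coordinate
    end: int          # 1-based end coordinate
    cluster_id: str
    n_features: int   # number of features within the cluster
    strand: str
    category: str


def parse_cluster_bed_line(line):
    """Split one tab-delimited BED line into a ClusterRecord (column order as described above)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, cluster_id, n_features, strand, category = fields[:7]
    return ClusterRecord(chrom, int(start), int(end), cluster_id,
                         int(n_features), strand, category)


def cluster_length(record):
    """With a 0-based start and a 1-based end, the length is simply end - start."""
    return record.end - record.start


# Invented example line, for illustration only.
record = parse_cluster_bed_line("1\t1000\t5000\tC1\t4\t+\tPF00001\n")
print(cluster_length(record))  # 4000
print(record.start + 1)        # 1001, the start in 1-based coordinates
```

The half-open convention is what makes the length a simple subtraction; converting to 1-based coordinates only requires adding 1 to the start.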
The tutorial folder also contains a folder named results, which holds pre-calculated output files for exactly the same analysis. You can check your run by comparing your six output files with them.
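The comparison can be done by hand or with a few lines of Python. The sketch below uses the standard library `filecmp` module and compares files byte by byte; `compare_outputs` is a hypothetical helper, and it assumes that the pre-calculated files carry the same names as your own outputs, which this tutorial does not state. A PDF may also differ at the byte level (for example in embedded metadata) even when the plot is the same, so a mismatch on the distribution file is not necessarily a problem.

```python
import filecmp
import os
import tempfile


def compare_outputs(my_dir, reference_dir):
    """Compare, byte by byte, the regular files that have the same name in both folders."""
    shared = sorted(
        name
        for name in set(os.listdir(my_dir)) & set(os.listdir(reference_dir))
        if os.path.isfile(os.path.join(my_dir, name))
        and os.path.isfile(os.path.join(reference_dir, name))
    )
    match, mismatch, errors = filecmp.cmpfiles(my_dir, reference_dir, shared, shallow=False)
    return {"match": sorted(match), "mismatch": sorted(mismatch), "errors": sorted(errors)}


# Self-contained demonstration using temporary folders and invented file contents.
with tempfile.TemporaryDirectory() as mine, tempfile.TemporaryDirectory() as reference:
    for folder, text in ((mine, "a\tb\n"), (reference, "a\tb\n")):
        with open(os.path.join(folder, "same.tsv"), "w") as handle:
            handle.write(text)
    for folder, text in ((mine, "x\n"), (reference, "y\n")):
        with open(os.path.join(folder, "different.tsv"), "w") as handle:
            handle.write(text)
    print(compare_outputs(mine, reference))
```

For the tutorial run, the two arguments would be your human_test folder and the results folder inside the tutorial folder.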