Output

gene_info

summary.txt Lists the main assembly metrics (counts as well as mean, SD, min and max of length, GC content and coverage) at contig and gene level
raw_gene_table.csv Contains all variables with their values as computed on the input data set (for a detailed description of each variable, see this section)
imputed_gene_table.csv The same as raw_gene_table.csv, except that the variables 'c_genecovsd', 'c_genelensd', 'g_covdev_c', 'g_gcdev_c' and 'g_lendev_c' are rescaled to a range of 0 to 1. NaN (= missing) values are imputed with the mean of the respective variable
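
The rescaling and imputation described for imputed_gene_table.csv can be illustrated with a minimal sketch. This is not the tool's own code: the function name `rescale_and_impute` and the example values are made up for illustration, and min-max scaling is assumed as the way of mapping values to the range 0 to 1. Because min-max scaling is a linear transformation, imputing the mean before or after rescaling gives the same result.

```python
import math


def rescale_and_impute(values):
    """Min-max rescale one column to [0, 1] and fill NaN with the column mean.

    Hypothetical helper for illustration only; min-max scaling is an assumption.
    """
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        # Nothing to compute a range or mean from; return the column unchanged.
        return list(values)

    lo, hi = min(observed), max(observed)
    span = hi - lo

    def scale(v):
        # A constant column has no spread; map every value to 0.0.
        return 0.0 if span == 0 else (v - lo) / span

    # Mean of the rescaled observed values, used for every missing entry.
    scaled_mean = sum(scale(v) for v in observed) / len(observed)
    return [scaled_mean if math.isnan(v) else scale(v) for v in values]


# Made-up example column with one missing value.
example = [2.0, float("nan"), 4.0, 6.0]
result = rescale_and_impute(example)
print(result)
```

The missing entry ends up at the mean of the rescaled column, so it does not shift the column average.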

The label in the plots that represents the query species is determined automatically and is always colored dark grey

3D_plot.html Interactive 3D scatterplot to examine genes and their taxonomic assignments
- with single-clicks on labels you can hide individual groups
- double-clicks hide every group except for the one that was clicked
- hovering over data points shows additional information
- the subdirectory “3D_plot_files” holds additional files for this plot and is required to display the plot (important when working with MobaXterm for example)
gene_table_taxon_assignment.csv raw_gene_table with PCA coordinates for each gene and their taxonomic assignment appended
- this is a tabular representation of all information that is displayed in the 3D plot
- see Additional information for details on the contained information
taxsun.tsv Input file to explore the taxonomic assignment of the genes with taxSun

contribution_of_variables.png|.pdf Figure illustrating how much each variable contributes to the first two principal components
genes_and_variables.png|.pdf Biplot of variables (vectors) and genes (points) in the new coordinate system defined by the first two principal components. Transparency represents the amount of contribution to the principal components
pca_loadings.csv Table listing the loadings of the original variables (rows) on the computed principal components (columns)
pca_summary.csv Table listing standard deviation, proportion of explained variance and cumulative proportion of explained variance in the original data for each of the principal components
scree_plot.png|.pdf Scree plot visualising the amount of variance in the original data that is explained by each of the principal components (here: dimensions)
parallel_analysis.png|.pdf Only available if parallel analysis was performed on the principal components. Results of Horn’s parallel analysis: plot of the random eigenvalues for the given number of PCs together with the adjusted and unadjusted eigenvalues, indicating which ones were retained for the subsequent PCA
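
The three quantities listed in pca_summary.csv are related in the standard way: the variance of a principal component is its squared standard deviation, and its proportion of explained variance is that variance divided by the sum over all components. The following sketch shows this relationship; the function name `explained_variance` and the input values are hypothetical and not taken from the tool.

```python
from itertools import accumulate


def explained_variance(std_devs):
    """Return (proportion, cumulative proportion) of variance per component.

    Hypothetical helper: expects the standard deviations of all principal
    components, ordered from the first to the last component.
    """
    variances = [s ** 2 for s in std_devs]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


# Made-up standard deviations for three principal components.
proportions, cumulative = explained_variance([2.0, 1.0, 1.0])
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")
```

The per-component proportions are the values a scree plot displays, which is why the two outputs can be cross-checked against each other.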

This file is the output of DIAMOND and holds the exact hits for each protein. The columns are the following:

qseqid Query Sequence ID (equal to the column fasta_header in 'gene_table_taxon_assignment.csv')
sseqid Subject Sequence ID - accession number of the matched protein in DB
pident Percentage of identical matches
length Alignment length
mismatch Number of mismatches
gapopen Number of gap openings
qstart Start of alignment in query
qend End of alignment in query
sstart Start of alignment in subject
send End of alignment in subject
evalue Expect value
bitscore Bit score
staxids Unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
ssciname Unique Subject Scientific Name(s), separated by a ';'
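
As a rough sketch of how the columns above could be read, the following snippet parses one hit line into a dictionary. It assumes the file is tab-separated without a header line, as in DIAMOND's tabular output format; the function name `read_hits` and the sample row are invented for illustration and are not output of an actual run.

```python
import csv
import io

# Column order as listed above.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "staxids", "ssciname",
]


def read_hits(handle):
    """Parse tab-separated hit lines into a list of dictionaries.

    Hypothetical helper; assumes no header line and tab as the delimiter.
    """
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    hits = []
    for row in reader:
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        # Multi-valued fields are separated by ';'; drop empty entries.
        row["staxids"] = [t for t in row["staxids"].split(";") if t]
        row["ssciname"] = [n for n in row["ssciname"].split(";") if n]
        hits.append(row)
    return hits


# Invented example line, not real output.
sample = (
    "g1\texample_protein\t87.5\t200\t25\t0\t1\t200\t5\t204\t1e-50\t350.0\t"
    "9606;10090\tHomo sapiens;Mus musculus\n"
)
hits = read_hits(io.StringIO(sample))
print(hits[0]["qseqid"], hits[0]["staxids"], hits[0]["ssciname"])
```

For a real file, the same function could be called with an opened file handle in place of the `io.StringIO` object.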