Freya Arthen edited this page Feb 20, 2024 · 6 revisions

gene_info

  • summary.txt Lists the main assembly metrics (i.e. counts as well as mean, SD, min and max of length, GC content and coverage) at contig and gene level
  • raw_gene_table.csv Contains all variables with their values as computed on the input data set (for a detailed description of each variable, see this section)
  • imputed_gene_table.csv The same as raw_gene_table.csv, except for the variables 'c_genecovsd', 'c_genelensd', 'g_covdev_c', 'g_gcdev_c' and 'g_lendev_c', which are rescaled to a range of 0 to 1. NaN (= missing) values are imputed with the mean of the respective variable
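
To make the difference between the raw and the imputed table concrete, here is a minimal sketch of the two operations on a single variable. It assumes min-max scaling for the rescaling step and applies the rescaling before the imputation; neither detail is stated on this page, and the helper names and example values are hypothetical.

```python
import math


def rescale_0_1(values):
    """Min-max rescale to the range 0..1 (assumed method); NaN entries stay NaN."""
    present = [v for v in values if not math.isnan(v)]
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant variable carries no spread; map every present value to 0.
        return [v if math.isnan(v) else 0.0 for v in values]
    return [v if math.isnan(v) else (v - lo) / (hi - lo) for v in values]


def impute_with_mean(values):
    """Replace NaN (= missing) entries with the mean of the non-missing entries."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


# Hypothetical values of one variable (e.g. 'g_covdev_c') for four genes.
raw = [2.0, float("nan"), 4.0, 6.0]
scaled = rescale_0_1(raw)            # missing value is still NaN here
imputed = impute_with_mean(scaled)   # missing value replaced by the column mean
print(imputed)
```

With these made-up values the missing entry ends up at the mean of the rescaled values, so the imputed column contains no NaN and stays within 0 to 1.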

taxonomic_assignment

The label in the plots that represents the query species is determined automatically and is always colored dark grey

  • 3D_plot.html Interactive 3D scatterplot to examine genes and their taxonomic assignments
    • with single-clicks on labels you can hide individual groups
    • double-clicks hide every group except for the one that was clicked
    • hovering over data points shows additional information
    • the subdirectory “3D_plot_files” holds additional files for this plot and is required to display the plot (important when working with MobaXterm for example)
  • gene_table_taxon_assignment.csv raw_gene_table with PCA coordinates for each gene and their taxonomic assignment appended
    • this is a tabular representation of all information that is displayed in the 3D plot
    • see Additional information for details on the contained information
  • taxsun.tsv Input file to explore the taxonomic assignment of the genes with taxSun

PCA

  • contribution_of_variables.png|.pdf Figure illustrating how much each variable contributes to the first two principal components
  • genes_and_variables.png|.pdf Biplot of variables (vectors) and genes (points) in the new coordinate system defined by the first two principal components. Transparency represents the amount of contribution to the principal components
  • pca_loadings.csv Table listing the loadings of the original variables (rows) on the computed principal components (columns)
  • pca_summary.csv Table listing standard deviation, proportion of explained variance and cumulative proportion of explained variance in the original data for each of the principal components
  • scree_plot.png|.pdf Scree plot visualising the amount of variance in the original data that is explained by each of the principal components (here: dimensions)
  • parallel_analysis.png|.pdf Only available if parallel analysis was performed on the principal components. Shows the results of Horn’s parallel analysis: random eigenvalues for the given number of PCs together with adjusted and unadjusted eigenvalues, indicating which ones were retained for the subsequent PCA
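
The three quantities listed in pca_summary.csv are related by a simple calculation, sketched below. The sketch uses the standard definition (the variance of a component is its squared standard deviation, and the proportion is that variance divided by the total). The function name, the dictionary keys and the example values are hypothetical and do not reflect the actual column names of the file.

```python
def pca_summary(sdev):
    """Derive explained-variance proportions from per-component standard deviations."""
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    rows = []
    cumulative = 0.0
    for i, (s, var) in enumerate(zip(sdev, variances), start=1):
        proportion = var / total
        cumulative += proportion
        rows.append({
            "pc": f"PC{i}",
            "sd": s,
            "proportion": proportion,
            "cumulative": cumulative,
        })
    return rows


# Hypothetical standard deviations of three principal components.
rows = pca_summary([2.0, 1.0, 1.0])
for row in rows:
    print(row["pc"], round(row["proportion"], 3), round(row["cumulative"], 3))
```

The cumulative column of the last component is always 1 (up to rounding), which is a quick sanity check when reading the file.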

taxonomic_hits.txt

This file is the output of DIAMOND and holds the exact hits for each protein. The columns are the following:

  • qseqid Query Sequence ID (equal to the column fasta_header in 'gene_table_taxon_assignment.csv')
  • sseqid Subject Sequence ID - accession number of the matched protein in DB
  • pident Percentage of identical matches
  • length Alignment length
  • mismatch Number of mismatches
  • gapopen Number of gap openings
  • qstart Start of alignment in query
  • qend End of alignment in query
  • sstart Start of alignment in subject
  • send End of alignment in subject
  • evalue Expect value
  • bitscore Bit score
  • staxids Unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
  • ssciname Unique Subject Scientific Name(s), separated by a ';'
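
The column list above can be used to read the file programmatically. The following sketch assumes that the file is tab-separated, has no header line and contains the columns in exactly the order listed; this is DIAMOND's usual tabular layout but is not stated on this page. The helper names and the two example hits are made up for illustration.

```python
import csv
import io

# Column names in the order listed above (assumed to match the file layout).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore",
           "staxids", "ssciname"]


def read_hits(handle):
    """Parse tab-separated hits into dictionaries keyed by column name."""
    hits = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip empty lines
        hit = dict(zip(COLUMNS, fields))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        # staxids may hold several IDs separated by ';'
        hit["staxids"] = hit["staxids"].split(";") if hit["staxids"] else []
        hits.append(hit)
    return hits


def best_hit_per_query(hits):
    """Keep the hit with the highest bit score for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Two made-up hits for the same query; real files are opened with open(path).
example = io.StringIO(
    "gene1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t400.0\t9606;9598\tHomo sapiens;Pan troglodytes\n"
    "gene1\tsubjB\t80.0\t150\t30\t1\t1\t150\t2\t151\t1e-40\t150.0\t10090\tMus musculus\n"
)
hits = read_hits(example)
best = best_hit_per_query(hits)
print(best["gene1"]["sseqid"], best["gene1"]["staxids"])
```

Because qseqid equals the fasta_header column of gene_table_taxon_assignment.csv, the resulting dictionary can be used to look up the hits behind an individual gene's assignment.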
