Non-microbial organisms in training database #30
There are a number of non-microbial organisms in the training database. This is significantly slowing down the training step, as CMash was designed with small microbial organisms in mind. For example, the largest files I find are all Eukaryota (plants and the like):
| taxonomy | file name | compressed file size |
|---|---|---|
| Eukaryota,Viridiplantae,Streptophyta | taxid_69332_genomic.fna.gz | 457M |
| Eukaryota,Sar,Alveolata | taxid_1563115_genomic.fna.gz | 234M |
| Eukaryota,Sar,Alveolata | taxid_2951_genomic.fna.gz | 229M |
| Eukaryota,Sar,Alveolata | taxid_1563116_genomic.fna.gz | 207M |
| Eukaryota,Sar,Alveolata | taxid_1280413_genomic.fna.gz | 190M |
| Eukaryota,Sar,Stramenopiles | taxid_88149_genomic.fna.gz | 169M |
| Eukaryota,Opisthokonta,Fungi | taxid_44941_genomic.fna.gz | 163M |
| Eukaryota,Sar,Alveolata | taxid_1172189_2_genomic.fna.gz | 145M |
| Eukaryota,Sar,Stramenopiles | taxid_4781_0_genomic.fna.gz | 106M |
| Eukaryota,Rhodophyta,Florideophyceae | taxid_38544_genomic.fna.gz | 104M |
| Eukaryota,Sar,Stramenopiles | taxid_162140_1_genomic.fna.gz | 93M |
| Eukaryota,Opisthokonta,Fungi | taxid_462795_0_genomic.fna.gz | 92M |
| Eukaryota,Sar,Stramenopiles | taxid_162130_1_genomic.fna.gz | 91M |
| Eukaryota,Viridiplantae,Chlorophyta | taxid_3046_genomic.fna.gz | 89M |
| Eukaryota,Viridiplantae,Chlorophyta | taxid_36881_genomic.fna.gz | 83M |
In case you're interested in reproducing, this was done with ETE3 via:
paste -d'|' <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15 | cut -d'_' -f2 | xargs -I{} sh -c "ete3 ncbiquery --search {} --info | cut -d',' -f3-5 | sed -n 2p") <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15) <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15 | xargs -I{} sh -c "du -sh /data/dmk333/repos/Metalign/data/organism_files/{} | cut -f1") | sed 's/^/|/g' | sed 's/$/|/g'
Given that the median compressed file size in organism_files is 1.012MB, these are definitely outliers.
Check median via:
find /data/dmk333/repos/Metalign/data/organism_files -name "*.gz" | xargs -I{} du -s {} | sort -n | awk -f median.awk
with median.awk:
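#!/usr/bin/awk -f
# Print the median of the first column; input must already be sorted numerically.
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        # Odd number of rows: take the middle value.
        print count[(NR + 1) / 2];
    } else {
        # Even number of rows: average the two middle values.
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}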
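To find every outlier rather than only the 15 largest files, the median check above could be wrapped into a small helper. The following is only a sketch: the function names `median` and `list_outliers` and the default factor of 50 are hypothetical choices for illustration, not part of Metalign or CMash. It measures sizes in bytes with `wc -c` rather than in `du` blocks.

```shell
#!/bin/sh
# median: print the median of newline-separated numbers read from stdin.
# (Hypothetical helper; same logic as median.awk, with the sort built in.)
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1                      # no input: signal failure
            if (NR % 2) print v[(NR + 1) / 2]        # odd count: middle value
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# list_outliers DIR [FACTOR]: print "bytes<TAB>path", largest first, for every
# *.gz file under DIR whose size exceeds FACTOR times the median size.
# FACTOR defaults to 50, an arbitrary threshold chosen for illustration.
list_outliers() {
    dir=$1
    factor=${2:-50}
    tmp=$(mktemp) || return 1

    # Record the byte size of each compressed file.
    find "$dir" -type f -name '*.gz' | while IFS= read -r f; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done > "$tmp"

    # Median of the size column; bail out if no files were found.
    med=$(cut -f1 "$tmp" | median) || { rm -f "$tmp"; return 1; }

    # Keep only the files larger than factor * median.
    awk -F'\t' -v m="$med" -v k="$factor" '$1 > m * k' "$tmp" | sort -rn
    rm -f "$tmp"
}
```

After sourcing these definitions, something like `list_outliers /data/dmk333/repos/Metalign/data/organism_files 50` would list the candidate files to exclude. The right factor depends on how aggressively large genomes should be filtered.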