DiaMet is a Python pipeline for taxonomic classification of undetermined reads from sequencing data. The pipeline assembles contigs using Megahit, performs protein-level classification using DIAMOND BLASTX against the Swiss-Prot database, and generates visualizations of the taxonomic distribution.
- Contig assembly with Megahit from undetermined reads
- Protein-level taxonomic classification using DIAMOND BLASTX against Swiss-Prot
- Duplicate removal to keep only the best hit per query sequence
- Taxonomic visualization with publication-ready bar plots
- Viral species identification with count summaries
- Automatic cleanup of intermediate files
- Requires a Swiss-Prot database in DIAMOND format (`swissprot.dmnd`)
1. Clone this repository:

       git clone https://github.com/yourusername/DiaMet.git
       cd DiaMet

2. Install Python dependencies:

       pip install pandas matplotlib

3. Ensure external tools are in your PATH, or modify the script with the correct paths:
   - Megahit
   - DIAMOND
   - seqkit

4. Update the DIAMOND database path in the script:

       # Change this line to point to your Swiss-Prot database
       diamond_command = "/path/to/your/diamond blastx -d /path/to/swissprot.dmnd ..."

5. Make the script executable and accessible from anywhere:

       chmod +x diamet.py
       sudo ln -s $(pwd)/diamet.py /usr/local/bin/diamet
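Before running the pipeline, it can help to confirm that the external tools are actually visible on your PATH. The following is a minimal sketch, not part of `diamet.py`; the executable names `megahit`, `diamond`, and `seqkit` are assumptions and may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them if your installs differ.
REQUIRED_TOOLS = ("megahit", "diamond", "seqkit")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

If a tool is reported as missing, either add its directory to PATH or use its absolute path in the script.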
1. Navigate to the directory containing your `undetermined_reads.fastq.gz` file:

       cd /path/to/your/data/directory

2. Run the pipeline:

       diamet
The script will:
- Create a `DiaMet` output directory in your current working directory
- Assemble contigs with Megahit
- Run DIAMOND BLASTX on the assembled contigs
- Run DIAMOND BLASTX on the original reads (ultra-sensitive mode)
- Remove duplicate hits
- Generate taxonomic classification plots
- Create a CSV file with viral species counts
- Clean up intermediate files
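The two DIAMOND steps above differ mainly in their input and in the sensitivity flag. The sketch below shows one way such a command could be assembled; `build_diamond_command` is a hypothetical helper rather than a function from `diamet.py`, and the exact options the script passes are not shown in this README. The flags used here (`-d`, `-q`, `-o`, `--outfmt 6` with named columns, `--ultra-sensitive`) are standard DIAMOND options, and the column list is taken from the output description in this README.

```python
import shlex

# Column list taken from the output description in this README.
OUTPUT_COLUMNS = ["qseqid", "qlen", "length", "sscinames", "sskingdoms"]


def build_diamond_command(query, output, database, ultra_sensitive=False,
                          diamond="diamond"):
    """Build a DIAMOND BLASTX argument list (hypothetical helper)."""
    command = [
        diamond, "blastx",
        "-d", database,
        "-q", query,
        "-o", output,
        "--outfmt", "6", *OUTPUT_COLUMNS,
    ]
    if ultra_sensitive:
        command.append("--ultra-sensitive")
    return command


if __name__ == "__main__":
    # The database path is a placeholder; replace it with your own.
    reads_cmd = build_diamond_command(
        query="undetermined_reads.fastq.gz",
        output="DiaMet/undetermined_reads_diamet.tsv",
        database="/path/to/swissprot.dmnd",
        ultra_sensitive=True,
    )
    print(shlex.join(reads_cmd))
```

Building the command as a list and passing it to `subprocess.run` avoids shell quoting problems when paths contain spaces.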
The script generates the following files in the DiaMet directory:
| File | Description |
|---|---|
| `undetermined_contigs_diamet.tsv` | DIAMOND results for assembled contigs (duplicates removed) |
| `undetermined_reads_diamet.tsv` | DIAMOND results for original reads (duplicates removed) |
| `undetermined_reads_diamet.pdf` | Taxonomic classification bar plot |
| `undetermined_reads_diamet_viral.csv` | Viral species counts |
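As a rough illustration of how the viral species counts could be derived from a DIAMOND results file, the stdlib-only sketch below tallies `sscinames` for rows whose `sskingdoms` value is `Viruses`. It assumes the TSV has no header row and uses the five-column layout of the DIAMOND output files. The function names and the CSV header are hypothetical, and the script itself may do this differently (for example with pandas).

```python
import csv
from collections import Counter

# Assumed column order of the DIAMOND TSV (no header row).
COLUMNS = ["qseqid", "qlen", "length", "sscinames", "sskingdoms"]


def count_viral_species(tsv_path):
    """Count hits per scientific name among rows labelled 'Viruses'."""
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
        for row in reader:
            if row["sskingdoms"] == "Viruses":
                counts[row["sscinames"]] += 1
    return counts


def write_counts(counts, csv_path):
    """Write species counts to CSV, most frequent first."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["species", "count"])  # header names are an assumption
        writer.writerows(counts.most_common())
```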
The DIAMOND output files (TSV format) contain the following columns:
- `qseqid`: Query sequence ID
- `qlen`: Query sequence length
- `length`: Alignment length
- `sscinames`: Scientific names of subject sequences
- `sskingdoms`: Kingdom-level taxonomic classification
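The "duplicates removed" step keeps one hit per query sequence. A minimal stdlib sketch of that idea follows, assuming the TSV has no header row and that each query's hits appear best-first in DIAMOND's output, so the first row per `qseqid` is the one to keep. The helper names are hypothetical, and the script's own criterion for "best" may differ.

```python
import csv

# Assumed column order of the DIAMOND TSV (no header row).
COLUMNS = ["qseqid", "qlen", "length", "sscinames", "sskingdoms"]


def read_hits(tsv_path):
    """Load a DIAMOND TSV into a list of dicts keyed by column name."""
    with open(tsv_path, newline="") as handle:
        return list(csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"))


def keep_first_hit_per_query(rows):
    """Keep the first row seen for each qseqid, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        if row["qseqid"] not in seen:
            seen.add(row["qseqid"])
            kept.append(row)
    return kept
```

With pandas, the equivalent operation would be `df.drop_duplicates(subset="qseqid", keep="first")`.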
The pipeline generates a publication-ready bar plot showing the taxonomic distribution of classified reads:
- Colors:
  - Eukaryota: Dark blue (#143642)
  - Bacteria: Teal (#108A8C)
  - Viruses: Red (#A81F1B)
  - Archaea: Orange (#EC9929)
- Features:
  - Log-scale y-axis for better visualization of low-abundance taxa
  - Grid lines for easy value estimation
  - Percentage of classified reads displayed in subtitle
  - Legend with taxonomic groups
  - Clean, minimal styling
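Matplotlib accepts colors as RGB tuples of floats between 0 and 1, so the hex codes listed above have to be converted if you want to write them in that form. The helper below is a small sketch for doing so; `hex_to_rgb` is a hypothetical name and is not part of `diamet.py`.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of floats in the range 0-1."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Read each pair of hex digits as one channel and scale it to 0-1.
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


if __name__ == "__main__":
    print(hex_to_rgb("#108A8C"))  # teal, equal to (16/255, 138/255, 140/255)
```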
To change the colors for different taxonomic groups, modify the `custom_colors` function:

    def custom_colors(entries):
        color_dict = {
            'Eukaryota': (20/255, 54/255, 66/255),    # Dark blue
            'Bacteria': (16/255, 138/255, 140/255),   # Teal
            'Viruses': (168/255, 31/255, 27/255),     # Red
            'Archaea': (236/255, 153/255, 41/255)     # Orange
        }
        return [color_dict[entry] for entry in entries]

The script uses `--ultra-sensitive` mode for read-level classification.
To modify sensitivity:

    # Change to faster mode
    command = f"... --sensitive"
    # Or to default mode
    command = f"... "  # Remove --ultra-sensitive

- "undetermined_reads.fastq.gz not found"
  - Ensure you're in the correct directory containing your input file
  - Check the file name spelling
- DIAMOND database path errors
  - Update the database path in both the `run_megahit_and_diamond()` and `run_diamond_blastx()` functions
  - Ensure the database is in DIAMOND format
- Missing external tools
  - Verify that Megahit, DIAMOND, and seqkit are installed and accessible
  - Check PATH or use absolute paths in the script
- "diamet: command not found"
  - Ensure the symbolic link was created correctly: `sudo ln -s $(pwd)/diamet.py /usr/bin/diamet`
  - Verify that `/usr/bin` is in your PATH
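The first two problems above can be caught before the pipeline starts. The following is a minimal sketch of such a check, assuming the input file name used throughout this README and a database file ending in `.dmnd`; `preflight_problems` is a hypothetical helper and is not part of `diamet.py`.

```python
from pathlib import Path

INPUT_NAME = "undetermined_reads.fastq.gz"


def preflight_problems(workdir, database):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if not (Path(workdir) / INPUT_NAME).is_file():
        problems.append(f"{INPUT_NAME} not found in {workdir}")
    db = Path(database)
    if db.suffix != ".dmnd":
        problems.append(f"{database} does not have a .dmnd extension")
    elif not db.is_file():
        problems.append(f"{database} does not exist")
    return problems


if __name__ == "__main__":
    # The database path is a placeholder; replace it with your own.
    for problem in preflight_problems(".", "/path/to/swissprot.dmnd"):
        print("Problem:", problem)
```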