DiaMet

DiaMet is a Python pipeline for taxonomic classification of undetermined reads from sequencing data. The pipeline assembles contigs using Megahit, performs protein-level classification using DIAMOND BLASTX against the Swiss-Prot database, and generates visualizations of the taxonomic distribution.

Features

- Contig assembly with Megahit from undetermined reads
- Protein-level taxonomic classification using DIAMOND BLASTX against Swiss-Prot
- Duplicate removal to keep only the best hit per query sequence
- Taxonomic visualization with publication-ready bar plots
- Viral species identification with count summaries
- Automatic cleanup of intermediate files

Requirements

Dependencies

Python 3.6+
Required Python packages:
```
pandas
matplotlib
```
External tools:
- Megahit
- DIAMOND
- seqkit

Database

Swiss-Prot database in DIAMOND format (swissprot.dmnd)

Installation

Clone this repository:

```
git clone https://github.com/yourusername/DiaMet.git
cd DiaMet
```

Install Python dependencies:
```
pip install pandas matplotlib
```
Ensure external tools are in your PATH or modify the script with the correct paths:
- Megahit
- DIAMOND
- seqkit
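
A quick way to confirm that the external tools are reachable before running the pipeline is sketched below. It assumes the executables are named `megahit`, `diamond`, and `seqkit` on your system; the helper `missing_tools` is hypothetical and not part of diamet.py.

```python
import shutil

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("megahit", "diamond", "seqkit")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool is reported as missing, either add its directory to PATH or use its absolute path in the script.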

Update the DIAMOND database path in the script:

```
# Change this line to point to your Swiss-Prot database
diamond_command = "/path/to/your/diamond blastx -d /path/to/swissprot.dmnd ..."
```

Make the script executable and accessible from anywhere:

```
chmod +x diamet.py
sudo ln -s $(pwd)/diamet.py /usr/local/bin/diamet
```

Usage

Navigate to the directory containing your undetermined_reads.fastq.gz file:
```
cd /path/to/your/data/directory
```
Run the pipeline:
```
diamet
```

The script will:

- Create a DiaMet output directory in your current working directory
- Assemble contigs with Megahit
- Run DIAMOND BLASTX on the assembled contigs
- Run DIAMOND BLASTX on the original reads (ultra-sensitive mode)
- Remove duplicate hits
- Generate taxonomic classification plots
- Create a CSV file with viral species counts
- Clean up intermediate files
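
The assembly and classification steps above could be expressed as command lists along the following lines. This is a sketch under several assumptions, not the commands diamet.py actually runs: the reads are treated as single-end (Megahit `-r`), the output paths are hypothetical, the contig search uses DIAMOND's default sensitivity, and the tabular columns are taken from the Output Format Details section. The `sscinames` and `sskingdoms` fields are only available when the DIAMOND database was built with taxonomy information.

```python
import os
import shlex

# Columns listed under "Output Format Details" in this README.
OUTFMT_FIELDS = ["qseqid", "qlen", "length", "sscinames", "sskingdoms"]


def megahit_cmd(reads, out_dir):
    """Single-end assembly; Megahit expects out_dir not to exist yet."""
    return ["megahit", "-r", reads, "-o", out_dir]


def diamond_cmd(query, db, out_tsv, sensitivity=None):
    """DIAMOND BLASTX with tabular output (format 6) and taxonomy columns."""
    cmd = ["diamond", "blastx", "-d", db, "-q", query, "-o", out_tsv,
           "--outfmt", "6"] + OUTFMT_FIELDS
    if sensitivity:
        cmd.append(sensitivity)  # e.g. "--ultra-sensitive"
    return cmd


if __name__ == "__main__":
    db = "/path/to/swissprot.dmnd"
    asm_dir = os.path.join("DiaMet", "megahit_out")  # hypothetical path
    # Megahit writes its assembly to final.contigs.fa inside the output directory.
    contigs = os.path.join(asm_dir, "final.contigs.fa")
    steps = [
        megahit_cmd("undetermined_reads.fastq.gz", asm_dir),
        diamond_cmd(contigs, db, "DiaMet/undetermined_contigs_diamet.tsv"),
        diamond_cmd("undetermined_reads.fastq.gz", db,
                    "DiaMet/undetermined_reads_diamet.tsv", "--ultra-sensitive"),
    ]
    for cmd in steps:
        print(" ".join(shlex.quote(part) for part in cmd))
```

Building the commands as lists makes it possible to pass them to `subprocess.run(cmd, check=True)` without shell quoting issues.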

Output Files

The script generates the following files in the DiaMet directory:

| File | Description |
|------|-------------|
| `undetermined_contigs_diamet.tsv` | DIAMOND results for assembled contigs (duplicates removed) |
| `undetermined_reads_diamet.tsv` | DIAMOND results for original reads (duplicates removed) |
| `undetermined_reads_diamet.pdf` | Taxonomic classification bar plot |
| `undetermined_reads_diamet_viral.csv` | Viral species counts |

Output Format Details

The DIAMOND output files (TSV format) contain the following columns:

- qseqid: Query sequence ID
- qlen: Query sequence length
- length: Alignment length
- sscinames: Scientific name(s) of the organism the subject sequence belongs to
- sskingdoms: Superkingdom of the subject sequence (Eukaryota, Bacteria, Viruses, or Archaea)
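
As an illustration of how these columns can be used, the sketch below keeps one hit per query and counts hits per superkingdom and per viral species. It uses only the standard library rather than pandas, and it assumes the TSV has no header row and that the first row for each `qseqid` is the best hit. The function names and the sample rows are made up for the example and do not come from diamet.py.

```python
import csv
import io
from collections import Counter

COLUMNS = ["qseqid", "qlen", "length", "sscinames", "sskingdoms"]


def best_hits(handle):
    """Keep only the first row seen for each query ID."""
    first = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        first.setdefault(row["qseqid"], row)
    return list(first.values())


def kingdom_counts(hits):
    """Number of queries assigned to each superkingdom."""
    return Counter(hit["sskingdoms"] for hit in hits)


def viral_species_counts(hits):
    """Number of queries assigned to each viral species."""
    return Counter(hit["sscinames"] for hit in hits
                   if hit["sskingdoms"] == "Viruses")


# Made-up rows in the five-column layout described above.
SAMPLE = (
    "read1\t150\t48\tEscherichia coli\tBacteria\n"
    "read1\t150\t40\tSalmonella enterica\tBacteria\n"
    "read2\t150\t50\tTobacco mosaic virus\tViruses\n"
    "read3\t150\t45\tTobacco mosaic virus\tViruses\n"
    "read4\t150\t47\tHomo sapiens\tEukaryota\n"
)

if __name__ == "__main__":
    hits = best_hits(io.StringIO(SAMPLE))
    print(kingdom_counts(hits))
    print(viral_species_counts(hits))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` sample.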

Visualization

The pipeline generates a publication-ready bar plot showing the taxonomic distribution of classified reads:

Colors:
- Eukaryota: Dark blue (#143642)
- Bacteria: Teal (#108A8C)
- Viruses: Red (#A81F1B)
- Archaea: Orange (#EC9929)
Features:
- Log-scale y-axis for better visualization of low-abundance taxa
- Grid lines for easy value estimation
- Percentage of classified reads displayed in subtitle
- Legend with taxonomic groups
- Clean, minimal styling
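
A plot with these properties can be approximated with matplotlib as sketched below. This is not the plotting code from diamet.py: the function `plot_taxa`, its arguments, the axis label, and the subtitle wording are assumptions made for illustration, and the colors are the RGB tuples from the `custom_colors` function shown under Customization.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no display is required
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# RGB tuples as in the custom_colors function documented in this README.
KINGDOM_COLORS = {
    "Eukaryota": (20/255, 54/255, 66/255),
    "Bacteria": (16/255, 138/255, 140/255),
    "Viruses": (168/255, 31/255, 27/255),
    "Archaea": (236/255, 153/255, 41/255),
}


def plot_taxa(counts, kingdoms, total_reads, out_path):
    """Bar plot of read counts per taxon, colored by superkingdom.

    counts maps taxon name to read count, kingdoms maps taxon name to
    superkingdom, total_reads is the number of input reads.
    """
    names = sorted(counts, key=counts.get, reverse=True)
    classified_pct = 100 * sum(counts.values()) / total_reads

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(names, [counts[n] for n in names],
           color=[KINGDOM_COLORS[kingdoms[n]] for n in names])
    ax.set_yscale("log")  # keeps low-abundance taxa visible
    ax.grid(axis="y", which="both", linestyle=":", linewidth=0.5)
    ax.set_axisbelow(True)  # draw grid lines behind the bars
    ax.set_ylabel("Read count")
    ax.set_title(f"{classified_pct:.1f}% of reads classified", fontsize=9)
    ax.tick_params(axis="x", labelrotation=90)
    for side in ("top", "right"):  # minimal styling
        ax.spines[side].set_visible(False)

    present = sorted({kingdoms[n] for n in names})
    ax.legend(handles=[Patch(color=KINGDOM_COLORS[k], label=k) for k in present],
              frameon=False)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return ax
```

Calling `plot_taxa(counts, kingdoms, total_reads, "taxa.pdf")` writes the figure to the given path; the output format follows the file extension.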

Example Output

Customization

Modifying Taxa Colors

To change the colors for different taxonomic groups, modify the custom_colors function:

```
def custom_colors(entries):
    color_dict = {
        'Eukaryota': (20/255, 54/255, 66/255),   # Dark blue
        'Bacteria': (16/255, 138/255, 140/255),  # Teal
        'Viruses': (168/255, 31/255, 27/255),    # Red
        'Archaea': (236/255, 153/255, 41/255)    # Orange
    }
    return [color_dict[entry] for entry in entries]
```
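
If you prefer to pick colors as hex codes, a small helper can convert them to the 0 to 1 RGB tuples that `custom_colors` uses. The helper `hex_to_rgb` below is hypothetical and not part of diamet.py; matplotlib also accepts hex strings directly, so the conversion is optional.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' (or 'RRGGBB') to a tuple of floats in [0, 1]."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


if __name__ == "__main__":
    # Teal, as used for Bacteria in this README.
    print(hex_to_rgb("#108A8C"))
```

The returned tuple can be dropped into `color_dict` in place of a hand-written `(r/255, g/255, b/255)` entry.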

Changing DIAMOND Sensitivity

The script uses --ultra-sensitive mode for read-level classification. To modify sensitivity:

```
# Change to faster mode
command = f"... --sensitive"

# Or to default mode
command = f"... "  # Remove --ultra-sensitive
```

Troubleshooting

Common Issues

“undetermined_reads.fastq.gz not found”
- Ensure you’re in the correct directory containing your input file
- Check file name spelling
DIAMOND database path errors
- Update the database path in both run_megahit_and_diamond() and run_diamond_blastx() functions
- Ensure the database is in DIAMOND format
Missing external tools
- Verify Megahit, DIAMOND, and seqkit are installed and accessible
- Check PATH or use absolute paths in the script
“diamet: command not found”
- Ensure the symbolic link was created correctly: sudo ln -s $(pwd)/diamet.py /usr/local/bin/diamet
- Verify that /usr/local/bin is in your PATH

Acknowledgments

- Megahit for efficient contig assembly
- DIAMOND for fast protein-level classification
- seqkit for sequence manipulation
- Swiss-Prot database for protein sequences and annotations
