QualiBact

Source code for QualiBact, a tool for analyzing microbial genome assembly statistics across multiple species. It compares allthebacteria assemblies to NCBI RefSeq assemblies and generates detailed statistics, Isolation Forest outlier detection, and visualizations.

Website: https://happykhan.github.io/qualibact/

These are the criteria used in Speccheck: https://github.com/happykhan/speccheck

Contributing to QualiBact

We welcome contributions to QualiBact! Contributors who make major contributions to source code, manuscript development, metric validation and adoption, or who provide additional data for calibrating quality thresholds, will be granted authorship on publications. Pull requests are welcome through GitHub.

Please read CONTRIBUTING.

📊 Features

  • Genome Assembly Analysis: Parses genome assembly statistics across multiple species
  • Comparative Analysis: Compares allthebacteria assemblies to NCBI RefSeq assemblies
  • Outlier Detection: Uses Isolation Forest for anomaly detection
  • Data Visualization: Generates comprehensive plots and statistics
  • Data Processing: Includes utilities for merging and processing TSV files
  • Web content: Produces markdown files for publishing to the website (https://happykhan.github.io/qualibact/)
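
To illustrate the outlier detection feature, here is a minimal sketch of flagging unusual assemblies with scikit-learn's `IsolationForest`. It is not taken from the QualiBact source: the function name `flag_outliers`, the column names `total_length` and `gc_percent`, the `contamination` value, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def flag_outliers(stats, columns, contamination=0.05, seed=0):
    """Return a copy of `stats` with `anomaly_score` and `is_outlier` columns.

    Hypothetical helper; the column names and contamination rate are
    illustrative assumptions, not QualiBact's actual settings.
    """
    features = stats[columns].to_numpy()
    model = IsolationForest(contamination=contamination, random_state=seed)
    model.fit(features)
    result = stats.copy()
    # decision_function: lower (more negative) values are more anomalous.
    result["anomaly_score"] = model.decision_function(features)
    # predict returns -1 for outliers and 1 for inliers.
    result["is_outlier"] = model.predict(features) == -1
    return result


# Synthetic assembly statistics with one deliberately extreme genome.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "total_length": rng.normal(5_000_000, 100_000, 200),
    "gc_percent": rng.normal(50.8, 0.3, 200),
})
demo.loc[0, ["total_length", "gc_percent"]] = [9_000_000, 35.0]

flagged = flag_outliers(demo, ["total_length", "gc_percent"])
print(flagged[flagged["is_outlier"]].head())
```

The `contamination` parameter sets the expected fraction of outliers, so it directly controls how many assemblies are flagged and should be chosen per dataset.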

🚀 Usage

Main Analysis Script

python qualibact-run.py \
    --workdir <working_directory> \
    --species_file <species_file.txt> \
    --min_genome_count <min_count>
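
For readers who want to wrap or script this command, the sketch below mirrors the three documented flags with `argparse`. It is a hypothetical stand-in, not the parser used by `qualibact-run.py`: the flag names come from the command above, while the types, the default for `--min_genome_count`, and the help text are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented qualibact-run.py flags."""
    parser = argparse.ArgumentParser(
        description="Illustrative mirror of the QualiBact command-line flags"
    )
    parser.add_argument("--workdir", required=True,
                        help="Working directory (assumed to hold inputs and outputs)")
    parser.add_argument("--species_file", required=True,
                        help="Text file of species (format is an assumption)")
    # The integer type and default of 100 are assumptions for this example.
    parser.add_argument("--min_genome_count", type=int, default=100,
                        help="Minimum number of genomes per species")
    return parser


args = build_parser().parse_args([
    "--workdir", "work",
    "--species_file", "species.txt",
    "--min_genome_count", "50",
])
print(args.workdir, args.species_file, args.min_genome_count)
```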

📤 Output

Analysis Results

  • Per-species plots (*.png) and CSV summaries (summary.csv, selected_summary.csv)
  • Combined summary tables across species (all_metrics.csv, all_metrics_summary.csv)
  • Outlier visualizations with anomaly scores and joint KDEs
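
As one way to work with the per-species CSV summaries, the sketch below gathers every `summary.csv` under a working directory into a single pandas table. The one-folder-per-species layout, the `collect_summaries` helper, and the toy `metric` and `median` columns are assumptions for illustration; the real output layout may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_summaries(workdir):
    """Concatenate <workdir>/<species>/summary.csv files (assumed layout)."""
    frames = []
    for path in sorted(Path(workdir).glob("*/summary.csv")):
        frame = pd.read_csv(path)
        # Record which species folder each row came from.
        frame.insert(0, "species", path.parent.name)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Toy data written to a temporary directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for species, value in [("Escherichia_coli", 150000),
                           ("Klebsiella_pneumoniae", 200000)]:
        folder = Path(tmp) / species
        folder.mkdir()
        pd.DataFrame({"metric": ["N50"], "median": [value]}).to_csv(
            folder / "summary.csv", index=False
        )
    combined = collect_summaries(tmp)

print(combined)
```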

Data Processing

  • merged.tsv: Combined TSV file with filename tracking (generated by make_copy.sh)
  • GC content analysis results in output_compare_gc/ directory
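
The idea of merging TSV files with filename tracking can be sketched in Python as follows. This is not a port of `make_copy.sh`, whose exact behavior is not shown here; the `merge_tsvs` helper, the `filename` column name, and the requirement that all inputs share one header are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def merge_tsvs(paths, out_path, filename_column="filename"):
    """Merge TSV files into one, prefixing each row with its source file name."""
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in paths:
            with open(path, newline="") as handle:
                reader = csv.reader(handle, delimiter="\t")
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # Write the shared header once, with the tracking column.
                    header = file_header
                    writer.writerow([filename_column] + header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow([Path(path).name] + row)


# Self-contained demo with two small input files.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.tsv"
    b = Path(tmp) / "b.tsv"
    a.write_text("sample\tN50\ns1\t100\n")
    b.write_text("sample\tN50\ns2\t200\n")
    merged = Path(tmp) / "merged.tsv"
    merge_tsvs([a, b], merged)
    merged_lines = merged.read_text().splitlines()

print("\n".join(merged_lines))
```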

🛠️ Additional Scripts

  • make_copy.sh: Utility script for merging multiple TSV files with filename tracking
  • find_failed_jobs.py: Helper script for identifying failed processing jobs
  • gc_refseq/do_gc_refseq.py: GC content analysis for RefSeq data
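
For context on the GC content analysis, here is a minimal sketch of computing GC percentage from FASTA text. It does not reproduce `gc_refseq/do_gc_refseq.py`; the helper names `read_fasta` and `gc_percent` are hypothetical, and excluding ambiguous bases such as `N` from the denominator is an assumption that the real script may not share.

```python
def read_fasta(lines):
    """Yield one sequence string per FASTA record from an iterable of lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
                chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def gc_percent(sequences):
    """GC percentage over all sequences, counting only A, C, G and T."""
    gc = total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        # Assumption: ambiguous bases (e.g. N) are ignored entirely.
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else float("nan")


demo_fasta = ">contig1\nACGT\nGGCC\n>contig2\nATNN\n"
gc_value = gc_percent(read_fasta(demo_fasta.splitlines()))
print(gc_value)  # 6 of 10 unambiguous bases are G or C
```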

📦 Dependencies

  • Python ≥ 3.7
  • pandas
  • numpy
  • seaborn
  • matplotlib
  • scipy
  • scikit-learn

Install them with:

pip install -r requirements.txt
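
After installing, a quick check like the sketch below can confirm that the listed packages are importable. The `missing_packages` helper is hypothetical and not part of the repository; the only non-obvious mapping is that scikit-learn is imported as `sklearn`.

```python
import importlib.util
import sys

# Package name as listed above -> module name used for import.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the listed package names whose modules cannot be found."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


python_ok = sys.version_info >= (3, 7)
missing = missing_packages()
print("Python version OK:", python_ok)
print("Missing packages:", missing or "none")
```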

📬 Contributing

If you find issues or have suggestions, feel free to open an issue or submit a pull request!
