DNA Sequence Vectorization for Bacterial Species and Subspecies Delimitation

This repository contains the source code and supplementary material for the study titled "DNA Sequence Vectorization for Bacterial Species and Subspecies Delimitation Using Machine Learning Classification Methods."

Note: This project uses kmertools4ml for k-mer counting. It is recommended that you install and build kmertools4ml before installing this code.

Introduction

The precise delimitation of bacterial species and subspecies remains a fundamental challenge in biology and microbiology. Traditional methods based on biological traits often fail to capture the genetic diversity and complexity of microbial life. This project aims to use alignment-free sequence methods to delimit bacterial species and subspecies through the vectorization of DNA sequences based on nucleotide frequencies and machine learning classification methods.

Abstract

This study aims to use alignment-free sequence methods to delimit bacterial species and subspecies through the vectorization of DNA sequences based on nucleotide frequencies and machine learning classification methods. We used the Genome Taxonomy Database (GTDB) Release 09-RS220 to calculate nucleotide frequencies and applied data transformations to the sequences before analysis. Preliminary results suggest that these methods can accurately distinguish between different bacterial species and subspecies.

Installation

Before installing this project, please ensure that you have installed and built kmertools4ml, as it is used for k-mer counting in this workflow.

Clone Repository

git clone https://github.com/aaguirreu/Kml.git
cd Kml

Conda Environment

Set up your Conda environment as follows:

conda env create -f environment.yml
conda activate kml

Usage

K-mer Frequency Counting Command: This command counts k-mer frequencies with the kmertools4ml utility, that is, the frequency of each subsequence of length k (for k from 2 to 10) in the DNA sequences in the specified directory:

kml -k 2-10 -d path-to-sequences-folder -o output-folder-path
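
To make concrete what a k-mer frequency table contains, here is a minimal pure-Python sketch that counts overlapping k-mers for each k from 2 to 10. It is an illustration only: in this project the counting is delegated to kmertools4ml, and the function names `count_kmers` and `kmer_profiles`, as well as the choice to skip windows containing non-ACGT characters, are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a DNA sequence (illustrative, not part of kml)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profiles(sequence: str, k_min: int = 2, k_max: int = 10) -> dict:
    """Return one k-mer count table per k in [k_min, k_max]."""
    return {k: count_kmers(sequence, k) for k in range(k_min, k_max + 1)}


if __name__ == "__main__":
    profiles = kmer_profiles("ACGTACGTNACG")
    print(profiles[2].most_common(3))
```

Each genome ends up with one count table per value of k; these tables are the raw input to the vectorization step.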

Vectorization and Model Evaluation Command: This command vectorizes the k-mer frequencies and evaluates the data using all available machine learning models:

kml -k 2-10 -va -ma -d path-to-sequences-folder -o output-folder-path

The flags -va and -ma enable vectorization analysis and model analysis respectively, processing the data through the complete machine learning pipeline.

Specific Model and Vectorization Method: This command uses a specific vectorization method (k-mer) and machine learning model (Random Forest):

kml -k 2-10 -v kmer -m rf -d path-to-sequences-folder -o output-folder-path

Where:

-v kmer: Specifies k-mer vectorization method
-m rf: Uses Random Forest model

Methods

Dataset

We used genomes from the Genome Taxonomy Database (GTDB) Release 09-RS220. The genomes were processed to calculate k-mer frequencies, which were then transformed into numerical vectors using the TF-IDF approach. These vectors were used as input for various machine learning models, including Random Forest, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
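
As a worked illustration of the TF-IDF step, the sketch below weights per-genome k-mer counts using the textbook formula tf × ln(N / df), where tf is the relative frequency of a k-mer within a genome, N is the number of genomes, and df is the number of genomes containing that k-mer. The function name `tfidf` and this exact variant are assumptions; library implementations such as scikit-learn's use a smoothed and normalized form, and the project's own settings may differ.

```python
import math
from collections import Counter


def tfidf(count_tables):
    """Convert per-genome k-mer counts into TF-IDF weights (textbook variant).

    count_tables: list of dicts mapping k-mer -> count, one dict per genome.
    Returns a list of dicts mapping k-mer -> TF-IDF weight.
    """
    n_docs = len(count_tables)
    # Document frequency: in how many genomes does each k-mer occur?
    doc_freq = Counter()
    for counts in count_tables:
        doc_freq.update(counts.keys())

    weighted = []
    for counts in count_tables:
        total = sum(counts.values())
        weighted.append({
            kmer: (c / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, c in counts.items()
        })
    return weighted


if __name__ == "__main__":
    # Toy counts for two hypothetical genomes; the values are made up.
    tables = [{"AC": 2, "GT": 2}, {"AC": 1, "TT": 3}]
    for row in tfidf(tables):
        print(row)
```

A k-mer present in every genome (here "AC") receives weight zero under this variant, so the weighting emphasizes k-mers that help tell genomes apart.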

Implementation

The project is implemented in Python and leverages several key libraries:

BioPython for biological sequence manipulation.
NumPy for numerical operations.
Scikit-learn for machine learning and vectorization.
Matplotlib for plotting and visualizations.
Pandas for data manipulation.

Key Components

K-mer Frequencies Calculation: Extract k-mers from DNA sequences and count their occurrences using kmertools4ml.
Vectorization: Transform k-mer counts into TF-IDF vectors.
Machine Learning Models: Train and evaluate models to classify DNA sequences.
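
The three components above can be sketched end to end with scikit-learn. This is a hedged illustration, not the code in this repository: the toy counts and labels are invented, and the combination of `DictVectorizer`, `TfidfTransformer`, and `RandomForestClassifier` with these parameters is an assumption about how such a pipeline could be wired together.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy k-mer counts (k = 2) for four hypothetical genomes; values are made up.
kmer_counts = [
    {"AC": 9, "CG": 1, "GT": 8, "TA": 2},
    {"AC": 8, "CG": 2, "GT": 9, "TA": 1},
    {"AC": 1, "CG": 9, "GT": 2, "TA": 8},
    {"AC": 2, "CG": 8, "GT": 1, "TA": 9},
]
labels = ["species_A", "species_A", "species_B", "species_B"]

pipeline = make_pipeline(
    DictVectorizer(sparse=True),   # dicts of counts -> sparse count matrix
    TfidfTransformer(),            # counts -> TF-IDF weighted vectors
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(kmer_counts, labels)

# Classify an unseen (also hypothetical) genome from its k-mer counts.
query = [{"AC": 7, "CG": 2, "GT": 9, "TA": 1}]
print(pipeline.predict(query))
```

The final estimator can be swapped for `sklearn.svm.SVC` or `sklearn.neighbors.KNeighborsClassifier` to compare the other model families mentioned in this README; a real evaluation would use held-out data or cross-validation rather than a four-genome toy set.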

Contributors

Alvaro Aguirre Ulloa – Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile
Jorge R. Vergara – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile
Diego Fuentealba – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile
Raquel Quatrini – Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Santiago 8580704, Chile. Facultad de Ciencias, Universidad San Sebastián, Santiago 7510602, Chile
Ana Moya-Beltrán – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile

License

This project is licensed under the MIT License. See the LICENSE file for details.
