This repository contains the source code and supplementary material for the study titled "DNA Sequence Vectorization for Bacterial Species and Subspecies Delimitation Using Machine Learning Classification Methods."
Note: This project uses kmertools4ml for k-mer counting. It is recommended that you install and build kmertools4ml before installing this code.
The precise delimitation of bacterial species and subspecies remains a fundamental challenge in biology and microbiology. Traditional methods based on biological traits often fail to capture the genetic diversity and complexity of microbial life. This project aims to use alignment-free sequence methods to delimit bacterial species and subspecies through the vectorization of DNA sequences based on nucleotide frequencies and machine learning classification methods.
We used the Genome Taxonomy Database (GTDB) Release 09-RS220 to calculate nucleotide frequencies and transformed the resulting counts into numerical vectors for analysis. Preliminary results suggest that these methods can accurately distinguish between bacterial species and subspecies.
Before installing this project, please ensure that you have installed and built kmertools4ml, as it is used for k-mer counting in this workflow.
git clone https://github.com/aaguirreu/Kml.git
cd Kml
Set up your Conda environment as follows:
conda env create -f environment.yml
conda activate kml
K-mer Frequency Counting Command: This command counts k-mer frequencies with the kmertools4ml utility, which computes the frequency of every subsequence of length k (for k from 2 to 10) in the DNA sequences in the specified directory:
kml -k 2-10 -d path-to-sequences-folder -o output-folder-path
Vectorization and Model Evaluation Command: This command vectorizes the k-mer frequencies and evaluates the data using all available machine learning models:
kml -k 2-10 -va -ma -d path-to-sequences-folder -o output-folder-path
The flags -va and -ma enable vectorization analysis and model analysis, respectively, processing the data through the complete machine learning pipeline.
Specific Model and Vectorization Method: This command uses a specific vectorization method (k-mer) and machine learning model (Random Forest):
kml -k 2-10 -v kmer -m rf -d path-to-sequences-folder -o output-folder-path
Where:
- -v kmer: Specifies the k-mer vectorization method.
- -m rf: Uses the Random Forest model.
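To make the k-mer counting step in the commands above concrete, here is a minimal pure-Python sketch of what counting k-mers for k from 2 to 10 involves. It is not the kmertools4ml implementation: the function names `count_kmers` and `kmer_frequencies` are hypothetical, and the choices to count overlapping windows, skip k-mers containing ambiguous bases, and report relative frequencies are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers in one DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: keep only k-mers made entirely of A, C, G, T,
        # which drops windows containing N or other ambiguity codes.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_frequencies(sequence, k_min=2, k_max=10):
    """Return {k: {kmer: relative frequency}} for every k in [k_min, k_max]."""
    result = {}
    for k in range(k_min, k_max + 1):
        counts = count_kmers(sequence, k)
        total = sum(counts.values())
        # An empty dict marks sequences shorter than k (no windows to count).
        result[k] = {kmer: n / total for kmer, n in counts.items()} if total else {}
    return result
```

Calling `kmer_frequencies(seq)` on a sequence string returns one frequency table per value of k. A dedicated tool is preferable for whole genomes, since a Python loop like this is slow at that scale.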
We used genomes from the Genome Taxonomy Database (GTDB) Release 09-RS220. The genomes were processed to calculate k-mer frequencies, which were then transformed into numerical vectors using the TF-IDF approach. These vectors were used as input for various machine learning models, including Random Forest, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
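The TF-IDF transformation mentioned above can be sketched with the standard library alone. This is an illustration rather than the project's code: the function name `tfidf` is hypothetical, each genome is assumed to be represented as a dict of k-mer counts, and the weighting assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which matches the defaults of Scikit-learn's `TfidfTransformer`. The settings actually used in the study may differ.

```python
import math


def tfidf(count_rows):
    """Turn per-genome k-mer counts into L2-normalized TF-IDF vectors.

    count_rows is a list of dicts mapping k-mer -> raw count (assumed layout).
    """
    n = len(count_rows)
    # Document frequency: number of genomes in which each k-mer occurs.
    df = {}
    for row in count_rows:
        for kmer, c in row.items():
            if c > 0:
                df[kmer] = df.get(kmer, 0) + 1
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {kmer: math.log((1 + n) / (1 + d)) + 1 for kmer, d in df.items()}

    vectors = []
    for row in count_rows:
        weighted = {kmer: c * idf[kmer] for kmer, c in row.items() if c > 0}
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        # Scale each vector to unit length; empty rows stay empty.
        vectors.append({kmer: w / norm for kmer, w in weighted.items()} if norm else {})
    return vectors
```

K-mers shared by every genome receive the minimum weight, while k-mers found in few genomes are weighted up, which is what makes the vectors useful for separating closely related taxa.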
The project is implemented in Python and leverages several key libraries:
- BioPython for biological sequence manipulation.
- NumPy for numerical operations.
- Scikit-learn for machine learning and vectorization.
- Matplotlib for plotting and visualizations.
- Pandas for data manipulation.
- K-mer Frequencies Calculation: Extract k-mers from DNA sequences and count their occurrences using kmertools4ml.
- Vectorization: Transform k-mer counts into TF-IDF vectors.
- Machine Learning Models: Train and evaluate models to classify DNA sequences.
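As a rough illustration of the final step in the list above, the following standard-library sketch classifies sparse vectors with a k-nearest-neighbors vote and estimates accuracy by leave-one-out evaluation. It stands in for the Scikit-learn models the project uses and is not taken from the repository: the function names, the cosine similarity measure, and the leave-one-out protocol are all assumptions chosen to keep the example short.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(key, 0.0) for key, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(range(len(train)), key=lambda i: cosine(train[i], query), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(vectors, labels, k=3):
    """Hold out each vector in turn, classify it with the rest, return the hit rate."""
    hits = 0
    for i, query in enumerate(vectors):
        train = vectors[:i] + vectors[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(train, train_labels, query, k) == labels[i]
    return hits / len(vectors)
```

With real data, `vectors` would hold one TF-IDF vector per genome and `labels` the corresponding species or subspecies names. A held-out test set or stratified cross-validation would be the more usual choice for large datasets.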
- Alvaro Aguirre Ulloa – Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile
- Jorge R. Vergara – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile
- Diego Fuentealba – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile
- Raquel Quatrini – Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Santiago 8580704, Chile. Facultad de Ciencias, Universidad San Sebastián, Santiago 7510602, Chile
- Ana Moya-Beltrán – Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana de Chile, Santiago, Chile
This project is licensed under the MIT License. See the LICENSE file for details.