PIPELINE TO GENERATE SYNTHETIC SAMPLES AND PERFORM ML ON AMR

The Snakefile in this directory was used to generate synthetic bulk DNA samples. The samples are generated in a classical way, starting from reference genomes, without employing any ML generative model. These synthetic samples were used to train ML models that were then tested on real data. The goal of the analysis is to quantify the abundance of AMR bacteria in a sample. For sample generation, the pipeline relies on the software MetaGeSim https://github.com/EttoreRocchi/MetaGeSim-AMR (and its older version CAMISIM https://github.com/CAMI-challenge/CAMISIM). Note that the MetaGeSim version has been slightly modified for this project, and the updates still need to be pushed. To explain how the pipeline works, it is useful to divide the process into three parts:

Read library simulation
Alignment
Machine learning

Read library generation

Read library generation is the procedure we implemented to obtain simulated read libraries starting from genome sequences (.fasta files). This job is handled by MetaGeSim. The command used is reported in the Snakefile under the rule "camisim". This rule generates the reads in the anonymous mode of the software, then runs a script that separates the forward and reverse reads into the two paired-end files.

For this command to work, you need to set up a population directory, where all the data produced for that population will be stored. To automate the setup, we implemented the rule "population set up". This rule launches a script that generates the input files required by MetaGeSim, starting from the information contained in input.json and input.tsv. If further automation is needed, the basic information can be stored in a row of "Database Campioni"; running the Python script "Samples_from_database.py" then creates a folder containing the input files, if that folder is not already present. With this procedure, many parameters of the MetaGeSim execution are set to default values that we found appropriate for our goal. To change those parameters, modify the scripts "my_scripts/generate_files.py" or "Samples_from_database.py" and follow the CAMISIM manual.
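The step that separates forward and reverse reads can be illustrated with a minimal sketch. It is not the script used by the "camisim" rule: it assumes the anonymised output is an uncompressed, interleaved FASTQ in which records alternate forward, reverse, forward, reverse, and the function name split_interleaved_fastq is hypothetical.

```python
from itertools import islice


def split_interleaved_fastq(interleaved, forward_out, reverse_out):
    """Split an interleaved FASTQ file into forward and reverse files.

    Assumption (not taken from the pipeline): records alternate
    forward, reverse, ... and each record spans exactly four lines.
    Returns the number of read pairs written.
    """
    pairs = 0
    with open(interleaved) as src, \
            open(forward_out, "w") as fwd, \
            open(reverse_out, "w") as rev:
        while True:
            record_1 = list(islice(src, 4))  # forward read
            if not record_1:
                break  # clean end of file
            record_2 = list(islice(src, 4))  # matching reverse read
            if len(record_1) != 4 or len(record_2) != 4:
                raise ValueError("truncated record or unpaired read in input")
            fwd.writelines(record_1)
            rev.writelines(record_2)
            pairs += 1
    return pairs
```

Calling split_interleaved_fastq("interleaved.fq", "reads_1.fastq", "reads_2.fastq") would write the two paired-end files; the input name and reads_2.fastq are placeholders chosen for illustration.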

Once the directories are fully prepared, the following commands can be used (in the AMR directory) to generate the read libraries:

Paired end

snakemake -c8 population_folder/reads_1.fastq --use-conda

Single end

snakemake -c8 population_folder/reads.fastq --use-conda

If you have real bulk DNA samples that should be included in the ML process, you can subsample them so that they have the same size as the simulated ones. This is implemented in the script "Samples_from_database.py" and requires adapting some of the column values. Note that, even if the samples are real, the script and the database format expect the sample composition to be declared, so that the metadata can be created.
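As an illustration of the subsampling idea, here is a minimal sketch. It is not the code in "Samples_from_database.py": it assumes that "size" means the number of reads, that the input is an uncompressed FASTQ file, and it uses reservoir sampling as one possible technique. The function names are hypothetical.

```python
import random
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as lists of four lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_fastq(in_path, out_path, n_reads, seed=0):
    """Write a uniform random subset of at most n_reads records.

    Reservoir sampling (Algorithm R): the input is read once and only
    n_reads records are held in memory. Returns the number written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(in_path) as src:
        for index, record in enumerate(read_fastq_records(src)):
            if index < n_reads:
                reservoir.append(record)  # fill the reservoir first
            else:
                slot = rng.randint(0, index)  # inclusive on both ends
                if slot < n_reads:
                    reservoir[slot] = record  # replace a kept record
    with open(out_path, "w") as dst:
        for record in reservoir:
            dst.writelines(record)
    return len(reservoir)
```

For paired-end files with the same number of records, calling the function on each file with the same seed and n_reads selects the same record positions, so the pairs stay matched. The records are written in reservoir order, not in their original order.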

Alignment

The alignment is the procedure that maps the read libraries to the reference sequences from CARD and WildCARD. This procedure relies on the software RGI in its bwt mode. The command used to run the alignment is reported in the rule "rgi_bowtie2" and requires as input the read libraries resulting from the generation step. RGI requires setting up an environment that includes the Burrows-Wheeler transform of the reference database. This setup needs to be performed only once, usually the first time the pipeline is run. If it is required, the rule "rgi_bowtie2" calls the rule "rgi_env_setup", which contains the RGI commands used to prepare this environment.

To perform the alignment, the following command can be used (in the AMR directory):

snakemake -c8 population_*/bwt_out/bwt.allele_mapping_data.json --use-conda
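When many population folders exist, it can help to see which ones still lack the alignment output before launching Snakemake. The sketch below is not part of the repository. It assumes the folders match population_* and that the output path is the one in the command above; the function names are hypothetical. It only builds the command and does not run it.

```python
from pathlib import Path

# Output path taken from the snakemake command above.
ALIGNMENT_OUTPUT = Path("bwt_out") / "bwt.allele_mapping_data.json"


def find_pending_populations(root, pattern="population_*"):
    """Return population folders under root that lack the alignment output."""
    pending = []
    for folder in sorted(Path(root).glob(pattern)):
        if folder.is_dir() and not (folder / ALIGNMENT_OUTPUT).is_file():
            pending.append(folder)
    return pending


def snakemake_command(folder, cores=8):
    """Build the snakemake call requesting the alignment output of one folder."""
    target = Path(folder) / ALIGNMENT_OUTPUT
    return ["snakemake", f"-c{cores}", str(target), "--use-conda"]
```

Run from the AMR directory, find_pending_populations(".") yields relative folder paths, and each resulting command list can be passed to subprocess.run if desired.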

Rule all

If all the population folders are properly set up, you can request the output TERMINATED to both simulate the samples and perform the alignment:

snakemake -c8 TERMINATED --use-conda

Machine learning

This part of the pipeline is not included in the Snakefile, but is handled by the script "M_learning.py" and its dependencies. In this script, the population

The Snakefile also contains other rules that are not yet functional, so do not rely on them.

