This guide outlines the steps to identify genetic variants in E. coli using publicly available tools and data. It is meant for educational use, particularly for school students learning about bioinformatics pipelines.
sudo apt install bwa samtools bcftools

Tools installed:
- bwa: For aligning sequencing reads.
- samtools: For handling SAM/BAM files.
- bcftools: For variant calling.
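Before moving on, it can help to confirm that the three tools are actually on the PATH. The following is a minimal sketch; `check_tools` is a hypothetical helper name introduced here for illustration, not part of bwa, samtools, or bcftools.

```shell
# check_tools (hypothetical helper): report which of the named
# programs can be found on PATH, and fail if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every tool was found.
    [ "$missing" -eq 0 ]
}

check_tools bwa samtools bcftools || echo "Install the missing tools before continuing."
```

If any line says "missing", rerun the install command above before continuing.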
mkdir -p /home/data/ref_genome
curl -L -o ecoli_rel606.fasta.gz \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
gunzip ecoli_rel606.fasta.gz
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
tar xvf sub.tar.gz
mv sub/ ~/

This dataset contains paired-end FASTQ files.
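In paired-end data, the two files of a pair should contain the same number of reads. A FASTQ record normally takes four lines (header, sequence, "+", qualities), so counting lines and dividing by four gives the read count. The sketch below assumes that four-line layout and uncompressed files; `count_reads` is a hypothetical helper, and the toy reads are made up rather than taken from SRR2584866.

```shell
# count_reads (hypothetical helper): print the number of records in an
# uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    lines=$(wc -l < "$1" | tr -d ' ')
    echo $((lines / 4))
}

# Toy paired files with made-up reads, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n' > "$tmpdir/toy_1.fastq"
printf '@read1/2\nACGT\n+\nIIII\n@read2/2\nTCAA\n+\nIIII\n' > "$tmpdir/toy_2.fastq"

r1=$(count_reads "$tmpdir/toy_1.fastq")
r2=$(count_reads "$tmpdir/toy_2.fastq")
if [ "$r1" -eq "$r2" ]; then
    echo "pairs match: $r1 reads in each file"
else
    echo "pair mismatch: $r1 vs $r2"
fi
rm -rf "$tmpdir"
```

The same function can be pointed at the downloaded `_1` and `_2` files to check them before alignment.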
bwa index ecoli_rel606.fasta

Indexing prepares the reference for alignment.
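`bwa index` writes five files next to the reference, with the extensions .amb, .ann, .bwt, .pac, and .sa. A quick way to confirm that indexing finished is to check that all five exist and are not empty. This is only a sketch; `index_is_complete` is a hypothetical helper name.

```shell
# index_is_complete (hypothetical helper): succeed only if all five
# files that bwa index writes next to the reference exist and are non-empty.
index_is_complete() {
    ref=$1
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$ref.$ext" ]; then
            echo "missing index file: $ref.$ext"
            return 1
        fi
    done
    echo "index complete for $ref"
}

index_is_complete ecoli_rel606.fasta || echo "Run: bwa index ecoli_rel606.fasta"
```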
bwa mem ecoli_rel606.fasta \
~/sub/SRR2584866_1.trim.sub.fastq \
~/sub/SRR2584866_2.trim.sub.fastq \
> SRR2584866.aligned.sam

The FASTQ paths point to ~/sub/, where the dataset was moved earlier.
Output: SRR2584866.aligned.sam
samtools view -S -b SRR2584866.aligned.sam > SRR2584866.aligned.bam

BAM files are binary and compressed.
samtools sort -o SRR2584866.aligned.sorted.bam SRR2584866.aligned.bam
samtools index SRR2584866.aligned.sorted.bam

Sorting and indexing improve processing speed for the next steps.
samtools flagstat SRR2584866.aligned.sorted.bam

This gives summary metrics such as total reads and mapping rates.
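The mapping rate can also be pulled out of the flagstat text with a short filter. The sketch below assumes the usual flagstat line layout ("N + M in total ..." and "N + M mapped ..."). `mapping_rate` is a hypothetical helper, and the counts in the toy input are made up for illustration.

```shell
# mapping_rate (hypothetical helper): read samtools flagstat text on stdin
# and print mapped reads as a percentage of total reads.
mapping_rate() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }   # line: N + 0 in total (...)
        $4 == "mapped"              { mapped = $1 }  # line: N + 0 mapped (...)
        END { if (total > 0) printf "%.2f\n", 100 * mapped / total }
    '
}

# Toy input in flagstat line format; the counts are made up.
printf '%s\n' \
    '1000 + 0 in total (QC-passed reads + QC-failed reads)' \
    '950 + 0 mapped (95.00% : N/A)' \
    '950 + 0 primary mapped (95.00% : N/A)' \
    '0 + 0 with mate mapped to a different chr' | mapping_rate
```

On real data the function would be used as `samtools flagstat SRR2584866.aligned.sorted.bam | mapping_rate`.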
bcftools mpileup -O b -o SRR2584866_raw.bcf \
-f ecoli_rel606.fasta \
SRR2584866.aligned.sorted.bam
bcftools call --ploidy 1 -m -v -o SRR2584866_variants.vcf SRR2584866_raw.bcf

- --ploidy 1: Because bacteria are haploid.
- -m -v: Use the multiallelic caller and write variant sites only.
- Output: VCF file listing SNPs/indels.
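A quick summary of the calls is often more useful than reading the VCF line by line. The sketch below counts SNP and indel records. It relies on bcftools marking indel records with an INDEL flag in the INFO column (column 8), and it counts every other record as a SNP. `count_variants` is a hypothetical helper, and the toy records are made up rather than real calls from SRR2584866.

```shell
# count_variants (hypothetical helper): read VCF text on stdin and print
# how many records are SNPs and how many are indels.
count_variants() {
    awk -F '\t' '
        /^#/ { next }                              # skip header lines
        $8 ~ /(^|;)INDEL(;|$)/ { indels++; next }  # INFO carries the INDEL flag
        { snps++ }                                 # everything else counted as SNP
        END { printf "SNPs: %d\nIndels: %d\n", snps, indels }
    '
}

# Toy VCF with made-up records.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=10\n'
    printf 'chr1\t200\t.\tAT\tA\t40\t.\tINDEL;IDV=5;DP=12\n'
    printf 'chr1\t300\t.\tC\tT\t60\t.\tDP=9\n'
} | count_variants
```

On real data the function would be used as `count_variants < SRR2584866_variants.vcf`.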
less -S SRR2584866_variants.vcf

Use arrow keys to scroll horizontally if needed.
This workflow covers a full variant calling pipeline, from raw reads through to variant identification. It introduces the essential tools and file formats (FASTQ, FASTA, SAM, BAM, BCF, VCF) used in genome analysis.