Assignment 3: Genomic Databases and Advanced Applications (CCA5)

Project Overview

This repository contains the complete implementation for Assignment 3: Genomic Databases and Advanced Applications (CCA5).
The project progressively builds a suite of bioinformatics tools in Python: efficient FASTA file handling, algorithmic problem-solving (sequence alignment, pattern matching, and related tasks), performance optimization, and a command-line interface (CLI) with logging and error handling.

The final deliverable is a set of well-documented, tested, and high-performance Python modules.

Setup and Installation

Prerequisites

Python 3.8+
pip package manager

Dependencies

This project uses standard scientific computing libraries along with the tqdm library for progress monitoring.

Repository Structure

The repository is modular: each Python file corresponds to a specific assignment question for maximum clarity and reusability.

File / Folder	Assignment Question	Description
`cli_tool.py`	Q6 (Integration)	Main executable file implementing the CLI, logging, and error handling.
`fasta_handlers.py`	Q1 (FASTA Processing)	Efficient, generator-based parser for large FASTA files.
`multiple_sequence_analysis.py`	Q3 (Multiple Sequence)	Implements global alignment, LCS, P-distance, and consensus sequence generation.
`advanced_pattern_analysis.py`	Q4 (Pattern Analysis)	Algorithms for suffix array construction, palindrome/repeat detection, and greedy assembly simulation.
`performance_scalability.py`	Q5 (Performance)	Demonstrates memory-efficient generators and multiprocessing with `tqdm`.
`test_*.py` files	Q3–Q5 (Testing)	Unit test files validating functionality for all major algorithms.
`test_sequences.fasta`	Data	Sample FASTA input file for testing the CLI.
`genomic_tool.log`	Q6 (Logging)	Automatically generated runtime log file.
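
As an illustration of the generator-based parsing described for `fasta_handlers.py`, here is a minimal sketch of a streaming FASTA reader. The function name `read_fasta` and the sample data are hypothetical and are not taken from the repository; the actual module may differ in interface and in how it handles malformed input.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time.

    Hypothetical helper: only the current record is held in memory,
    so the file size does not determine peak memory use.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Usage with in-memory sample data (assumed, for illustration only).
sample = io.StringIO(">seq1 demo\nACGT\nTTGA\n\n>seq2\nGGCC\n")
records = list(read_fasta(sample))
```

With a real file, the same function can be driven by `with open(path) as handle:` and a `for` loop, so records are processed as they are read instead of being collected into a list.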

Key Algorithms Implemented

Question	Requirement	Implementation	Complexity
Q3	Sequence Alignment	Needleman–Wunsch Global Alignment	O(M × N)
Q3	Common Subsequences	Longest Common Subsequence (LCS) via 2D DP	O(N × L²) for N sequences of length L
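Q4	Pattern Matching	Suffix Array + Binary Search for pattern lookup	O(L log L) construction (assumes an efficient builder; naive suffix sorting is O(L² log L)); O(M log L) per lookup for a pattern of length M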
Q4	Assembly Simulation	Greedy OLC (Overlap–Layout–Consensus)	O(N² × L) (worst case)
Q5	Memory Optimization	Python Generators for streaming large data	O(1) memory in the number of records (one record in memory at a time)
Q5	Parallel Execution	`multiprocessing.Pool` + `tqdm`	≈ O(L / P) where P = # of cores
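
To make two rows of the table concrete, the sketch below shows a Needleman–Wunsch global alignment score and a suffix array lookup by binary search. All function names and the scoring scheme (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not the repository's actual API or parameters. The suffix array here is built by naive sorting, which is simpler but slower than the construction bound quoted in the table.

```python
from typing import List


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score in O(M x N) time.

    Keeps only two DP rows, so memory is O(N). Scoring values are assumed.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return sorted start positions of pattern using two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


genome = "GATTACATTA"
suffixes = build_suffix_array(genome)
hits = find_occurrences(genome, suffixes, "TTA")
score = nw_score("ACGT", "AGT")
```

Each binary search compares at most M characters per step, which is where the O(M log L) lookup cost for a pattern of length M comes from.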

Repository: siyaagarwal2005/Algorithms-In-Bioinformatics-CCA-5

Folders and files

Latest commit

History

Repository files navigation

Assignment 3: Genomic Databases and Advanced Applications (CCA5)

Project Overview

Setup and Installation

Prerequisites

Dependencies

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages