nvc97/simple-bigdataplatform-ingestion
Big Data Platform - Kafka + MongoDB Integration

A scalable, distributed data ingestion and storage platform. The system simulates streaming real-time taxi trip data through Apache Kafka into a replicated MongoDB cluster, demonstrating key big data architecture patterns and high-availability design.

📋 Overview

This project implements a distributed data platform for high-velocity data streams. Using Apache Kafka for message brokering and MongoDB for persistent storage, it demonstrates critical big data concepts: distributed computing, fault tolerance, and horizontal scalability.

The system simulates taxi trip data flowing from producers → Kafka topics → consumer applications → replicated MongoDB cluster. Performance-tested under concurrent load (1-5 consumers), it reveals throughput bottlenecks and scalability patterns while maintaining high availability even during node failures.

🏗️ Architecture

The platform consists of three core components:

  • mysimbdp-dataingest: Apache Kafka cluster with Python producer/consumer for streaming data ingestion
  • mysimbdp-coredms: MongoDB replica set (3 nodes) for distributed data storage with automatic failover
  • mysimbdp-daas: MongoDB Compass GUI for data interaction and monitoring
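The dataflow through these components can be sketched as follows. This is a minimal, hypothetical outline (the database/collection names, the `rs0` replica-set name, and the batch size are assumptions; the actual logic lives in `kafka_consumer.py`):

```python
import json
from itertools import islice

def batched(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def decode(message_value: bytes) -> dict:
    """Each Kafka message carries one taxi trip as UTF-8 JSON."""
    return json.loads(message_value.decode("utf-8"))

def run(topic="taxi-trips", group="c_group_1", batch_size=500):
    # Third-party clients are imported lazily so the pure helpers above
    # can be used and tested without a running cluster.
    from kafka import KafkaConsumer   # pip install kafka-python
    from pymongo import MongoClient   # pip install pymongo

    consumer = KafkaConsumer(topic, group_id=group,
                             bootstrap_servers="localhost:9092")
    # Replica-set aware URI: the driver re-routes writes after a failover.
    client = MongoClient("mongodb://mongo1,mongo2,mongo3/?replicaSet=rs0")
    trips = client["bdp"]["taxi_trips"]
    for batch in batched((decode(m.value) for m in consumer), batch_size):
        trips.insert_many(batch)      # one bulk insert per batch

if __name__ == "__main__":
    run()
```

Writing in batches via `insert_many` rather than per-message inserts is what makes the batch size a tunable throughput knob.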

✨ Key Features

  • Distributed Architecture: 3-node MongoDB replica set preventing single points of failure
  • Real-time Ingestion: Kafka-based streaming with configurable batch processing (tested with 1-5 concurrent consumers)
  • High Availability: Automatic failover and data replication across all nodes
  • Performance Tested: Sustained ~1000 records/second insertion rate with latency monitoring
  • Production-Ready: Docker containerized deployment with comprehensive monitoring

📊 Dataset

Chicago Taxi Trips data with 24 fields, including timestamps, coordinates, fares, and trip metrics. Records are stored as JSON documents in MongoDB, which automatically indexes each document's `_id`.

Dataset link: Chicago Taxi Trips Dataset
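An illustrative document shape (field names follow the public Chicago Taxi Trips schema; the values here are made up, and real documents carry all 24 fields):

```json
{
  "trip_id": "abc123",
  "trip_start_timestamp": "2023-01-01T08:15:00",
  "trip_seconds": 540,
  "trip_miles": 2.7,
  "pickup_centroid_latitude": 41.88,
  "pickup_centroid_longitude": -87.63,
  "fare": 9.75,
  "tips": 2.0
}
```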

📁 Project Structure

```
code/
├── kafka_producer.py      # CSV → Kafka streaming producer
├── kafka_consumer.py      # Kafka → MongoDB batch consumer
├── docker-compose.yml     # MongoDB replica set definition
├── requirements.txt       # Python dependencies
└── scripts/
    └── init-cluster.sh    # Replica set initialization
data/
└── data.csv               # Sample taxi trips dataset
reports/
├── Assignment-1-Report.md        # Detailed design & implementation
└── Assignment-1-Deployment.md    # Setup & deployment guide
```

🚀 Quick Start

```shell
# Install dependencies
pip install -r requirements.txt

# Start MongoDB cluster
docker-compose up -d
docker exec -it mongo1 bash /scripts/init-cluster.sh

# Create Kafka topic
kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --replication-factor 3 --partitions 3 --topic taxi-trips

# Run producer and consumer
python3 kafka_producer.py -i ../data/data.csv -c 10 -t taxi-trips
python3 kafka_consumer.py -t taxi-trips -g c_group_1 -id consumer1
```
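The producer step can be sketched roughly like this. It is a hypothetical outline of what `kafka_producer.py` does (the JSON serialization and broker address are assumptions, not the file's actual code):

```python
import csv
import json

def row_to_message(row: dict) -> bytes:
    """Serialize one CSV row as a UTF-8 JSON Kafka message value."""
    return json.dumps(row).encode("utf-8")

def stream_csv(path: str, topic: str = "taxi-trips") -> None:
    """Read taxi trips from a CSV file and publish one message per record."""
    from kafka import KafkaProducer   # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with open(path, newline="") as f:
        for row in csv.DictReader(f):            # one message per trip
            producer.send(topic, row_to_message(row))
    producer.flush()                             # wait for all sends to complete
```

With three partitions on the `taxi-trips` topic, up to three consumers in one group can read in parallel, which is what the 1-5 consumer tests below exercise.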

📈 Performance Results

| Consumers | Execution Time | Insertion Rate | Status     |
|-----------|----------------|----------------|------------|
| 1         | 11 min         | ~1000 rec/s    | Consistent |
| 3         | 16 min         | ~1000 rec/s    | Good       |
| 5         | 24 min         | ~1000 rec/s    | Degraded   |

Identified constraint: the pipeline becomes CPU-bound at high concurrency; throughput can scale further by adding Kafka partitions and broker/consumer nodes.

🎯 Key Achievements

  1. Designed and deployed a production-grade distributed data platform
  2. Implemented high-availability architecture with automatic failover
  3. Built streaming ETL pipeline handling real-time data ingestion
  4. Conducted performance testing under concurrent load (1-5 consumers)
  5. Containerized infrastructure for reproducible deployment
  6. Identified scalability bottlenecks and proposed solutions

📄 Reports

  • reports/Assignment-1-Report.md: detailed design & implementation
  • reports/Assignment-1-Deployment.md: setup & deployment guide

🛠️ Tech Stack

  • Streaming: Apache Kafka (distributed message broker)
  • Database: MongoDB
  • Language: Python 3
  • Containerization: Docker & Docker Compose
  • Monitoring: Kafka metrics, MongoDB Compass

💡 Learning Outcomes

This project demonstrates expertise in:

  • Big data infrastructure design and deployment
  • Distributed systems (replication, failover, consensus)
  • Real-time data pipelines and stream processing
  • Performance testing and bottleneck analysis
  • Infrastructure-as-code (Docker Compose)
