nvc97/simple-bigdataplatform-ingestion
Big Data Platform - Kafka + MongoDB Integration

A scalable, distributed data ingestion and storage platform. The system simulates streaming real-time taxi trip data through Apache Kafka into a replicated MongoDB cluster, demonstrating key big data architecture patterns and high-availability design.

📋 Overview

This project implements a distributed data platform for high-velocity data streams. Using Apache Kafka for message brokering and MongoDB for persistent storage, it demonstrates critical big data concepts: distributed computing, fault tolerance, and horizontal scalability.

The system simulates taxi trip data flowing from producers → Kafka topics → consumer applications → replicated MongoDB cluster. Performance-tested under concurrent load (1-5 consumers), it reveals throughput bottlenecks and scalability patterns while maintaining high availability even during node failures.

🏗️ Architecture

The platform consists of three core components:

  • mysimbdp-dataingest: Apache Kafka cluster with Python producer/consumer for streaming data ingestion
  • mysimbdp-coredms: MongoDB replica set (3 nodes) for distributed data storage with automatic failover
  • mysimbdp-daas: MongoDB Compass GUI for data interaction and monitoring
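The dataflow through these components can be sketched as follows. This is a minimal, hypothetical outline (the database/collection names, the `rs0` replica-set name, and the batch size are assumptions; the actual logic lives in `kafka_consumer.py`):

```python
import json
from itertools import islice

def batched(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def decode(message_value: bytes) -> dict:
    """Each Kafka message carries one taxi trip as UTF-8 JSON."""
    return json.loads(message_value.decode("utf-8"))

def run(topic="taxi-trips", group="c_group_1", batch_size=500):
    # Third-party clients are imported lazily so the pure helpers above
    # can be used and tested without a running cluster.
    from kafka import KafkaConsumer   # pip install kafka-python
    from pymongo import MongoClient   # pip install pymongo

    consumer = KafkaConsumer(topic, group_id=group,
                             bootstrap_servers="localhost:9092")
    # Replica-set aware URI: the driver re-routes writes after a failover.
    client = MongoClient("mongodb://mongo1,mongo2,mongo3/?replicaSet=rs0")
    trips = client["bdp"]["taxi_trips"]
    for batch in batched((decode(m.value) for m in consumer), batch_size):
        trips.insert_many(batch)      # one bulk insert per batch

if __name__ == "__main__":
    run()
```

Writing in batches via `insert_many` rather than per-message inserts is what makes the batch size a tunable throughput knob.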

✨ Key Features

  • Distributed Architecture: 3-node MongoDB replica set preventing single points of failure
  • Real-time Ingestion: Kafka-based streaming with configurable batch processing (tested with 1-5 concurrent consumers)
  • High Availability: Automatic failover and data replication across all nodes
  • Performance Tested: Sustained ~1000 records/second insertion rate with latency monitoring
  • Production-Ready: Docker containerized deployment with comprehensive monitoring

📊 Dataset

Chicago Taxi Trips data with 24 fields, including timestamps, coordinates, fares, and trip metrics. Records are stored as JSON documents in MongoDB, which automatically indexes each document's `_id`.

Dataset link: Chicago Taxi Trips Dataset
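An illustrative document shape (field names follow the public Chicago Taxi Trips schema; the values here are made up, and real documents carry all 24 fields):

```json
{
  "trip_id": "abc123",
  "trip_start_timestamp": "2023-01-01T08:15:00",
  "trip_seconds": 540,
  "trip_miles": 2.7,
  "pickup_centroid_latitude": 41.88,
  "pickup_centroid_longitude": -87.63,
  "fare": 9.75,
  "tips": 2.0
}
```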

📁 Project Structure

```
code/
├── kafka_producer.py      # CSV → Kafka streaming producer
├── kafka_consumer.py      # Kafka → MongoDB batch consumer
├── docker-compose.yml     # MongoDB replica set definition
├── requirements.txt       # Python dependencies
└── scripts/
    └── init-cluster.sh    # Replica set initialization
data/
└── data.csv               # Sample taxi trips dataset
reports/
├── Assignment-1-Report.md        # Detailed design & implementation
└── Assignment-1-Deployment.md    # Setup & deployment guide
```

🚀 Quick Start

```shell
# Install dependencies
pip install -r requirements.txt

# Start MongoDB cluster
docker-compose up -d
docker exec -it mongo1 bash /scripts/init-cluster.sh

# Create Kafka topic
kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --replication-factor 3 --partitions 3 --topic taxi-trips

# Run producer and consumer
python3 kafka_producer.py -i ../data/data.csv -c 10 -t taxi-trips
python3 kafka_consumer.py -t taxi-trips -g c_group_1 -id consumer1
```
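The producer step can be sketched roughly like this. It is a hypothetical outline of what `kafka_producer.py` does (the JSON serialization and broker address are assumptions, not the file's actual code):

```python
import csv
import json

def row_to_message(row: dict) -> bytes:
    """Serialize one CSV row as a UTF-8 JSON Kafka message value."""
    return json.dumps(row).encode("utf-8")

def stream_csv(path: str, topic: str = "taxi-trips") -> None:
    """Read taxi trips from a CSV file and publish one message per record."""
    from kafka import KafkaProducer   # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    with open(path, newline="") as f:
        for row in csv.DictReader(f):            # one message per trip
            producer.send(topic, row_to_message(row))
    producer.flush()                             # wait for all sends to complete
```

With three partitions on the `taxi-trips` topic, up to three consumers in one group can read in parallel, which is what the 1-5 consumer tests below exercise.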

📈 Performance Results

| Consumers | Execution Time | Insertion Rate | Status     |
|-----------|----------------|----------------|------------|
| 1         | 11 min         | ~1000 rec/s    | Consistent |
| 3         | 16 min         | ~1000 rec/s    | Good       |
| 5         | 24 min         | ~1000 rec/s    | Degraded   |

Identified constraint: the pipeline becomes CPU-bound at high concurrency; throughput can scale further by adding Kafka partitions and broker/consumer nodes.

🎯 Key Achievements

  1. Designed and deployed a production-grade distributed data platform
  2. Implemented high-availability architecture with automatic failover
  3. Built streaming ETL pipeline handling real-time data ingestion
  4. Conducted performance testing under concurrent load (1-5 consumers)
  5. Containerized infrastructure for reproducible deployment
  6. Identified scalability bottlenecks and proposed solutions

📄 Reports

  • reports/Assignment-1-Report.md: detailed design & implementation
  • reports/Assignment-1-Deployment.md: setup & deployment guide

🛠️ Tech Stack

  • Streaming: Apache Kafka (distributed message broker)
  • Database: MongoDB
  • Language: Python 3
  • Containerization: Docker & Docker Compose
  • Monitoring: Kafka metrics, MongoDB Compass

💡 Learning Outcomes

This project demonstrates expertise in:

  • Big data infrastructure design and deployment
  • Distributed systems (replication, failover, consensus)
  • Real-time data pipelines and stream processing
  • Performance testing and bottleneck analysis
  • Infrastructure-as-code (Docker Compose)
