A scalable distributed data ingestion and storage platform. This system simulates streaming real-time taxi trip data through Apache Kafka into a replicated MongoDB cluster, demonstrating key big data architecture patterns and high-availability design.
This project implements a distributed data platform for high-velocity data streams. Using Apache Kafka for message brokering and MongoDB for persistent storage, it demonstrates critical big data concepts: distributed computing, fault tolerance, and horizontal scalability.
The system simulates taxi trip data flowing from producers → Kafka topics → consumer applications → replicated MongoDB cluster. Performance-tested under concurrent load (1-5 consumers), it reveals throughput bottlenecks and scalability patterns while maintaining high availability even during node failures.
The platform consists of three core components:
- mysimbdp-dataingest: Apache Kafka cluster with Python producer/consumer for streaming data ingestion
- mysimbdp-coredms: MongoDB replica set (3 nodes) for distributed data storage with automatic failover
- mysimbdp-daas: MongoDB Compass GUI for data interaction and monitoring
- Distributed Architecture: 3-node MongoDB replica set preventing single points of failure
- Real-time Ingestion: Kafka-based streaming with configurable batch processing (tested with 1-5 concurrent consumers)
- High Availability: Automatic failover and data replication across all nodes
- Performance Tested: Sustained ~1000 records/second insertion rate with latency monitoring
- Production-Ready: Docker containerized deployment with comprehensive monitoring
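The "configurable batch processing" above can be illustrated without a running broker. The helper below (the name `chunked` is mine, not from the repo) sketches the grouping step that would precede each MongoDB `insert_many` call in the consumer:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(messages: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield fixed-size batches of consumed messages; the final batch may be short."""
    it = iter(messages)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# In the real consumer, each yielded batch would go to collection.insert_many(batch).
```

Batching amortizes the per-request overhead of MongoDB writes, which is what makes the sustained ~1000 rec/s rate plausible.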
Chicago Taxi Trips data with 24 fields including timestamps, coordinates, fares, and trip metrics. Processed as JSON documents in MongoDB with automatic indexing.
Dataset link: Chicago Taxi Trips Dataset
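For illustration, a single trip document might look like the sketch below. The field names follow the public Chicago Taxi Trips schema, but the exact 24 fields used here are not listed in this README, so treat this as a representative subset rather than the actual document shape:

```python
import json

# Representative (assumed) subset of a taxi-trip document as stored in MongoDB.
sample_trip = {
    "trip_id": "abc123",                        # unique trip identifier
    "taxi_id": "t-042",                         # anonymized vehicle id
    "trip_start_timestamp": "2023-05-01T08:15:00",
    "trip_end_timestamp": "2023-05-01T08:32:00",
    "trip_seconds": 1020,
    "trip_miles": 4.7,
    "pickup_centroid_latitude": 41.8781,
    "pickup_centroid_longitude": -87.6298,
    "fare": 14.25,
    "tips": 3.0,
}

# Records travel through Kafka as JSON and land in MongoDB as BSON documents.
print(json.dumps(sample_trip, indent=2))
```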
code/
├── kafka_producer.py # CSV → Kafka streaming producer
├── kafka_consumer.py # Kafka → MongoDB batch consumer
├── docker-compose.yml # MongoDB replica set definition
├── requirements.txt # Python dependencies
└── scripts/
└── init-cluster.sh # Replica set initialization
data/
└── data.csv # Sample taxi trips dataset
reports/
├── Assignment-1-Report.md # Detailed design & implementation
└── Assignment-1-Deployment.md # Setup & deployment guide
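The init script most likely wraps a single `rs.initiate()` call; a minimal sketch, assuming the container hostnames are `mongo1`/`mongo2`/`mongo3` (as suggested by the `docker exec -it mongo1` command below) and a replica set named `rs0`:

```bash
#!/bin/bash
# init-cluster.sh (sketch): initialize a 3-member replica set named rs0.
mongosh --host mongo1 --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
})'
```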
# Install dependencies
pip install -r requirements.txt
# Start MongoDB cluster
docker-compose up -d
docker exec -it mongo1 bash /scripts/init-cluster.sh
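For reference, a compose file for such a cluster typically declares three `mongod` services started with `--replSet`; a minimal sketch (image tag, replica-set name, and port mapping are assumptions, not taken from the repo):

```yaml
services:
  mongo1:
    image: mongo:7
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports: ["27017:27017"]
    volumes: ["./scripts:/scripts"]
  mongo2:
    image: mongo:7
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports: ["27018:27017"]
  mongo3:
    image: mongo:7
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports: ["27019:27017"]
```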
# Create Kafka topic
kafka-topics.sh --create --bootstrap-server localhost:9092 \
--replication-factor 3 --partitions 3 --topic taxi-trips
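Before running the pipeline, it may help to see the producer's core transformation in isolation. The sketch below (function name is mine) shows a CSV row becoming the JSON payload that `kafka_producer.py` presumably sends, without requiring a broker:

```python
import csv
import io
import json


def rows_to_payloads(csv_text: str) -> list:
    """Parse CSV text and JSON-encode each row, as the producer would
    before calling producer.send(topic, value=payload)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row).encode("utf-8") for row in reader]


# Tiny two-column example standing in for the real 24-field dataset.
payloads = rows_to_payloads("trip_id,fare\na1,12.5\na2,9.0\n")
```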
# Run producer and consumer
python3 kafka_producer.py -i ../data/data.csv -c 10 -t taxi-trips
python3 kafka_consumer.py -t taxi-trips -g c_group_1 -id consumer1

| Consumers | Execution Time | Insertion Rate | Status |
|---|---|---|---|
| 1 | 11 min | ~1000 rec/s | Consistent |
| 3 | 16 min | ~1000 rec/s | Good |
| 5 | 24 min | ~1000 rec/s | Degraded |
Performance constraints identified: the pipeline becomes CPU-bound at high consumer concurrency; further scalability is achievable by adding broker/database nodes and Kafka topic partitions.
- Designed and deployed a production-grade distributed data platform
- Implemented high-availability architecture with automatic failover
- Built streaming ETL pipeline handling real-time data ingestion
- Conducted performance testing under concurrent load (1-5 consumers)
- Containerized infrastructure for reproducible deployment
- Identified scalability bottlenecks and proposed solutions
- Design & Implementation Report: Architecture decisions, detailed implementation, and performance analysis
- Deployment Guide: Step-by-step setup instructions, monitoring tools, troubleshooting
- Streaming: Apache Kafka (distributed message broker)
- Database: MongoDB
- Language: Python 3
- Containerization: Docker & Docker Compose
- Monitoring: Kafka metrics, MongoDB Compass
This project demonstrates expertise in:
- Big data infrastructure design and deployment
- Distributed systems (replication, failover, consensus)
- Real-time data pipelines and stream processing
- Performance testing and bottleneck analysis
- Infrastructure-as-code (Docker Compose)