Welcome to ETL Problems, an open-source project designed for learning, experimenting, and contributing to real-world data engineering workflows.
This repository contains a deliberately broken ETL pipeline that mimics issues data engineers face daily. The goal is for contributors to identify, fix, and enhance the pipeline — while learning best practices in data extraction, transformation, and loading.
The pipeline follows a simple ETL flow:
- Extract → Reads data from a CSV file (with encoding fallback).
- Transform → Cleans, deduplicates, and prepares the dataset.
- Load → Stores processed data into an SQLite database (with idempotency).
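The three stages above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the function names, the `records` table, and its `id`/`name` columns are all assumptions made for the example, and the real pipeline may use different libraries or schemas.

```python
import csv
import sqlite3


def extract(path, encodings=("utf-8", "latin-1")):
    """Read a CSV into a list of dicts, trying each encoding in turn."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as handle:
                return list(csv.DictReader(handle))
        except UnicodeDecodeError as exc:
            last_error = exc  # fall back to the next encoding
    raise ValueError(f"could not decode {path} with {encodings}") from last_error


def transform(rows):
    """Strip whitespace from values and drop exact duplicate rows, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip the None key that DictReader uses for surplus columns.
        normalized = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


def load(rows, db_path):
    """Write rows to SQLite; the primary key plus INSERT OR REPLACE
    means rerunning the load does not create duplicate rows."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, name TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO records (id, name) VALUES (:id, :name)",
                rows,
            )
    finally:
        conn.close()
```

Calling `load(transform(extract("data.csv")), "etl.db")` twice in a row should leave the table with the same rows as calling it once, which is the idempotency property the Load step refers to. The file names here are placeholders.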
The pipeline's bugs are introduced intentionally and marked in the code with:
# TODO (Find & Fix): ...
Contributors should search for these comments and fix the issues, which include:
- Unused imports
- Incorrect default values
- Wrong file extension checks
- Missing error handling
- Print statements instead of logging
- Missing idempotency in database load
- No duplicate removal in transform
- Missing actual logic in extract/transform/load steps
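Several of these issues (print statements, missing error handling, wrong file extension checks) tend to have similar fixes. The sketch below shows one possible shape using only the standard library. The helper names `configure_logging`, `validate_csv_path`, and `run_step` are hypothetical and do not come from the repository.

```python
import logging
from pathlib import Path

logger = logging.getLogger("etl")


def configure_logging(level=logging.INFO):
    """Set up root logging once, instead of scattering print() calls."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def validate_csv_path(path):
    """Return the path as a Path if it names an existing .csv file; raise otherwise."""
    candidate = Path(path)
    # Compare the suffix case-insensitively so "DATA.CSV" is accepted too.
    if candidate.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got {candidate.name!r}")
    if not candidate.is_file():
        raise FileNotFoundError(f"input file not found: {candidate}")
    return candidate


def run_step(name, func, *args):
    """Run one pipeline step, logging its start, success, or failure."""
    logger.info("starting %s", name)
    try:
        result = func(*args)
    except Exception:
        # logger.exception records the traceback; re-raise so callers still see the error.
        logger.exception("%s failed", name)
        raise
    logger.info("finished %s", name)
    return result
```

Logging through a named logger rather than `print` lets the level and destination be changed in one place, and re-raising after `logger.exception` keeps failures visible to the caller instead of silently swallowing them.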
- Fix bugs marked with # TODO (Find & Fix): ...
- Improve error handling and logging
- Add tests and validation
- Enhance documentation
- Add new features (scrapers, data quality checks, visualizations)
Clone the repo and install dependencies:
git clone https://github.com/<your-username>/etl-problems.git
cd etl-problems
pip install -r app/requirements.txt
python -m app.main
Containerize the ETL pipeline for a consistent, isolated development environment across all machines.
Prerequisites: Docker and Docker Compose must be installed on your system.
Quick Start:
- Clone the repository:
git clone https://github.com/<your-username>/etl-problems.git
cd etl-problems
- Run the pipeline in a container:
docker-compose up
This command will:
- Build the Docker image from the provided Dockerfile
- Start the ETL pipeline in an isolated container
- Mount your local code directory as a volume, so changes you make to the code are immediately reflected in the container
Rebuilding the Image:
If you update dependencies in requirements.txt, rebuild the image:
docker-compose up --build
Interactive Mode:
To run commands interactively inside the container:
docker-compose run etl bash
Then inside the container, you can run:
python -m app.main
python -m pytest tests/
Stopping the Container:
docker-compose down
Benefits:
- 🎯 Consistency: Same environment for all developers (Python 3.10, all dependencies)
- 📦 Isolation: No conflicts with local Python installations
- 🚀 Reproducibility: Works the same on Windows, macOS, and Linux
- 🔧 Hot Reload: Code changes are immediately reflected without rebuilding
- 🧪 Testing: Run tests in an isolated environment
Unit tests can be added in the tests/ folder.
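As a starting point, a test file might look like the sketch below. The file name and the `deduplicate` function are stand-ins invented for this example. In a real test you would import the transform function from the `app` package instead of defining it inline.

```python
# tests/test_transform_example.py (hypothetical file name)


def deduplicate(rows):
    """Stand-in for the pipeline's transform step: drop repeated rows, keep the first."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def test_deduplicate_removes_exact_duplicates():
    rows = [{"id": "1"}, {"id": "1"}, {"id": "2"}]
    assert deduplicate(rows) == [{"id": "1"}, {"id": "2"}]


def test_deduplicate_keeps_empty_input_empty():
    assert deduplicate([]) == []
```

pytest collects any function whose name starts with `test_` in files named `test_*.py`, so no extra registration should be needed.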
Run them with:
pytest tests/
- Search for # TODO (Find & Fix): ... in the codebase.
- Check the Issues for tasks and guidance.
- If you find a new bug, open an issue and suggest a fix.
- All contributions, big or small, are welcome!
Open an issue or start a discussion in the repo. Happy hacking!