Production-grade URL Shortener built for resilience, observability, and scale. Engineered for the MLH Production Engineering Quest with full incident response and reliability testing.
Get Vybe up and running in 30 seconds:
git clone https://github.com/Invariants0/Vybe.git && cd Vybe
just dev-up
Note
Make sure you have overmind and just installed.
Visit http://localhost to access the dashboard.
API Docs: http://localhost/api/v1/docs
Monitoring: Grafana at http://localhost:3001 (admin/admin)
- ⚡ Blazing Fast: 45ms P95 latency at 500 RPS.
- 🛡️ Built for Failure: Resilient to DB/Cache outages and container crashes.
- 👁️ Full Visibility: Prometheus metrics, Grafana dashboards, and structured JSON logs.
- 🧪 Battle Tested: 7 verified failure scenarios and automated integration tests.
- 📖 Operator-First: Comprehensive runbooks, architecture guides, and capacity plans.
graph TD
User([External Request]) --> Nginx[NGINX Reverse Proxy]
Nginx --> App1[Flask App Instance 1]
Nginx --> App2[Flask App Instance 2]
App1 --> DB[(PostgreSQL)]
App2 --> DB
App1 -.-> Cache((Redis))
App2 -.-> Cache
subgraph Observability
App1 & App2 & DB --> Prom[Prometheus]
Prom --> Graf[Grafana + AlertManager]
end
👨‍💻 For Developers
Understand the codebase and start contributing:
- Quick Start Index - Orientation (5m)
- Architecture Guide - Deep dive (15m)
- API Reference - All 18 endpoints
- Local Dev Setup - Environment config
🛠️ For DevOps / SRE
Operational guidance for production:
- Deployment Guide - Local to Cloud
- Config Reference - Env var tuning
- Capacity Plan - Scaling limits
- Runbooks - Incident procedures
🚨 For On-Call Engineers
Fast response when things break:
- Incident Runbooks - Step-by-step fixes
- Troubleshooting Guide - Root cause diagnosis
- Alert Definitions - What each alert means
📑 Complete Documentation Index
| Document | Audience | Time |
|---|---|---|
| Architecture | Engineers | 40m |
| API Reference | Devs | 20m |
| Deployment | SRE | 30m |
| Troubleshooting | On-Call | 25m |
| Runbooks | On-Call | 30m |
| Decision Log | Architects | 20m |
- Prepare Environment:
uv sync
cp backend/.env.example backend/.env
- Start Dependencies:
docker compose up -d db redis
- Init & Run:
uv run python scripts/init_db.py
uv run python run.py
Important
Ensure your .env contains the correct database credentials. Redis and Grafana use the default passwords set in docker-compose.yml.
Tip
To optimize performance for 500+ RPS, ensure Redis is healthy and REDIS_CACHE_ENABLED is set to true.
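To illustrate why the cache matters and why a Redis outage should not take lookups down, here is a minimal cache-aside sketch. It is not Vybe's actual code: `InMemoryCache`, `CacheUnavailable`, `cache_enabled`, and `resolve` are hypothetical names, a plain dict stands in for PostgreSQL, and the `"false"` default for `REDIS_CACHE_ENABLED` is an assumption.

```python
import os


class CacheUnavailable(Exception):
    """Raised by the stand-in cache when the backend cannot be reached."""


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; `healthy` simulates an outage."""

    def __init__(self):
        self.store = {}
        self.healthy = True

    def get(self, key):
        if not self.healthy:
            raise CacheUnavailable("cache down")
        return self.store.get(key)

    def set(self, key, value):
        if not self.healthy:
            raise CacheUnavailable("cache down")
        self.store[key] = value


def cache_enabled(env=os.environ):
    # Assumption: the flag is the literal string "true" (case-insensitive).
    return env.get("REDIS_CACHE_ENABLED", "false").strip().lower() == "true"


def resolve(short_code, cache, db, enabled):
    """Return (url, source). Falls back to the database if the cache fails."""
    if enabled:
        try:
            hit = cache.get(short_code)
            if hit is not None:
                return hit, "cache"
        except CacheUnavailable:
            pass  # degrade to the database instead of failing the request

    url = db.get(short_code)
    if url is not None and enabled:
        try:
            cache.set(short_code, url)  # warm the cache for the next request
        except CacheUnavailable:
            pass
    return url, "db"
```

With the flag on and the cache healthy, only the first lookup of a short code reaches the database. With the cache down, every lookup reaches the database but still succeeds.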
📁 Project Structure
backend/ # Flask API, SQLAlchemy models, business logic
frontend/ # Next.js Dashboard and UI
infra/ # Nginx configs, Dockerfiles
monitoring/ # Prometheus & Grafana provisioning
scripts/ # DB init, Chaos testing, Load testing
tests/ # Unit and Integration test suites
📊 Observability & Monitoring
Vybe tracks RED metrics (Rate, Errors, Duration) for every request.
| Alert | Trigger | Recovery Action |
|---|---|---|
| Instance Down | 0 healthy targets | Auto-failover by Nginx |
| High Error Rate | >5% failures | Log analysis via Grafana |
| P95 Latency | >1s duration | Scale app or check DB load |
| DB Pool Exhaust | >90% usage | Increase DB_POOL_SIZE |
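The thresholds in the table can be read as simple predicates over a window of request samples. The sketch below is illustrative only: the real rules live in the Prometheus/AlertManager configuration, and `p95` and `firing_alerts` are hypothetical helpers. It assumes a "failure" is an HTTP 5xx response and uses a nearest-rank percentile.

```python
def p95(durations):
    """Nearest-rank 95th percentile of a list of durations (seconds)."""
    if not durations:
        return 0.0
    ordered = sorted(durations)
    # Integer ceil(0.95 * n) to avoid floating-point edge cases.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def firing_alerts(healthy_targets, statuses, durations_s, pool_in_use, pool_size):
    """Return the names of alerts whose trigger condition is met."""
    alerts = []

    if healthy_targets == 0:
        alerts.append("Instance Down")

    total = len(statuses)
    errors = sum(1 for s in statuses if s >= 500)  # assumption: failure = 5xx
    if total and errors / total > 0.05:
        alerts.append("High Error Rate")

    if p95(durations_s) > 1.0:
        alerts.append("P95 Latency")

    if pool_size and pool_in_use / pool_size > 0.90:
        alerts.append("DB Pool Exhaust")

    return alerts


if __name__ == "__main__":
    # 10% errors, slow tail, and a nearly full pool trip three alerts.
    print(firing_alerts(2, [200] * 90 + [500] * 10,
                        [0.05] * 90 + [2.0] * 10, 19, 20))
```

All comparisons are strict, so a window sitting exactly at 5% errors or 90% pool usage does not fire.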
🧪 Testing & Coverage
uv run pytest tests/ --cov=backend # All tests
uv run pytest tests/unit/ # Unit only
[!WARNING]
Current coverage is 45%. Priority: Increase coverage for `url_service` and `auth` modules.
⚠️ Safety & Best Practices
[!CAUTION]
Never perform manual `DELETE` operations on the PostgreSQL `urls` table in production. This causes cache inconsistency. Use the `SCRUB_DATA` API endpoint instead.
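A small sketch of the failure mode, using dicts as stand-ins for PostgreSQL and Redis. How the real endpoint works is not documented here, so `scrub` is a hypothetical stand-in that assumes the safe path removes the row and invalidates the cache entry together; `lookup` is likewise illustrative.

```python
def lookup(code, cache, db):
    """Cache-aside read: serve from cache, else load from db and cache it."""
    url = cache.get(code)
    if url is None:
        url = db.get(code)
        if url is not None:
            cache[code] = url
    return url


def scrub(code, cache, db):
    """Hypothetical safe delete: remove the row and its cache entry together."""
    db.pop(code, None)
    cache.pop(code, None)


if __name__ == "__main__":
    db = {"abc": "https://example.com"}
    cache = {}

    lookup("abc", cache, db)          # warms the cache
    del db["abc"]                     # manual DELETE bypasses the cache
    print(lookup("abc", cache, db))   # stale: still returns the deleted URL

    scrub("abc", cache, db)           # coordinated removal
    print(lookup("abc", cache, db))   # None: cache and db agree again
```

After the manual delete, the short link keeps redirecting until the cached entry is removed or expires, which is the inconsistency the caution describes.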
- Bronze: Architecture & Data Flow documented.
- Silver: 45-min Deployment guide & Config reference complete.
- Gold: 7 failure scenarios tested and documented in Runbooks.
Incident Verification (Apr 5, 2026): Tested: Database Down, CPU Spike, Redis Loss, High Error Rate. Result: all four scenarios passed (100% resilience success).
- Redis Cluster: High availability for caching.
- Read Replicas: Scale to 2000+ RPS.
- Rate Limiting: Per-IP/User throttling.
- Custom Domains: Enterprise link support.
Is this production-grade?
Yes. It includes graceful failure handling, health checks, connection pooling, and automated failover.
How do I test the resilience?
Run `bash scripts/chaos.sh` to simulate failures and watch Grafana alerts trigger.
- Emergency? Check Runbooks.
- Issues? Raise a GitHub Issue.
- License: Apache 2.0 (Commercial use allowed).
Maintained for Production Performance (April 2026)
