The Problem: Reading logs at 2 AM sucks. In microservice architectures, a single failure cascades into thousands of logs across services. Manually correlating trace IDs and finding "patient zero" takes hours of high-stress work.
The Solution: Snorlax is an intelligent RCA engine that correlates traces in real-time, uses NLP Clustering to isolate the root failure from the "noise" logs, and synthesizes a human-readable diagnosis using Generative AI.
Project Snorlax has been upgraded from a synchronous monolith to a highly scalable, distributed system capable of handling high-volume production firehoses.
- Non-Blocking Ingestion: The `POST /api/logs` endpoint now buffers incoming logs into a Redis Stream and returns immediately, decoupling API response time from database latency (a minimal sketch follows this list).
- Asynchronous DB & Cache: Replaced `psycopg2` with `asyncpg` and implemented a connection pool for non-blocking database queries. Switched all Redis operations to `redis-py`'s async client.
- Distributed ML Pipeline: Heavy CPU-bound tasks (Sentence-Transformer embeddings, DBSCAN clustering) and LLM explanation calls have been moved to a dedicated Celery Worker. This ensures the API remains responsive during complex analysis.
- TimescaleDB Integration: Upgraded the database to TimescaleDB and enabled Hypertables for logs, significantly improving time-series insert performance and query efficiency at scale.
- Bulk Ingestion Worker: A new specialized worker, `worker_ingestion.py`, consumes logs from the Redis buffer in batches and performs bulk insertions, ensuring zero-drop reliability during traffic spikes.
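The sketch below illustrates the non-blocking ingestion path described above; it is a minimal, hypothetical example, and the stream name (`log_buffer`) and Celery task name (`analyze_incident`) are assumptions for illustration, not the project's exact identifiers.

```python
# Hypothetical sketch of the buffered ingestion path (stream and task names are illustrative).
import json
import redis.asyncio as aioredis
from celery import Celery
from fastapi import FastAPI

app = FastAPI()
redis_client = aioredis.Redis(host="redis", port=6379, decode_responses=True)
celery_app = Celery("snorlax", broker="redis://redis:6379/1")

@app.post("/api/logs")
async def ingest_log(log: dict):
    # Buffer the raw log into a Redis Stream and return immediately;
    # the ingestion worker persists it to TimescaleDB later.
    await redis_client.xadd("log_buffer", {"payload": json.dumps(log)})

    # Offload heavy analysis to the Celery worker when the log looks severe.
    if log.get("level") == "ERROR":
        celery_app.send_task("analyze_incident", args=[log.get("request_id")])

    return {"status": "buffered"}
```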
- Docker & Docker Compose
- Node.js & npm (for local frontend dev)
- An OpenRouter API Key
```bash
cd backend && cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY
docker compose up -d --build
```

This starts the full enterprise stack:
- FastAPI API: Async-first log ingestion.
- TimescaleDB: High-performance log storage.
- Redis: Stream buffer and Celery broker.
- Celery Worker: Background NLP/LLM tasks.
- Ingestion Worker: Log buffer consumer.
```bash
cd frontend && npm install && npm run dev
```

To meet the AI Track requirements, Snorlax implements a sophisticated multi-stage pipeline instead of just passing raw logs to an LLM:
We use the all-MiniLM-L6-v2 Sentence-Transformer to convert log messages into high-dimensional vectors. This allows Snorlax to understand meaning (e.g., "Connection refused" vs "Failed to connect") rather than just text matching.
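A minimal sketch of this step, assuming the `sentence-transformers` package; the sample messages and variable names are illustrative:

```python
# Hypothetical sketch: embed log messages into dense vectors for semantic comparison.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

messages = [
    "Connection refused by db-primary:5432",
    "Failed to connect to database",
    "HTTP 500 returned to client",
]

# encode() returns one 384-dimensional vector per message.
embeddings = model.encode(messages, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)
```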
We run the DBSCAN algorithm on the log vectors in a separate background process.
- The Innovation: By clustering log messages spatially, we can separate "Foundational Errors" from "Impact Logs" (the downstream noise).
- The Result: We identify the specific service and error cluster that triggered the cascade chronologically.
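A minimal sketch of the clustering step, assuming the `embeddings` array from the previous snippet and scikit-learn's DBSCAN; the `eps` and `min_samples` values shown are illustrative, not the project's tuned parameters:

```python
# Hypothetical sketch: cluster log embeddings to separate root-cause errors from ripple noise.
import numpy as np
from sklearn.cluster import DBSCAN

# `embeddings` is the (n_logs, 384) array produced by the Sentence-Transformer step.
clustering = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit(embeddings)

# Logs labelled -1 are outliers; every other label is a semantically coherent error group.
for label in set(clustering.labels_):
    members = np.where(clustering.labels_ == label)[0]
    print(f"cluster {label}: {len(members)} logs")
```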
Only the isolated cluster and immediately preceding context are sent to the LLM. This reduces token noise by ~80% and significantly improves the pinpoint accuracy of the final diagnosis.
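A minimal sketch of the synthesis step, assuming an OpenAI-compatible async client pointed at OpenRouter (which the stack's `AsyncOpenAI` dependency suggests); the model slug and prompt wording are illustrative:

```python
# Hypothetical sketch: send only the isolated root-cause cluster (plus context) to the LLM.
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

async def explain_incident(root_cluster_logs: list[str], context_logs: list[str]) -> str:
    prompt = (
        "You are an SRE assistant. Given the suspected root-cause error cluster and the "
        "logs that immediately preceded it, produce a one-paragraph diagnosis and fix.\n\n"
        "Preceding context:\n" + "\n".join(context_logs) + "\n\n"
        "Root-cause cluster:\n" + "\n".join(root_cluster_logs)
    )
    response = await client.chat.completions.create(
        model="openai/gpt-4o-mini",  # illustrative model slug
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```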
```mermaid
graph TD
    Service[Microservice] -->|POST /api/logs| API[FastAPI API]
    API -->|XADD| Stream[Redis Stream Log Buffer]
    Stream -->|XREADGROUP| IW[Ingestion Worker]
    IW -->|Bulk Insert| DB[(TimescaleDB)]
    API -->|delay| Broker[Redis Broker]
    Broker -->|run| CW[Celery Worker]
    CW -->|NLP/LLM| CW
    CW -->|Update Incident| DB
    DB -->|Live Update| Dash[React Dashboard]
```
Snorlax is a production-grade engineering system, not an ad-hoc AI wrapper. By implementing a systematic pipeline with background task offloading and stream buffering, we solve the "Log Spam" problem that overwhelms general-purpose AI tools.
| Feature | Snorlax (Enterprise) | AI CLI (Tool) |
|---|---|---|
| Ingestion | Buffered: Redis Streams handle high-volume spikes | Manual: Limited by shell/copy-paste |
| Detection | Proactive: Triggers automatically on error thresholds | Reactive: Requires human to start analysis |
| Processing | Distributed: Async workers handle heavy ML tasks | Serial: Blocks local terminal or event loop |
| Noise Filtering | DBSCAN Clustering: Mathematically isolates root cause | Ad-hoc: Overwhelms LLM with "Ripple" noise |
| Scaling | TimescaleDB: Optimized for millions of rows | Memory/Local: Fails on large datasets |
To integrate your existing microservices with Snorlax, you simply need to POST your logs to the Snorlax ingestion endpoint.
URL: http://<snorlax-ip>:8000/api/logs
Method: POST
Headers: Content-Type: application/json
Snorlax expects the following JSON schema:
```json
{
  "timestamp": "2026-04-11T10:45:00Z",
  "service": "your-service-name",
  "level": "ERROR",
  "request_id": "unique-correlation-id",
  "message": "The actual error message",
  "stack_trace": "Optional stack trace string",
  "context": {
    "any": "additional",
    "metadata": "here"
  }
}
```

Copy `sdk/python/snorlax.py` into your project:
```python
from snorlax import SnorlaxClient

snorlax = SnorlaxClient(endpoint="http://localhost:8000")
snorlax.log("Connection Timeout", service="auth-service", request_id="req-123")
```

```javascript
const axios = require('axios');

async function logToSnorlax(msg, service, requestId) {
  await axios.post('http://localhost:8000/api/logs', {
    timestamp: new Date().toISOString(),
    service: service,
    level: 'ERROR',
    request_id: requestId || 'gen-' + Math.random(),
    message: msg
  });
}
```

Run this in your terminal to verify your connection to Snorlax:
```bash
curl -X POST http://localhost:8000/api/logs \
  -H "Content-Type: application/json" \
  -d '{"timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","service":"test-cli","level":"ERROR","request_id":"test-123","message":"Connection Refused"}'
```

For a full end-to-end demonstration of the RCA pipeline:
- Open the Dashboard.
- Scroll to the System Diagnostics & Quick Test section at the bottom.
- Click Run Incident Simulation.
- Snorlax will automatically ingest a correlated error chain, run the NLP clustering, and display the final diagnosis in seconds.
To test Snorlax against a variety of complex real-world scenarios (Memory Leaks, Configuration Cascades, External API Failures, etc.):
- Ensure the system is running: `docker compose up`.
- Execute the test suite logic: `python scripts/test_incidents.py`
- This script simulates 5 distinct high-fidelity incident bundles and allows you to ingest them into the pipeline for validation.
> [!TIP]
> Trace Correlation: Ensure all services involved in a single request use the same `request_id`. This allows Snorlax to stitch the failure chain together for RCA.
Imagine a high-pressure restaurant kitchen where multiple chefs (microservices) work together. If a dish is ruined, everyone starts yelling—these are your Error Logs. Traditionally, a developer must read through thousands of lines of everyone yelling to find the one chef who actually made the first mistake. Snorlax is an automated detective. It listens to all services, filters out the noisy "ripple effect" errors, identifies Patient Zero (the root cause), and provides a clear, one-sentence diagnosis and fix.
Snorlax acts as a high-speed central mailbox. Every microservice (Auth, Payment, DB, etc.) is configured to send a copy of its error messages to the POST /api/logs endpoint. These logs use a structured JSON format containing a Timestamp, Service Name, Message, and a Request ID (correlation ID) to track the specific user action.
Sending 10,000 raw logs to an LLM is slow, expensive, and often results in "hallucinations" because the AI gets overwhelmed by noise. Snorlax’s innovation is our Anti-Noise Strategy:
- Mathematical Isolation: We use local, lightweight ML (Sentence-Transformers + DBSCAN clustering) to isolate the "Patient Zero" error cluster mathematically first.
- Targeted Synthesis: We only send the isolated root cause to the expensive LLM. This is a real engineering pipeline designed for accuracy and cost-efficiency, not just an AI wrapper.
Our architecture is built with enterprise-grade "shock absorbers":
- Redis Streams: Acts as a high-speed buffer. Even if 100,000 logs arrive in a second, the API instantly pushes them to a stream and returns, ensuring zero-drop ingestion.
- The Ingestion Worker: A dedicated `worker_ingestion.py` consumes the stream and performs bulk inserts into the database, preventing DB bottlenecks (see the sketch after this list).
- The Split Brain: We use Celery to separate the web server from the AI math. Heavy clustering and LLM calls happen in the background so the dashboard stays fast.
- TimescaleDB: We swapped standard Postgres for TimescaleDB, which is natively designed to handle billions of time-stamped logs.
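Here is a minimal sketch of what that ingestion consumer could look like, assuming redis-py's async client and asyncpg; the stream, group, and table names and the connection DSN are illustrative assumptions, not the project's exact configuration.

```python
# Hypothetical sketch of the batch ingestion worker (stream/group/table names and DSN are illustrative).
import asyncio
import json
from datetime import datetime

import asyncpg
import redis.asyncio as aioredis
from redis.exceptions import ResponseError


async def consume_logs() -> None:
    r = aioredis.Redis(host="redis", port=6379, decode_responses=True)
    pool = await asyncpg.create_pool(dsn="postgresql://snorlax:snorlax@timescaledb/snorlax")

    # Make sure the consumer group exists before reading from the stream.
    try:
        await r.xgroup_create("log_buffer", "ingestors", id="0", mkstream=True)
    except ResponseError:
        pass  # group already exists

    while True:
        # Read up to 500 buffered logs as part of the consumer group, blocking up to 2s.
        entries = await r.xreadgroup(
            groupname="ingestors", consumername="worker-1",
            streams={"log_buffer": ">"}, count=500, block=2000,
        )
        if not entries:
            continue

        _stream, messages = entries[0]
        records = []
        for _msg_id, fields in messages:
            log = json.loads(fields["payload"])
            records.append((
                datetime.fromisoformat(log["timestamp"].replace("Z", "+00:00")),
                log["service"], log["level"], log.get("request_id"), log["message"],
            ))

        # Bulk insert into the TimescaleDB hypertable, then acknowledge the batch.
        async with pool.acquire() as conn:
            await conn.copy_records_to_table(
                "logs", records=records,
                columns=["timestamp", "service", "level", "request_id", "message"],
            )
        await r.xack("log_buffer", "ingestors", *[msg_id for msg_id, _ in messages])


if __name__ == "__main__":
    asyncio.run(consume_logs())
```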
Through Trace Correlation. If a "Database Down" error in one service causes a "500 Internal Error" in another, both will share the same request_id. Snorlax groups these together, sorts them by precise millisecond timestamps, and realizes the downstream errors are just secondary noise, leaving the Database error as the clear root cause.
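In code, that correlation step boils down to something like the following sketch; the field names are taken from the ingestion schema above, and the function itself is illustrative rather than the project's exact implementation:

```python
# Hypothetical sketch: group correlated logs and surface the earliest error as "Patient Zero".
from collections import defaultdict
from typing import Optional

def find_patient_zero(logs: list[dict]) -> Optional[dict]:
    """Group logs by request_id, sort each chain by timestamp, and return
    the chronologically earliest ERROR as the root-cause candidate."""
    chains: dict[str, list[dict]] = defaultdict(list)
    for log in logs:
        chains[log["request_id"]].append(log)

    candidates = []
    for chain in chains.values():
        chain.sort(key=lambda entry: entry["timestamp"])
        errors = [entry for entry in chain if entry["level"] == "ERROR"]
        if errors:
            candidates.append(errors[0])  # first error in this request's timeline

    # Across all failing requests, the earliest error is the clearest root-cause signal.
    return min(candidates, key=lambda entry: entry["timestamp"]) if candidates else None
```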
No. Snorlax is a Diagnostic Tool. It reads the stack traces included in the error logs to identify the exact file and line number where the failure occurred. Automatically rewriting and deploying code during a crash is dangerous. Snorlax focuses on handing the developer the "smoking gun" so they can make a safe, informed fix in seconds.
- AI: Sentence-Transformers, scikit-learn (DBSCAN), OpenRouter (LLMs)
- Backend: FastAPI, AsyncOpenAI, asyncpg, Celery
- Data: TimescaleDB (Postgres 15), Redis 7 (Streams)
- Frontend: React 18, Recharts, Lucide, Pure CSS
Built for the HACK X Hackathon 2026. 🐻💤