sai21-learn/snorlax
Project Snorlax 🐻💤 - Automated Root Cause Analysis

The Problem: Reading logs at 2 AM sucks. In microservice architectures, a single failure cascades into thousands of logs across services. Manually correlating trace IDs and finding "patient zero" takes hours of high-stress manual work.

The Solution: Snorlax is an intelligent RCA engine that correlates traces in real-time, uses NLP Clustering to isolate the root failure from the "noise" logs, and synthesizes a human-readable diagnosis using Generative AI.


⚡ Enterprise-Ready Refactor (v2.0)

Project Snorlax has been upgraded from a synchronous monolith to a highly scalable, distributed system capable of handling high-volume production firehoses.

Key Architectural Improvements:

  1. Non-Blocking Ingestion: The POST /api/logs endpoint now buffers incoming logs into a Redis Stream and returns immediately, decoupling the API response time from database latency.
  2. Asynchronous DB & Cache: Replaced psycopg2 with asyncpg and implemented a connection pool for non-blocking database queries. Switched all Redis operations to redis-py's async client.
  3. Distributed ML Pipeline: Heavy CPU-bound tasks (Sentence-Transformer embeddings, DBSCAN clustering) and LLM explanation calls have been moved to a dedicated Celery Worker. This ensures the API remains responsive during complex analysis.
  4. TimescaleDB Integration: Upgraded the database to TimescaleDB and enabled Hypertables for logs, significantly improving time-series insert performance and query efficiency at scale.
  5. Bulk Ingestion Worker: A new specialized worker, worker_ingestion.py, consumes logs from the Redis buffer in batches and performs bulk insertions, ensuring zero-drop reliability during traffic spikes.
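The non-blocking handoff in step 1 hinges on one detail: Redis Stream entries are flat string-to-string maps, so a nested field like `context` must be serialized before `XADD`. The helper below sketches that flattening step (field names follow the payload schema later in this README; the commented endpoint wiring is a hypothetical illustration, not the project's actual code):

```python
import json

def to_stream_fields(log: dict) -> dict:
    """Flatten a structured log record into the str->str map XADD expects."""
    fields = {k: str(v) for k, v in log.items() if k != "context"}
    # Nested metadata must be JSON-encoded: stream fields cannot hold dicts.
    fields["context"] = json.dumps(log.get("context", {}))
    return fields

# Inside a (hypothetical) async endpoint, the buffered handoff is then just:
#   await redis.xadd("log_buffer", to_stream_fields(payload))
#   return {"status": "buffered"}   # returns before any DB work happens

fields = to_stream_fields({
    "service": "auth-service",
    "level": "ERROR",
    "message": "Connection refused",
    "context": {"region": "us-east-1"},
})
print(fields["context"])  # '{"region": "us-east-1"}'
```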

🚀 Getting Started

1. Prerequisites

  • Docker & Docker Compose
  • Node.js & npm (for local frontend dev)
  • An OpenRouter API Key

2. Environment Setup

cd backend && cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

3. Start the Pipeline

docker compose up -d --build

This starts the full enterprise stack:

  • FastAPI API: Async-first log ingestion.
  • TimescaleDB: High-performance log storage.
  • Redis: Stream buffer and Celery broker.
  • Celery Worker: Background NLP/LLM tasks.
  • Ingestion Worker: Log buffer consumer.
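The Ingestion Worker's core trick is flushing in bulk rather than inserting row by row. The real worker reads entries via `XREADGROUP` and bulk-inserts with asyncpg; the class below (names and defaults hypothetical) sketches only the flush-on-size-or-age logic that keeps the database happy during spikes:

```python
import time

class LogBatcher:
    """Accumulate stream entries and flush in bulk when full or stale."""

    def __init__(self, max_size=500, max_age_s=1.0, flush=print):
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.flush = flush        # e.g. a bulk-INSERT call in the real worker
        self.buf = []
        self.started = None

    def add(self, entry):
        if not self.buf:
            self.started = time.monotonic()
        self.buf.append(entry)
        if len(self.buf) >= self.max_size or time.monotonic() - self.started >= self.max_age_s:
            self._drain()

    def _drain(self):
        if self.buf:
            self.flush(self.buf)  # one bulk insert instead of N round-trips
            self.buf = []

batches = []
b = LogBatcher(max_size=3, flush=batches.append)
for i in range(7):
    b.add({"id": i})
b._drain()                        # flush the trailing partial batch
print([len(x) for x in batches])  # [3, 3, 1]
```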

4. Start the Dashboard

cd frontend && npm install && npm run dev

🧠 AI Pipeline & Technical Complexity

To meet the AI Track requirements, Snorlax implements a sophisticated multi-stage pipeline instead of just passing raw logs to an LLM:

1. Semantic Embedding

We use the all-MiniLM-L6-v2 Sentence-Transformer to convert log messages into high-dimensional vectors. This allows Snorlax to understand meaning (e.g., "Connection refused" vs "Failed to connect") rather than just text matching.

2. DBSCAN Clustering (Isolation)

We run the DBSCAN algorithm on the log vectors in a separate background process.

  • The Innovation: By clustering log messages spatially, we can separate "Foundational Errors" from "Impact Logs" (the downstream noise).
  • The Result: We identify the specific service and error cluster that chronologically triggered the cascade.
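A minimal sketch of the clustering step, using toy 2-D points in place of the real 384-dimensional embeddings (the `eps`/`min_samples` values are illustrative, not the project's actual tuning):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three near-identical "root" errors, two "ripple" errors, one outlier.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0],
                 [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(vecs)
print(labels.tolist())  # [0, 0, 0, 1, 1, -1] — DBSCAN marks noise as -1
```

The dense cluster whose members appear earliest in the timeline is the "Foundational Error" candidate; everything else is downstream noise.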

3. Generative Synthesis (LLM)

Only the isolated cluster and immediately preceding context are sent to the LLM. This reduces token noise by ~80% and significantly improves the pinpoint accuracy of the final diagnosis.
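The token-saving idea reduces to careful prompt assembly: only the isolated cluster plus a slice of preceding context reaches the model. A pure string-building sketch (the function name and prompt wording are hypothetical; the actual OpenRouter call is omitted):

```python
def build_rca_prompt(root_cluster, context_logs):
    """Package only the isolated cluster plus nearby context for the LLM."""
    lines = ["You are an SRE assistant. Diagnose the root cause.", "", "Preceding context:"]
    lines += [f"  {log['service']}: {log['message']}" for log in context_logs]
    lines += ["", "Isolated root-cause cluster:"]
    lines += [f"  {log['service']}: {log['message']}" for log in root_cluster]
    lines.append("\nRespond with a one-sentence diagnosis and a suggested fix.")
    return "\n".join(lines)

prompt = build_rca_prompt(
    root_cluster=[{"service": "db", "message": "FATAL: too many connections"}],
    context_logs=[{"service": "db", "message": "connection pool nearing limit"}],
)
print(prompt)
```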


🏗 Architecture & Pipeline

graph TD
    Service[Microservice] -->|POST /api/logs| API[FastAPI API]
    API -->|XADD| Stream[Redis Stream Log Buffer]
    Stream -->|XREADGROUP| IW[Ingestion Worker]
    IW -->|Bulk Insert| DB[(TimescaleDB)]
    API -->|delay| Broker[Redis Broker]
    Broker -->|run| CW[Celery Worker]
    CW -->|NLP/LLM| CW
    CW -->|Update Incident| DB
    DB -->|Live Update| Dash[React Dashboard]

⚖️ Why Snorlax? (Architecture vs. Wrappers)

Snorlax is a production-grade engineering system, not an ad-hoc AI wrapper. By implementing a systematic pipeline with background task offloading and stream buffering, we solve the "Log Spam" problem that overwhelms general-purpose AI tools.

| Feature | Snorlax (Enterprise) | AI CLI (Tool) |
| --- | --- | --- |
| Ingestion | Buffered: Redis Streams handle high-volume spikes | Manual: Limited by shell/copy-paste |
| Detection | Proactive: Triggers automatically on error thresholds | Reactive: Requires human to start analysis |
| Processing | Distributed: Async workers handle heavy ML tasks | Serial: Blocks local terminal or event loop |
| Noise Filtering | DBSCAN Clustering: Mathematically isolates root cause | Ad-hoc: Overwhelms LLM with "Ripple" noise |
| Scaling | TimescaleDB: Optimized for millions of rows | Memory/Local: Fails on large datasets |

🔌 Integration Guide

To integrate your existing microservices with Snorlax, you simply need to POST your logs to the Snorlax ingestion endpoint.

1. Ingestion Endpoint

URL: http://<snorlax-ip>:8000/api/logs
Method: POST
Headers: Content-Type: application/json

2. Log Payload Structure

Snorlax expects the following JSON schema:

{
  "timestamp": "2026-04-11T10:45:00Z",
  "service": "your-service-name",
  "level": "ERROR",
  "request_id": "unique-correlation-id",
  "message": "The actual error message",
  "stack_trace": "Optional stack trace string",
  "context": {
    "any": "additional",
    "metadata": "here"
  }
}

3. Implementation Examples

Python (Official SDK)

Copy sdk/python/snorlax.py into your project:

from snorlax import SnorlaxClient

snorlax = SnorlaxClient(endpoint="http://localhost:8000")
snorlax.log("Connection Timeout", service="auth-service", request_id="req-123")

Node.js (Axios)

const axios = require('axios');

async function logToSnorlax(msg, service, requestId) {
    await axios.post('http://localhost:8000/api/logs', {
        timestamp: new Date().toISOString(),
        service: service,
        level: 'ERROR',
        request_id: requestId || 'gen-' + Math.random(),
        message: msg
    });
}

4. One-Command Integration Test 🧪

Run this in your terminal to verify your connection to Snorlax:

curl -X POST http://localhost:8000/api/logs \
     -H "Content-Type: application/json" \
     -d '{"timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","service":"test-cli","level":"ERROR","request_id":"test-123","message":"Connection Refused"}'

5. One-Click Quick Test (UI) 🚀

For a full end-to-end demonstration of the RCA pipeline:

  1. Open the Dashboard.
  2. Scroll to the System Diagnostics & Quick Test section at the bottom.
  3. Click Run Incident Simulation.
  4. Snorlax will automatically ingest a correlated error chain, run the NLP clustering, and display the final diagnosis in seconds.

6. Advanced Incident Scenarios (CLI) 🛠️

To test Snorlax against a variety of complex real-world scenarios (Memory Leaks, Configuration Cascades, External API Failures, etc.):

  1. Ensure the system is running: docker compose up.
  2. Run the test suite:
    python scripts/test_incidents.py
  3. This script simulates 5 distinct high-fidelity incident bundles and allows you to ingest them into the pipeline for validation.

Tip

Trace Correlation: Ensure all services involved in a single request use the same request_id. This allows Snorlax to stitch the failure chain together for RCA.


❓ Frequently Asked Questions (FAQ)

1. What is Snorlax's core mission?

Imagine a high-pressure restaurant kitchen where multiple chefs (microservices) work together. If a dish is ruined, everyone starts yelling—these are your Error Logs. Traditionally, a developer must read through thousands of lines of everyone yelling to find the one chef who actually made the first mistake. Snorlax is an automated detective. It listens to all services, filters out the noisy "ripple effect" errors, identifies Patient Zero (the root cause), and provides a clear, one-sentence diagnosis and fix.

2. How does it collect errors from a distributed system?

Snorlax acts as a high-speed central mailbox. Every microservice (Auth, Payment, DB, etc.) is configured to send a copy of its error messages to the POST /api/logs endpoint. These logs use a structured JSON format containing a Timestamp, Service Name, Message, and a Request ID (correlation ID) to track the specific user action.

3. Why not just paste logs into ChatGPT?

Sending 10,000 raw logs to an LLM is slow, expensive, and often results in "hallucinations" because the AI gets overwhelmed by noise. Snorlax’s innovation is our Anti-Noise Strategy:

  1. Mathematical Isolation: We use local, lightweight ML (Sentence-Transformers + DBSCAN clustering) to isolate the "Patient Zero" error cluster mathematically first.
  2. Targeted Synthesis: We only send the isolated root cause to the expensive LLM. This is a real engineering pipeline designed for accuracy and cost-efficiency, not just an AI wrapper.

4. How does Snorlax handle massive scale?

Our architecture is built with enterprise-grade "shock absorbers":

  • Redis Streams: Acts as a high-speed buffer. Even if 100,000 logs arrive in a second, the API instantly pushes them to a stream and returns, ensuring zero-drop ingestion.
  • The Ingestion Worker: A dedicated worker_ingestion.py consumes the stream and performs bulk inserts into the database, preventing DB bottlenecks.
  • The Split Brain: We use Celery to separate the web server from the AI math. Heavy clustering and LLM calls happen in the background so the dashboard stays fast.
  • TimescaleDB: We swapped standard Postgres for TimescaleDB, which is natively designed to handle billions of time-stamped logs.

5. How does it trace errors through a chain of services?

Through Trace Correlation. If a "Database Down" error in one service causes a "500 Internal Error" in another, both will share the same request_id. Snorlax groups these together, sorts them by precise millisecond timestamps, and realizes the downstream errors are just secondary noise, leaving the Database error as the clear root cause.
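The correlation described above reduces to a small grouping-and-sorting step; a stdlib-only sketch (field names follow the payload schema from the Integration Guide):

```python
from collections import defaultdict

logs = [
    {"timestamp": "2026-04-11T10:45:00.450Z", "service": "api-gw",   "request_id": "req-1", "message": "500 Internal Error"},
    {"timestamp": "2026-04-11T10:45:00.120Z", "service": "orders",   "request_id": "req-1", "message": "DB query failed"},
    {"timestamp": "2026-04-11T10:45:00.050Z", "service": "postgres", "request_id": "req-1", "message": "Database down"},
    {"timestamp": "2026-04-11T10:45:02.000Z", "service": "auth",     "request_id": "req-2", "message": "token expired"},
]

# Group by correlation ID, then sort each chain chronologically.
chains = defaultdict(list)
for log in logs:
    chains[log["request_id"]].append(log)

for rid, chain in chains.items():
    chain.sort(key=lambda l: l["timestamp"])  # ISO-8601 strings sort chronologically
    # The earliest error in the chain is the root-cause candidate.
    print(rid, "->", chain[0]["service"], "-", chain[0]["message"])
```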

6. Does Snorlax access or fix my code automatically?

No. Snorlax is a Diagnostic Tool. It reads the stack traces included in the error logs to identify the exact file and line number where the failure occurred. Automatically rewriting and deploying code during a crash is dangerous. Snorlax focuses on handing the developer the "smoking gun" so they can make a safe, informed fix in seconds.


🛠 Tech Stack

  • AI: Sentence-Transformers, scikit-learn (DBSCAN), OpenRouter (LLMs)
  • Backend: FastAPI, AsyncOpenAI, asyncpg, Celery
  • Data: TimescaleDB (Postgres 15), Redis 7 (Streams)
  • Frontend: React 18, Recharts, Lucide, Pure CSS

Built for the HACK X Hackathon 2026. 🐻💤
