Classify production error and log lines into categories for automatic triage and routing (e.g. to the right team or runbook).
Categories: auth · timeout · config · bug · dependency · rate-limit
Model: TF-IDF + LogisticRegression on message text. Trained on a curated dataset plus optional Loghub logs (Linux, Apache) with rule-based labels.
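The model described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repo's actual src.train code: the toy messages, the n-gram range, and max_iter are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CATEGORIES = ["auth", "timeout", "config", "bug", "dependency", "rate-limit"]

# Made-up toy examples for illustration only; not rows from the repo's dataset.
train_messages = [
    "Invalid credentials for user admin",
    "Authentication token expired",
    "Connection timed out after 30s",
    "Request timeout while waiting for upstream",
    "Missing required setting DATABASE_URL",
    "Invalid value in config file",
    "NullPointerException in OrderService",
    "IndexError: list index out of range",
    "Failed to import module redis",
    "Package version conflict for numpy",
    "429 Too Many Requests",
    "Rate limit exceeded for API key",
]
train_labels = [
    "auth", "auth",
    "timeout", "timeout",
    "config", "config",
    "bug", "bug",
    "dependency", "dependency",
    "rate-limit", "rate-limit",
]

# TF-IDF turns each message into a sparse vector of weighted word and bigram counts;
# logistic regression then learns one linear decision function per category.
# The hyperparameters here are assumptions, not the repo's settings.
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_messages, train_labels)

predicted = pipeline.predict(["Connection timeout after 30s"])[0]
print(predicted)
```

Bundling the vectorizer and the classifier into one pipeline object means the same text preprocessing is applied at training and prediction time.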
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/macOS
pip install -r requirements.txt
- Curated: data/labeled_logs.csv (columns: message, category). Included in the repo.
- Optional – Loghub: Download e.g. Linux (2.25 MiB) or Apache (4.9 MiB), extract, and put the log file(s) under data/loghub/Linux/ or data/loghub/Apache/. Then run:
python scripts/process_loghub.py
This creates data/loghub_labeled.csv. The trainer merges it with the curated CSV automatically.
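Rule-based labeling of raw log lines could look roughly like the sketch below. The keyword patterns, the `label_line` and `label_lines` helpers, and the sample lines are all hypothetical; the real rules in scripts/process_loghub.py may differ.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical keyword rules, checked in order: the first matching rule wins.
RULES = [
    ("rate-limit", re.compile(r"rate limit|too many requests|\b429\b", re.I)),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("auth", re.compile(
        r"authentication failure|permission denied|unauthorized"
        r"|invalid (user|password|credentials)", re.I)),
    ("config", re.compile(r"config|invalid (option|directive)", re.I)),
    ("dependency", re.compile(
        r"no module named|cannot find (module|library)|shared object", re.I)),
    ("bug", re.compile(r"exception|traceback|segfault|segmentation fault", re.I)),
]


def label_line(line: str) -> Optional[str]:
    """Return the first category whose pattern matches, or None if no rule fires."""
    for category, pattern in RULES:
        if pattern.search(line):
            return category
    return None


def label_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (message, category) pairs, skipping blank and unmatched lines."""
    for line in lines:
        message = line.strip()
        category = label_line(message)
        if message and category is not None:
            yield message, category


# Made-up sample lines for illustration.
sample = [
    "sshd: authentication failure; logname= uid=0",
    "Connection timed out after 30s",
    "jk2_init() Found child 6725 in scoreboard slot 10",
]
for message, category in label_lines(sample):
    print(f"{category}\t{message}")
```

Lines that match no rule are dropped rather than given a guessed label, which keeps noisy rows out of the training set at the cost of a smaller dataset.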
python -m src.train
Uses data/labeled_logs.csv (and data/loghub_labeled.csv if present). Prints accuracy, F1, per-class metrics, and a confusion matrix. Saves the pipeline to models/log_classifier.joblib.
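The kinds of metrics the trainer prints can be computed with scikit-learn as in the sketch below. The label lists are invented for illustration and are not results from this project. Macro averaging for F1 is an assumption, since the averaging mode is not stated here.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

CATEGORIES = ["auth", "timeout", "config", "bug", "dependency", "rate-limit"]

# Invented true and predicted labels, for illustration only.
y_true = ["auth", "timeout", "config", "bug", "dependency", "rate-limit", "timeout", "auth"]
y_pred = ["auth", "timeout", "bug", "bug", "dependency", "timeout", "timeout", "auth"]

accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores with equal weight per category.
# zero_division=0 scores a class as 0 when it is never predicted.
macro_f1 = f1_score(y_true, y_pred, labels=CATEGORIES, average="macro", zero_division=0)

# Rows are true categories, columns are predicted ones, both in CATEGORIES order.
cm = confusion_matrix(y_true, y_pred, labels=CATEGORIES)

print(f"accuracy: {accuracy:.2f}")
print(f"macro F1: {macro_f1:.2f}")
print(classification_report(y_true, y_pred, labels=CATEGORIES, zero_division=0))
print(cm)
```

Off-diagonal cells of the confusion matrix show which categories get mistaken for each other. In this toy data, one config message is predicted as bug and one rate-limit message as timeout.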
python -m src.classify "Connection timeout after 30s"
Outputs the suggested category and its confidence. Requires a trained model (run python -m src.train first).
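One plausible way to get a category and a confidence from a saved pipeline is to take the highest class probability from predict_proba. The sketch below is an assumption about how that could work, not the contents of src.classify: the `classify` helper is hypothetical, and a tiny stand-in model is trained and saved to a temporary directory so the example runs without models/log_classifier.joblib.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def classify(model, message):
    """Return (category, confidence), using the top class probability as confidence."""
    probabilities = model.predict_proba([message])[0]
    best = probabilities.argmax()
    return str(model.classes_[best]), float(probabilities[best])


# Tiny stand-in model trained on made-up messages, for illustration only.
messages = [
    "Invalid credentials for user admin",
    "Authentication token expired",
    "Connection timed out after 30s",
    "Request timeout while waiting for upstream",
    "429 Too Many Requests",
    "Rate limit exceeded for API key",
]
labels = ["auth", "auth", "timeout", "timeout", "rate-limit", "rate-limit"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(messages, labels)

# Round-trip through joblib, mirroring how a saved pipeline would be reloaded.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "log_classifier.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

category, confidence = classify(loaded, "Connection timeout after 30s")
print(f"{category} ({confidence:.2f})")
```

The top probability is only a rough confidence signal. A low value can still be useful for routing uncertain messages to manual triage.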
Open notebooks/results_analysis.ipynb and run all cells to see the metrics, the confusion matrix, and examples of correct and wrong predictions, with short notes on why the model fails in some cases.