software-engineering-final/log-classifier

 
 

log-classifier

Classify production error and log lines into categories for automatic triage and routing (e.g. to the right team or runbook).

Categories: auth · timeout · config · bug · dependency · rate-limit

Model: TF-IDF features + LogisticRegression on the message text. Trained on a curated dataset plus, optionally, Loghub logs (Linux, Apache) that are labeled by rules.
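As a rough illustration of the model described above, the following sketch wires a TF-IDF vectorizer into a logistic regression classifier with scikit-learn. The toy messages, the `build_pipeline` helper, and the hyperparameters (`ngram_range`, `max_iter`) are assumptions made for this example; the actual settings in src/train.py may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy examples for illustration only; these are not rows from data/labeled_logs.csv.
TOY_MESSAGES = [
    "Invalid credentials for user admin",
    "Authentication failed: token expired",
    "Connection timeout after 30s",
    "Request timed out waiting for upstream",
    "Missing required config key DATABASE_URL",
    "Invalid value in config file settings.yaml",
]
TOY_LABELS = ["auth", "auth", "timeout", "timeout", "config", "config"]


def build_pipeline() -> Pipeline:
    """Hypothetical helper: TF-IDF features followed by logistic regression."""
    return Pipeline([
        # Unigrams and bigrams are an assumed choice, not taken from the repo.
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


pipeline = build_pipeline()
pipeline.fit(TOY_MESSAGES, TOY_LABELS)

prediction = pipeline.predict(["Connection timeout while calling payment service"])[0]
print(prediction)
```

Keeping the vectorizer and the classifier in one Pipeline object means the same text preprocessing is applied at training and at prediction time, and the whole thing can be saved as a single file.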


Setup

python -m venv venv
venv\Scripts\activate   # Windows
# source venv/bin/activate   # Linux/macOS
pip install -r requirements.txt

Data

  • Curated: data/labeled_logs.csv (columns: message, category). Included in the repo.

  • Optional – Loghub: Download a Loghub dataset, e.g. Linux (2.25 MiB) or Apache (4.9 MiB), extract it, and place the log file(s) under data/loghub/Linux/ or data/loghub/Apache/. Then run:

    python scripts/process_loghub.py

    This creates data/loghub_labeled.csv. The trainer merges it with the curated CSV automatically.
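Rule-based labeling of raw log lines could look roughly like the sketch below. The regular expressions, the rule order, and the `label_line` helper are hypothetical and only illustrate the idea; the rules actually used in scripts/process_loghub.py are not shown in this README and may differ.

```python
import re
from typing import Optional

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("rate-limit", re.compile(r"rate limit|too many requests|throttl", re.I)),
    ("auth", re.compile(r"authentication failure|invalid user|permission denied|unauthorized", re.I)),
    ("timeout", re.compile(r"timed out|timeout", re.I)),
    ("config", re.compile(r"config|directive|invalid option", re.I)),
    ("dependency", re.compile(r"connection refused|cannot load|module not found", re.I)),
    ("bug", re.compile(r"segmentation fault|null pointer|traceback|assertion", re.I)),
]


def label_line(message: str) -> Optional[str]:
    """Return the first matching category, or None if no rule applies."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return None


sample_lines = [
    "sshd(pam_unix): authentication failure; logname= uid=0",
    "proxy: read timed out from backend",
    "session opened for user root",
]
for line in sample_lines:
    print(label_line(line), "|", line)
```

Lines that match no rule return None here, so a caller can skip them instead of forcing a label onto them.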

Train

python -m src.train

Trains on data/labeled_logs.csv (plus data/loghub_labeled.csv if present). Prints accuracy, F1, per-class metrics, and the confusion matrix, then saves the fitted pipeline to models/log_classifier.joblib.
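For readers unfamiliar with these metrics, the sketch below computes them on a small made-up set of predictions using scikit-learn. The labels and predictions are invented for illustration, and macro averaging for F1 is an assumption, since the README does not say which averaging the trainer reports.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Invented labels and predictions; these are not results from this project.
LABELS = ["auth", "timeout", "config"]
y_true = ["auth", "auth", "timeout", "timeout", "config", "config"]
y_pred = ["auth", "timeout", "timeout", "timeout", "config", "auth"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every class equally (assumed choice).
macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
# Rows are true classes, columns are predicted classes, both in LABELS order.
matrix = confusion_matrix(y_true, y_pred, labels=LABELS)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))
print(matrix)
```

In the confusion matrix, off-diagonal entries show which categories are mistaken for each other, for example a true config line predicted as auth.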

Classify

python -m src.classify "Connection timeout after 30s"

Prints the suggested category and its confidence. Requires a trained model, so run python -m src.train first.
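One plausible way to get a category and confidence from a saved pipeline is sketched below. It assumes that "confidence" means the highest class probability from `predict_proba`, which the README does not confirm. The `classify` function and the toy stand-in model are hypothetical; with the real project you would load models/log_classifier.joblib instead of the temporary file used here.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def classify(pipeline, message: str) -> tuple[str, float]:
    """Return the most probable category and its predicted probability."""
    probabilities = pipeline.predict_proba([message])[0]
    best = probabilities.argmax()
    return str(pipeline.classes_[best]), float(probabilities[best])


# Stand-in model so the sketch runs on its own; the real one is saved by src.train.
toy = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy.fit(
    ["login failed for user", "request timed out",
     "connection timeout after 10s", "invalid password"],
    ["auth", "timeout", "timeout", "auth"],
)

# Round-trip through joblib to mirror saving and loading a trained pipeline.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "log_classifier.joblib"
    joblib.dump(toy, model_path)
    loaded = joblib.load(model_path)

category, confidence = classify(loaded, "Connection timeout after 30s")
print(f"{category} ({confidence:.2f})")
```

A low confidence value can be used as a signal to route the line to manual triage rather than trusting the suggested category.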

Results and analysis

Open notebooks/results_analysis.ipynb and run all cells to see the metrics, the confusion matrix, and examples of correct and incorrect predictions, with short notes on why the model fails in some cases.
