hammerdirt-analyst/evaluate_encounters

🧪 Simple Pipeline for Binary Classifiers

Author: Roger Erismann

Purpose: Estimate and evaluate exceedance probabilities using configurable classification models. Originally developed to model plastic shotgun-wadding encounters on Lake Geneva's beaches.

The repository includes .qgz and .shp files for reproducing the maps.

🚀 Overview

This repository provides a modular, extensible framework for evaluating the probability of a numeric target variable exceeding various thresholds (e.g., P(quantity ≥ X)).

Core Capabilities:

  • Binary classification over cumulative thresholds
  • Model tuning and comparison using scikit-learn-compatible classifiers
  • Evaluation across full threshold ranges
  • Region-level or group-based summarization
  • Automated saving of results in .csv and .json formats
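To make "binary classification over cumulative thresholds" concrete, here is a minimal sketch that is not taken from the repository. It assumes the numeric target is turned into one binary label per threshold (1 if the target is at least the threshold) and that a separate classifier is fit for each label. The toy data, the `feature` column and the `exceedance_probabilities` helper are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): one numeric feature and a count target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature": rng.normal(size=200)})
demo["quantity"] = rng.poisson(lam=np.exp(0.5 * demo["feature"]))


def exceedance_probabilities(df, feature_cols, target_col, thresholds):
    """Fit one binary classifier per threshold t and return P(target >= t)."""
    probs = {}
    for t in thresholds:
        # Cumulative binarization: 1 when the target reaches the threshold.
        y = (df[target_col] >= t).astype(int)
        if y.nunique() < 2:
            # A classifier cannot be fit when only one class is present.
            continue
        clf = LogisticRegression(max_iter=1000).fit(df[feature_cols], y)
        probs[t] = clf.predict_proba(df[feature_cols])[:, 1]
    return pd.DataFrame(probs, index=df.index)


exceed = exceedance_probabilities(demo, ["feature"], "quantity", [1, 2, 3])
print(exceed.head())
```

Each column of `exceed` holds the estimated probability of meeting or exceeding one threshold. The sketch scores the training rows only to stay short; a real evaluation would score held-out data.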

📁 Modules

classifiers.py

Contains the core logic and the ClassifierPipeline class, which handles:

  • Data preprocessing
  • Model tuning
  • Best-model selection
  • Threshold-based evaluation
  • Output summarization and export

The pipeline is driven by a configuration dictionary. See Usage below.
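As a rough illustration of the preprocessing and tuning steps listed above, the sketch below wires a one-hot encoder and a classifier into a scikit-learn `Pipeline` and searches a parameter grid with `GridSearchCV`. It is an assumption about how such a step could look, not the code in classifiers.py. The step name `clf` is chosen so that grid keys take the form `clf__<parameter>`; the `tune_classifier` helper, the ROC AUC scoring choice and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def tune_classifier(X, y, categorical_cols, model, param_grid, cv=3):
    """Grid-search a classifier behind a one-hot encoding step."""
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    # The classifier step is named "clf", so grid keys look like "clf__C".
    pipe = Pipeline([("pre", pre), ("clf", model)])
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
    return search.fit(X, y)


# Toy data (illustrative only): a single categorical feature.
X = pd.DataFrame({"region": ["a", "b", "c", "a", "b", "c"] * 10})
y = pd.Series([1, 0, 0, 1, 1, 0] * 10)

search = tune_classifier(
    X, y, ["region"],
    LogisticRegression(max_iter=1000),
    {"clf__C": [0.1, 1, 10]},
)
print(search.best_params_)
```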

⚙️ Configuration Example

The pipeline is configured using a dictionary. Here's a simplified example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

# model_defs: the scikit-learn-compatible models to tune and the parameter grids to search
model_defs = {
    LogisticRegression.__name__: {
        "model": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "param_grid": {"clf__C": [0.01, 0.1, 1, 10]}
    },
    MultinomialNB.__name__: {
        "model": MultinomialNB(),
        "param_grid": {"clf__alpha": [0.1, 1.0, 5.0]}
    },
    RandomForestClassifier.__name__: {
        "model": RandomForestClassifier(n_jobs=-1, class_weight="balanced"),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [4, 8, None]
        }
    },
    XGBClassifier.__name__: {
        "model": XGBClassifier(
            use_label_encoder=False,
            eval_metric="logloss",
            n_jobs=-1,
            verbosity=0
        ),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [3, 6],
            "clf__scale_pos_weight": [1, 2]
        }
    }
}

# model_classes: maps each model name to its class, used when evaluating models on the full grid
model_classes = {
    LogisticRegression.__name__: LogisticRegression,
    RandomForestClassifier.__name__: RandomForestClassifier,
    XGBClassifier.__name__: XGBClassifier,
    MultinomialNB.__name__: MultinomialNB
}

config = {
    "task": "classification",
    "target_column": "quantity",
    "columns": ["region"],
    "categorical_cols": ["region"],
    "numeric_cols": [],
    "summary_column": "region",
    "split": {
        "method": "date",
        "date_column": "date",
        "date_split": "2022-01-01"
    },
    "split_name": "date_split",
    "thresholds": [1, 2, 3],
    "threshold_step": 1.0,
    "model_defs": model_defs,  # defined above
    "model_classes": model_classes,  # defined above
    "selection_metric": {
        "method": "mean",
        "columns": ["1", "2", "3"],
        "maximize": True
    },
    "output_dir": "data/test_results"
}
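The `selection_metric` entry names an aggregation method, the threshold columns to aggregate and whether higher is better. One plausible reading, sketched below as an assumption and not as the pipeline's actual logic, is that each candidate model gets one score per threshold column and the model with the best aggregate wins. The `select_best_model` helper and the scores are made up for illustration.

```python
import pandas as pd


def select_best_model(scores, selection_metric):
    """Return the row label with the best aggregate over the chosen columns."""
    cols = selection_metric["columns"]
    # Aggregate across the threshold columns for each model (row).
    aggregate = scores[cols].agg(selection_metric["method"], axis=1)
    return aggregate.idxmax() if selection_metric["maximize"] else aggregate.idxmin()


# Invented scores: one row per model, one column per threshold.
scores = pd.DataFrame(
    {"1": [0.80, 0.85], "2": [0.70, 0.72], "3": [0.60, 0.55]},
    index=["LogisticRegression", "RandomForestClassifier"],
)

best = select_best_model(
    scores,
    {"method": "mean", "columns": ["1", "2", "3"], "maximize": True},
)
print(best)
```

With these toy numbers the row means are 0.70 and about 0.707, so the second model is selected.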

💠 Usage

Running a Custom Task

from classifiers import csv_to_dataframe, ClassifierPipeline
from my_config import config  # Define your config

df = csv_to_dataframe("path/to/data.csv")
pipeline = ClassifierPipeline(df, config)
pipeline.run()

Running All Predefined Evaluations

evaluate_encounters.py

Defines and runs multiple classification tasks using different targets and split strategies (e.g., by date or randomly). It demonstrates:

  • How to build and run multiple ClassifierPipeline configurations
  • How to evaluate models on both raw count and rate targets

This script covers the original use case for the pipeline. To run all predefined evaluations:

from evaluate_encounters import evaluate_encounters

evaluate_encounters()
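The two split strategies mentioned above (by date or randomly) can be sketched as follows. This is a hedged illustration, not the repository's implementation. The helper names are hypothetical, and the choice to put rows dated on the cutoff into the test set is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_date(df, date_column, date_split):
    """Rows before the cutoff form the training set, the rest the test set."""
    dates = pd.to_datetime(df[date_column])
    cutoff = pd.Timestamp(date_split)
    return df[dates < cutoff], df[dates >= cutoff]


def split_randomly(df, test_size=0.2, seed=42):
    """Random split with a fixed seed so the result is reproducible."""
    return train_test_split(df, test_size=test_size, random_state=seed)


# Toy data (illustrative only).
df = pd.DataFrame({
    "date": ["2021-06-01", "2021-12-31", "2022-01-01", "2022-07-15"],
    "quantity": [0, 3, 1, 5],
})

train, test = split_by_date(df, "date", "2022-01-01")
random_train, random_test = split_randomly(df)
print(len(train), len(test), len(random_train), len(random_test))
```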

📄 Output

The pipeline produces:

  • *_summary.csv and .json — Threshold performance summaries
  • *_test_predictions.csv — Full prediction outputs with probabilities
  • Logs in logs/classifiers.log (configurable)

All filenames include:

  • the split strategy (split_name)
  • the target column
  • the output type (e.g. summary, tuning_summary)
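A sketch of how such filenames might be assembled and written is shown below. The exact naming pattern used by the pipeline is not documented here, so the `<split_name>_<target>_<output_type>` order is an assumption, as are the `save_summary` helper and the toy summary table.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_summary(summary, output_dir, split_name, target_column, output_type="summary"):
    """Write a summary table as .csv and .json under a descriptive filename."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumed pattern: split strategy, then target column, then output type.
    stem = f"{split_name}_{target_column}_{output_type}"
    csv_path = out / f"{stem}.csv"
    json_path = out / f"{stem}.json"
    summary.to_csv(csv_path, index=False)
    summary.to_json(json_path, orient="records", indent=2)
    return csv_path, json_path


# Invented summary values, written to a temporary directory.
summary = pd.DataFrame({"threshold": [1, 2, 3], "accuracy": [0.8, 0.7, 0.6]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = save_summary(summary, tmp, "date_split", "quantity")
    csv_name = csv_path.name
    records = json.loads(json_path.read_text())

print(csv_name)
```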

📦 Dependencies

  • Python 3.10+
  • pandas
  • numpy
  • scikit-learn
  • xgboost-cpu

Utilities

error_utilities.py

Provides utilities for error handling:

  • handle_error: Logs and returns structured error messages
  • handle_errors: A decorator to apply error handling consistently across pipeline functions
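The general pattern behind these two helpers can be sketched as below. The real signatures in error_utilities.py are not shown in this README, so the arguments, the returned dictionary and the logger name are all assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger("pipeline_sketch")  # hypothetical logger name


def handle_error(exc, context):
    """Log the exception and return a structured error message."""
    logger.error("%s failed: %s", context, exc)
    return {
        "ok": False,
        "context": context,
        "error": type(exc).__name__,
        "message": str(exc),
    }


def handle_errors(func):
    """Decorator that routes any exception from func through handle_error."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return handle_error(exc, func.__name__)
    return wrapper


@handle_errors
def safe_ratio(a, b):
    return a / b


result = safe_ratio(1, 0)
print(result)
```

Returning a dictionary keeps a failing step from crashing a long run, but callers then have to check the result. Re-raising after logging is the usual alternative when a failure should stop the pipeline.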

logging_config.py

Initializes and configures loggers for clean, centralized logging across all modules:

  • Supports both file and console logging
  • Prevents duplicate handlers
  • Automatically ensures log directories exist
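The three behaviours listed above map onto a small amount of standard-library code. The sketch below assumes a `get_logger` helper, which is a hypothetical name, and shows one common way to implement them. It is not the content of logging_config.py.

```python
import logging
import tempfile
from pathlib import Path


def get_logger(name, log_file, level=logging.INFO):
    """Return a logger that writes to both a file and the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if logger.handlers:
        # Already configured: return as-is to avoid duplicate handlers.
        return logger
    log_path = Path(log_file)
    # Make sure the log directory exists before opening the file.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


# A temporary directory stands in for the project's logs/ folder.
log_file = Path(tempfile.mkdtemp()) / "logs" / "classifiers.log"
first = get_logger("classifiers_sketch", log_file)
second = get_logger("classifiers_sketch", log_file)  # reuses the same handlers
first.info("pipeline started")
```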

Contact: roger@hammerdirt.ch

About

An analysis of event probabilities using binary classification.
