Author: Roger Erismann
Purpose: Estimate and evaluate exceedance probabilities using configurable classification models. Originally developed to model plastic shotgun-wadding encounters on Lake Geneva's beaches.
Includes .qgz and .shp files to reproduce maps.
This repository provides a modular, extensible framework for evaluating the probability of a numeric target variable exceeding various thresholds (e.g., P(quantity ≥ X)).
- Binary classification over cumulative thresholds
- Model tuning and comparison using scikit-learn-compatible classifiers
- Evaluation across full threshold ranges
- Region-level or group-based summarization
- Automated saving of results in `.csv` and `.json` formats
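The cumulative-threshold idea can be illustrated with a minimal sketch (not the pipeline's internal code): for each threshold t, the numeric target is binarized as y = (quantity ≥ t), so each threshold becomes its own binary classification problem.

```python
import numpy as np

def binarize_targets(quantities, thresholds):
    """Build one binary label vector per threshold: y_t = (quantity >= t)."""
    quantities = np.asarray(quantities)
    return {t: (quantities >= t).astype(int) for t in thresholds}

# Each key holds the labels for one exceedance problem, P(quantity >= t).
labels = binarize_targets([0, 1, 2, 5, 3], thresholds=[1, 2, 3])
```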
Contains the core logic and the `ClassifierPipeline` class, which handles:
- Data preprocessing
- Model tuning
- Best-model selection
- Threshold-based evaluation
- Output summarization and export
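The tuning and best-model-selection steps above can be sketched with plain scikit-learn (a simplified illustration, assuming a `clf` step name as used in the param grids below; the pipeline's actual preprocessing differs):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The "clf__" prefix in a param grid addresses the classifier step of the pipeline.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=3, scoring="roc_auc")
search.fit(X, y)
best_model = search.best_estimator_  # refit on the full training data
```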
The pipeline is driven by a configuration dictionary. Here's a simplified example:
```python
config = {
    "task": "classification",
    "target_column": "quantity",
    "columns": ["region"],
    "categorical_cols": ["region"],
    "numeric_cols": [],
    "summary_column": "region",
    "split": {
        "method": "date",
        "date_column": "date",
        "date_split": "2022-01-01"
    },
    "split_name": "date_split",
    "thresholds": [1, 2, 3],
    "threshold_step": 1.0,
    "model_defs": model_defs,        # see below
    "model_classes": model_classes,  # see below
    "selection_metric": {
        "method": "mean",
        "columns": ["1", "2", "3"],
        "maximize": True
    },
    "output_dir": "data/test_results"
}
```
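The `split` block drives the train/test split: with `method: "date"`, rows dated before `date_split` train the models and the rest are held out for testing. A minimal sketch of that behavior (an illustration, not the pipeline's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-12-15", "2022-03-01"]),
    "region": ["a", "b", "a"],
    "quantity": [2, 0, 5],
})

# Rows before the cutoff train; rows on or after it test.
cutoff = pd.Timestamp("2022-01-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```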
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

# model_defs maps each sklearn-compatible model to the hyperparameters to try
model_defs = {
    LogisticRegression.__name__: {
        "model": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "param_grid": {"clf__C": [0.01, 0.1, 1, 10]}
    },
    MultinomialNB.__name__: {
        "model": MultinomialNB(),
        "param_grid": {"clf__alpha": [0.1, 1.0, 5.0]}
    },
    RandomForestClassifier.__name__: {
        "model": RandomForestClassifier(n_jobs=-1, class_weight="balanced"),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [4, 8, None]
        }
    },
    XGBClassifier.__name__: {
        "model": XGBClassifier(
            use_label_encoder=False,
            eval_metric="logloss",
            n_jobs=-1,
            verbosity=0
        ),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [3, 6],
            "clf__scale_pos_weight": [1, 2]
        }
    }
}
```
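The `selection_metric` in the config ranks these candidate models by aggregating their per-threshold scores. A hedged sketch of that idea (column names match the `thresholds` above; the score values are illustrative, not real results):

```python
import pandas as pd

# Per-model scores at each threshold (illustrative values only).
scores = pd.DataFrame(
    {"1": [0.81, 0.78], "2": [0.74, 0.80], "3": [0.66, 0.71]},
    index=["LogisticRegression", "RandomForestClassifier"],
)

# method="mean" over the listed columns, maximize=True -> pick the highest mean.
means = scores[["1", "2", "3"]].mean(axis=1)
best = means.idxmax()
```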
```python
# model_classes maps names to constructors for evaluating models on the full grid
model_classes = {
    LogisticRegression.__name__: LogisticRegression,
    RandomForestClassifier.__name__: RandomForestClassifier,
    XGBClassifier.__name__: XGBClassifier,
    MultinomialNB.__name__: MultinomialNB
}
```

```python
from classifiers import csv_to_dataframe, ClassifierPipeline
from my_config import config  # define your config

df = csv_to_dataframe("path/to/data.csv")
pipeline = ClassifierPipeline(df, config)
pipeline.run()
```

Defines and runs multiple classification tasks using different targets and split strategies (e.g., by date or randomly). It demonstrates:
- How to build and run multiple `ClassifierPipeline` configurations
- How to evaluate models on both raw count and rate targets

This is the original use case for this pipeline.

```python
from evaluate_encounters import evaluate_encounters

evaluate_encounters()
```

The pipeline produces:
- `*_summary.csv` and `.json` — Threshold performance summaries
- `*_test_predictions.csv` — Full prediction outputs with probabilities
- Logs in `logs/classifiers.log` (configurable)
All filenames include:

- the split strategy (`split_name`)
- the target column
- the output type (e.g. `summary`, `tuning_summary`)
- Python 3.10+
- pandas
- numpy
- scikit-learn
- xgboost-cpu
Provides utilities for error handling:

- `handle_error`: Logs and returns structured error messages
- `handle_errors`: A decorator to apply error handling consistently across pipeline functions
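A minimal sketch of what such a pair might look like (the actual signatures in the module may differ):

```python
import functools
import logging

logger = logging.getLogger("classifiers")

def handle_error(exc: Exception, context: str = "") -> dict:
    """Log the exception and return a structured error message."""
    message = {"error": type(exc).__name__, "detail": str(exc), "context": context}
    logger.error("%s", message)
    return message

def handle_errors(func):
    """Wrap a pipeline function so failures become structured results."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return handle_error(exc, context=func.__name__)
    return wrapper

@handle_errors
def divide(a, b):
    return a / b
```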
Initializes and configures loggers for clean, centralized logging across all modules:
- Supports both file and console logging
- Prevents duplicate handlers
- Automatically ensures log directories exist
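The duplicate-handler guard and directory creation described above can be sketched as follows (function and format names are illustrative, not the module's actual API):

```python
import logging
from pathlib import Path

def get_logger(name: str, log_file: str = "logs/classifiers.log") -> logging.Logger:
    logger = logging.getLogger(name)
    if logger.handlers:  # prevent duplicate handlers on repeated calls
        return logger
    logger.setLevel(logging.INFO)
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)  # ensure log dir exists
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
        handler.setFormatter(fmt)
        logger.addHandler(handler)  # one file handler, one console handler
    return logger
```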
contact: roger@hammerdirt.ch