Author: Roger Erismann
Purpose: Estimate and evaluate exceedance probabilities using configurable classification models. Originally developed to model plastic shotgun-wadding encounters on Lake Geneva's beaches.
Includes .qgz and .shp files to reproduce maps.
This repository provides a modular, extensible framework for evaluating the probability of a numeric target variable exceeding various thresholds (e.g., P(quantity ≥ X)).
- Binary classification over cumulative thresholds
- Model tuning and comparison using scikit-learn-compatible classifiers
- Evaluation across full threshold ranges
- Region-level or group-based summarization
- Automated saving of results in `.csv` and `.json` formats
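The cumulative-threshold idea can be illustrated with a minimal sketch (not the pipeline's internal code): for each threshold t, the numeric target is binarized as y = (quantity ≥ t), so each threshold becomes its own binary classification problem.

```python
import numpy as np

def binarize_targets(quantities, thresholds):
    """Build one binary label vector per threshold: y_t = (quantity >= t)."""
    quantities = np.asarray(quantities)
    return {t: (quantities >= t).astype(int) for t in thresholds}

# Each key holds the labels for one exceedance problem, P(quantity >= t).
labels = binarize_targets([0, 1, 2, 5, 3], thresholds=[1, 2, 3])
```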
Contains the core logic and the `ClassifierPipeline` class, which handles:
- Data preprocessing
- Model tuning
- Best-model selection
- Threshold-based evaluation
- Output summarization and export
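The tuning and best-model-selection steps above can be sketched with plain scikit-learn (a simplified illustration, assuming a `clf` step name as used in the param grids below; the pipeline's actual preprocessing differs):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The "clf__" prefix in a param grid addresses the classifier step of the pipeline.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=3, scoring="roc_auc")
search.fit(X, y)
best_model = search.best_estimator_  # refit on the full training data
```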
The pipeline is driven by a configuration dictionary. Here's a simplified example:
```python
config = {
    "task": "classification",
    "target_column": "quantity",
    "columns": ["region"],
    "categorical_cols": ["region"],
    "numeric_cols": [],
    "summary_column": "region",
    "split": {
        "method": "date",
        "date_column": "date",
        "date_split": "2022-01-01"
    },
    "split_name": "date_split",
    "thresholds": [1, 2, 3],
    "threshold_step": 1.0,
    "model_defs": model_defs,        # see below
    "model_classes": model_classes,  # see below
    "selection_metric": {
        "method": "mean",
        "columns": ["1", "2", "3"],
        "maximize": True
    },
    "output_dir": "data/test_results"
}
```
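The `split` block drives the train/test split: with `method: "date"`, rows dated before `date_split` train the models and the rest are held out for testing. A minimal sketch of that behavior (an illustration, not the pipeline's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-12-15", "2022-03-01"]),
    "region": ["a", "b", "a"],
    "quantity": [2, 0, 5],
})

# Rows before the cutoff train; rows on or after it test.
cutoff = pd.Timestamp("2022-01-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```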
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

# model_defs maps each sklearn-compatible model to the hyperparameters to try
model_defs = {
    LogisticRegression.__name__: {
        "model": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "param_grid": {"clf__C": [0.01, 0.1, 1, 10]}
    },
    MultinomialNB.__name__: {
        "model": MultinomialNB(),
        "param_grid": {"clf__alpha": [0.1, 1.0, 5.0]}
    },
    RandomForestClassifier.__name__: {
        "model": RandomForestClassifier(n_jobs=-1, class_weight="balanced"),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [4, 8, None]
        }
    },
    XGBClassifier.__name__: {
        "model": XGBClassifier(
            use_label_encoder=False,
            eval_metric="logloss",
            n_jobs=-1,
            verbosity=0
        ),
        "param_grid": {
            "clf__n_estimators": [100],
            "clf__max_depth": [3, 6],
            "clf__scale_pos_weight": [1, 2]
        }
    }
}
```
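The `selection_metric` in the config ranks these candidate models by aggregating their per-threshold scores. A hedged sketch of that idea (column names match the `thresholds` above; the score values are illustrative, not real results):

```python
import pandas as pd

# Per-model scores at each threshold (illustrative values only).
scores = pd.DataFrame(
    {"1": [0.81, 0.78], "2": [0.74, 0.80], "3": [0.66, 0.71]},
    index=["LogisticRegression", "RandomForestClassifier"],
)

# method="mean" over the listed columns, maximize=True -> pick the highest mean.
means = scores[["1", "2", "3"]].mean(axis=1)
best = means.idxmax()
```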
```python
# model_classes maps names to constructors for evaluating models on the full grid
model_classes = {
    LogisticRegression.__name__: LogisticRegression,
    RandomForestClassifier.__name__: RandomForestClassifier,
    XGBClassifier.__name__: XGBClassifier,
    MultinomialNB.__name__: MultinomialNB
}
```

```python
from classifiers import csv_to_dataframe, ClassifierPipeline
from my_config import config  # define your config

df = csv_to_dataframe("path/to/data.csv")
pipeline = ClassifierPipeline(df, config)
pipeline.run()
```

Defines and runs multiple classification tasks using different targets and split strategies (e.g., by date or randomly). It demonstrates:
- How to build and run multiple `ClassifierPipeline` configurations
- How to evaluate models on both raw count and rate targets

This is the original use case for this pipeline.

```python
from evaluate_encounters import evaluate_encounters

evaluate_encounters()
```

The pipeline produces:
- `*_summary.csv` and `.json` — Threshold performance summaries
- `*_test_predictions.csv` — Full prediction outputs with probabilities
- Logs in `logs/classifiers.log` (configurable)
All filenames include:

- the split strategy (`split_name`)
- the target column
- the output type (e.g. `summary`, `tuning_summary`)
- Python 3.10+
- pandas
- numpy
- scikit-learn
- xgboost-cpu
Provides utilities for error handling:

- `handle_error`: Logs and returns structured error messages
- `handle_errors`: A decorator to apply error handling consistently across pipeline functions
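A minimal sketch of what such a pair might look like (the actual signatures in the module may differ):

```python
import functools
import logging

logger = logging.getLogger("classifiers")

def handle_error(exc: Exception, context: str = "") -> dict:
    """Log the exception and return a structured error message."""
    message = {"error": type(exc).__name__, "detail": str(exc), "context": context}
    logger.error("%s", message)
    return message

def handle_errors(func):
    """Wrap a pipeline function so failures become structured results."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return handle_error(exc, context=func.__name__)
    return wrapper

@handle_errors
def divide(a, b):
    return a / b
```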
Initializes and configures loggers for clean, centralized logging across all modules:
- Supports both file and console logging
- Prevents duplicate handlers
- Automatically ensures log directories exist
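The duplicate-handler guard and directory creation described above can be sketched as follows (function and format names are illustrative, not the module's actual API):

```python
import logging
from pathlib import Path

def get_logger(name: str, log_file: str = "logs/classifiers.log") -> logging.Logger:
    logger = logging.getLogger(name)
    if logger.handlers:  # prevent duplicate handlers on repeated calls
        return logger
    logger.setLevel(logging.INFO)
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)  # ensure log dir exists
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
        handler.setFormatter(fmt)
        logger.addHandler(handler)  # one file handler, one console handler
    return logger
```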
contact: roger@hammerdirt.ch