Improve ML auto-predictions: bug fixes, feature pipeline, Optuna tuning #10
davidchris merged 9 commits into main
Conversation
…hold Fix merchant mapper rules being dead code: confidence was capped at 0.95 but checks used strict > 0.95, so rules never fired. Raise cap to 0.98 and change comparisons to >=. Add configurable auto-approve threshold (default 0.95) via Settings UI slider, backed by new app_settings DB table and GET/PUT /ml/settings API endpoints. Replace all hardcoded 0.95 thresholds in batch-predict, re-predict, and upload flows. Closes #9 (related: creates issue for auto-retrain after reviews) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
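The dead-rule bug can be shown in miniature (the names below are illustrative, not the repo's actual identifiers): capping confidence at the same value the auto-approve check strictly compares against guarantees the check never passes.

```python
# Illustrative sketch of the dead-rule bug: confidence is capped at the same
# value the auto-approve check compares against with a strict '>'.
OLD_CAP = 0.95
THRESHOLD = 0.95

def old_rule_confidence(base: float) -> float:
    return min(OLD_CAP, base)

# No base confidence can ever clear a strict '>' against its own cap.
dead = all(not (old_rule_confidence(b) > THRESHOLD) for b in (0.5, 0.95, 0.99))

# The fix: raise the cap above the threshold and compare inclusively.
NEW_CAP = 0.98

def new_rule_confidence(base: float) -> float:
    return min(NEW_CAP, base)

fires = new_rule_confidence(0.99) >= THRESHOLD
```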
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 368eb8c846
api/ml.py
Outdated
```python
async def update_ml_settings(payload: dict, db: Session = Depends(get_db_session)) -> dict:
    """Update ML settings."""
    if "auto_approve_threshold" in payload:
        value = float(payload["auto_approve_threshold"])
```
Handle invalid threshold payloads with a client error
update_ml_settings casts payload["auto_approve_threshold"] with float(...) but does not catch ValueError/TypeError, so requests like {"auto_approve_threshold": "abc"} or null raise an unhandled exception and return 500 instead of a validation 4xx. This makes malformed client input look like a server outage and breaks callers that rely on proper client-error semantics.
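A minimal sketch of the suggested fix, factored as a plain validator (the helper name and error strings are hypothetical; the real endpoint would map the error case to an HTTP 400 response):

```python
def parse_threshold(raw):
    """Validate an auto_approve_threshold value from a JSON payload.

    Returns (value, None) on success, or (None, error_message) when the
    input should produce a 400 client error instead of a 500.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None, "auto_approve_threshold must be a number"
    if not 0.0 <= value <= 1.0:
        return None, "auto_approve_threshold must be between 0 and 1"
    return value, None
```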
web/pages/settings_page.py
Outdated
```javascript
.then(r => r.json())
.then(data => {
    window._autoApproveThreshold = parseFloat(data.auto_approve_threshold);
    status.textContent = 'Saved! (' + parseFloat(data.auto_approve_threshold).toFixed(2) + ')';
```
Treat non-OK save responses as failures in threshold UI
The save flow always parses JSON and enters the success branch without checking response.ok, so a 4xx/5xx response from /api/ml/settings is still rendered as success. In those cases data.auto_approve_threshold is typically absent, producing NaN and showing a misleading Saved! (NaN) status even though the setting was not persisted.
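One way to sketch the fix is to factor the status-line decision into a pure helper that inspects response.ok and the payload before claiming success (the helper name is hypothetical):

```javascript
// Hypothetical helper: compute the settings status message from the fetch
// Response's ok/status flags and the parsed body, instead of assuming success.
function saveStatusMessage(ok, status, data) {
  if (!ok) {
    return 'Save failed (HTTP ' + status + ')';
  }
  const value = parseFloat(data && data.auto_approve_threshold);
  if (Number.isNaN(value)) {
    return 'Save failed (unexpected response)';
  }
  return 'Saved! (' + value.toFixed(2) + ')';
}
```

The fetch chain would then pass `r.ok` and `r.status` alongside the parsed JSON rather than branching on the body alone.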
Fix four P0 bugs in the ML pipeline:
- Calibration was fitted on test set, inflating reported metrics (now uses train set with internal 5-fold CV)
- Isotonic calibration replaced with sigmoid (Platt scaling), which is stable with small sample sizes (~100-200 txns)
- Ensemble LightGBM was training on full DB instead of the train split, leaking validation data into weight optimization (now uses new fit() method)
- min_samples_per_category raised from 3 to 5 to match 5-fold CV requirement

Adds TransactionCategorizer.fit() for pre-split training without DB access. Adds 7 e2e tests covering train(), fit(), and ensemble pipeline with @pytest.mark.slow marker for CI gating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
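The calibration fix follows the standard sklearn pattern; a minimal sketch on synthetic data (LogisticRegression stands in here for the repo's LightGBM base model):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sigmoid (Platt) calibration fitted with internal 5-fold CV on the TRAIN
# split only; the test split is used for evaluation, never for calibration.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
```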
5x5 repeated stratified k-fold CV on human-reviewed transactions only. Evaluates LightGBM, Naive Bayes, and Ensemble with full metrics suite (Macro F1, Log Loss, Brier, ECE, risk-coverage, Wilson CIs, Nadeau-Bengio). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
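The 5x5 evaluation protocol corresponds to sklearn's RepeatedStratifiedKFold; a sketch of the fold loop with the macro-F1 part of the metrics suite (synthetic data, LogisticRegression as a stand-in model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# 5 splits x 5 repeats = 25 train/test folds, stratified by category.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = []
for train_idx, test_idx in rskf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx]), average="macro"))

mean_f1, std_f1 = np.mean(scores), np.std(scores)
```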
…construction Replace the information-destroying fake probability reconstruction (argmax + uniform residual) with real calibrated probability vectors from LightGBM. Improves ensemble log loss by 9.4% and auto-approve error rate to 0.69%. Removes ~100 lines of dead code (LightGBMWrapper, fake proba converters). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
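Why argmax + uniform residual destroys information: it throws away the ranking among non-top classes, which inflates log loss whenever the second-ranked class turns out to be correct. A toy demonstration of the effect:

```python
import numpy as np
from sklearn.metrics import log_loss

def argmax_uniform(proba: np.ndarray) -> np.ndarray:
    """The old 'fake' reconstruction: keep the top probability and spread
    the residual mass uniformly over the remaining classes."""
    k = proba.shape[1]
    p_top = proba.max(axis=1)
    out = np.repeat(((1.0 - p_top) / (k - 1))[:, None], k, axis=1)
    out[np.arange(len(proba)), proba.argmax(axis=1)] = p_top
    return out

proba = np.array([[0.5, 0.4, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
y = np.array([1, 0, 1])  # the second-ranked class is sometimes the true one

real_ll = log_loss(y, proba, labels=[0, 1, 2])
fake_ll = log_loss(y, argmax_uniform(proba), labels=[0, 1, 2])
# fake_ll > real_ll: the reconstruction is strictly less informative here
```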
Add SepaFieldParser for extracting creditor_id, iban_bank_prefix, and mandate_ref from transaction purpose fields. Integrate into FeatureExtractor to strip SEPA noise from merchant names and text features. Fix creditor ID regex to match "Creditor ID:" label format used in prod data (was only matching CRED+ marker and Gläubiger-ID, finding 0% of transactions). Also fix pre-existing ty type error in ensemble_categorizer.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
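A hedged sketch of the parsing idea; the patterns, function name, and field names below are illustrative, not the repo's actual SepaFieldParser implementation:

```python
import re

# Illustrative patterns: match both the "Creditor ID:" label seen in prod
# data and the older CRED+ / Gläubiger-ID markers.
CREDITOR_ID = re.compile(
    r"(?:Creditor ID:|Gläubiger-ID:?|CRED\+)\s*([A-Z]{2}\d{2}[A-Z0-9]+)",
    re.IGNORECASE,
)
MANDATE_REF = re.compile(r"(?:Mandate Ref:|MREF\+)\s*([A-Za-z0-9./-]+)", re.IGNORECASE)

def parse_sepa_fields(purpose: str) -> tuple[dict, str]:
    """Extract SEPA fields and return them plus the purpose text with the
    SEPA noise stripped, so it doesn't pollute TF-IDF features."""
    fields = {}
    m = CREDITOR_ID.search(purpose)
    if m:
        fields["creditor_id"] = m.group(1)
    m = MANDATE_REF.search(purpose)
    if m:
        fields["mandate_ref"] = m.group(1)
    cleaned = MANDATE_REF.sub(" ", CREDITOR_ID.sub(" ", purpose))
    return fields, " ".join(cleaned.split())
```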
Experiment testing local LLMs (Phi-4-mini 3.8B, Qwen3-8B) as transaction classifiers, both standalone and hybrid with the ML ensemble. Key findings: ensemble (F1=0.867) outperforms best LLM (F1=0.674) at all thresholds. Decision: skip LLM integration. Implementation includes KV cache prefilling for mlx-lm (~6.5x speedup) and Qwen3 thinking mode fix (enable_thinking=False). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tune LightGBM hyperparameters with Optuna (100 trials, TPE sampler, MedianPruner). 5x5 CV macro F1 improves 0.8466 → 0.8549 (+0.83pp).

Key changes:
- Add char+word TF-IDF with SVD dimensionality reduction to categorizer
- Batch ML inference in predict_with_confidence (avoid per-txn overhead)
- Optuna tuning script with precomputed folds and early stopping
- Update config with tuned params (num_leaves=97, lr=0.043, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
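The char+word TF-IDF plus SVD step maps onto a standard sklearn pipeline; a toy-sized sketch (the PR uses n_components=100; the tiny corpus here forces a smaller value):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

text_features = Pipeline([
    ("tfidf", FeatureUnion([
        # Character n-grams catch merchant-name variants; word n-grams
        # catch recurring purpose-text phrases.
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    # Dense, low-dimensional output that gradient-boosted trees handle well.
    ("svd", TruncatedSVD(n_components=4, random_state=0)),
])

docs = [
    "rewe sagt danke",
    "amazon marketplace payment",
    "stadtwerke abschlag strom",
    "rewe markt einkauf",
    "netflix international bv",
    "amazon prime mitgliedschaft",
]
X = text_features.fit_transform(docs)
```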
Move torch, sentence-transformers, peft, mlx-lm, optuna, matplotlib, and pyarrow out of core dependencies into dependency groups (experiments, finetune, mlx). Production install drops from 121 to 77 packages.

Run experiment scripts with: uv run --group experiments scripts/tune_lgbm.py

Also removes an orphaned duplicate dependencies block under [project.urls].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
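The layout corresponds to PEP 735 dependency groups, which uv's --group flag consumes; a sketch with abbreviated package lists (the real ones live in the repo's pyproject.toml):

```toml
[dependency-groups]
experiments = ["optuna", "matplotlib", "pyarrow"]
finetune = ["torch", "sentence-transformers", "peft"]
mlx = ["mlx-lm"]
```

With this layout, `uv run --group experiments scripts/tune_lgbm.py` installs the core dependencies plus the experiments group for that run only.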
Address Codex review comments on PR #10:
- Return 400 on invalid threshold values (non-numeric, null)
- Check response.ok before showing success in settings UI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
End-to-end improvement of the ML categorization pipeline, from critical bug fixes through feature engineering to hyperparameter optimization.
Bug fixes (P0)
- CalibratedClassifierCV was fitted on the test set, inflating reported metrics. Now uses internal CV on training data with sigmoid (Platt) scaling via the new fit() method
- Merchant rule confidence was capped at min(0.95, ...) but checks used strict > 0.95, so all 103 merchant rules never fired. Raised cap to 0.98 and changed comparisons to >=
- Check response.ok in the UI before showing success
Feature pipeline improvements
- Char + word TF-IDF with TruncatedSVD(n_components=100) — replaces the single 500-feature char vectorizer
- SepaFieldParser extracts creditor_id, IBAN prefix, and mandate ref from purpose text. Strips SEPA noise markers from TF-IDF input
- Real calibrated predict_proba() vectors replace the argmax + uniform-residual reconstruction
- predict_with_confidence() batches all transactions instead of making per-transaction model calls
Hyperparameter tuning
- Optuna search (100 trials, TPE sampler, MedianPruner) over precomputed folds with early stopping
- 5x5 CV macro F1 improves 0.8466 → 0.8549 (+0.83pp); tuned params (num_leaves=97, lr=0.043, etc.) committed to config
Infrastructure
- app_settings DB table and GET/PUT /api/ml/settings endpoints
- scripts/establish_baseline.py — 5x5 repeated stratified k-fold CV with full metrics suite (macro F1, log loss, Brier score, ECE, risk-coverage, per-category Wilson CIs)
Results (5x5 CV, 1856 transactions, 26 categories)
Variance reduced across the board. Ensemble log loss improved 9.4% from probability fix.
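The per-category Wilson CIs in the metrics suite are the standard Wilson score interval for a binomial proportion; a self-contained sketch:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a per-category accuracy of successes/n.

    Unlike the normal approximation, it behaves sensibly for the small
    per-category counts a 26-category, 1856-transaction dataset yields.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```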
Test plan
- All tests pass (uv run pytest)
- ruff check, ruff format, ty check all pass
🤖 Generated with Claude Code