
Improve ML auto-predictions: bug fixes, feature pipeline, Optuna tuning #10

Merged
davidchris merged 9 commits into main from feat/improve-auto-predictions
Feb 16, 2026

Conversation


@davidchris davidchris commented Feb 10, 2026

Summary

End-to-end improvement of the ML categorization pipeline, from critical bug fixes through feature engineering to hyperparameter optimization.

Bug fixes (P0)

  • Fix calibration data leakage: CalibratedClassifierCV was fitted on the test set, inflating reported metrics. Now uses internal CV on training data with sigmoid (Platt) scaling
  • Fix ensemble training on full DB: LightGBM in ensemble was querying ALL transactions instead of the training split. Refactored to accept pre-split data via fit()
  • Fix merchant mapper dead code: Confidence capped at min(0.95, ...) but checked > 0.95 (strict), so all 103 merchant rules never fired. Raised cap to 0.98 and changed to >=
  • Validate threshold API payload: Return 400 on invalid values; check response.ok in UI before showing success
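The calibration fix can be sketched with scikit-learn on synthetic data (the dataset, base estimator, and sizes below are stand-ins for illustration, not the project's actual code):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pre-fix bug: calibrating on (X_te, y_te) leaks test labels into the
# reported metrics. Post-fix: internal 5-fold CV on the training split
# only; sigmoid (Platt) scaling stays stable at small per-fold sizes.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)  # the test set never touches calibration
proba = clf.predict_proba(X_te)
print(proba.shape)  # (75, 3)
```

The key property is that `X_te` appears only in `predict_proba`, never in `fit`, so held-out metrics measure true generalization.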

Feature pipeline improvements

  • Dual TF-IDF/SVD features: Combined char_wb (3-5, 1000 features) + word (1-2, 500 features) vectorizers, reduced via TruncatedSVD(n_components=100) — replaces single 500-feature char vectorizer
  • SEPA field extraction: New SepaFieldParser extracts creditor_id, IBAN prefix, mandate ref from purpose text. Strips SEPA noise markers from TF-IDF input
  • Full LightGBM probabilities in ensemble: Replaced fake probability reconstruction (argmax + uniform residual) with real calibrated predict_proba() vectors
  • Batch ML inference: Refactored predict_with_confidence() to batch all transactions instead of per-transaction model calls
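The dual-vectorizer design can be sketched with scikit-learn. This is a minimal illustration on a toy corpus with a reduced `n_components` so it runs standalone; per the summary above, the real pipeline uses 1000 char + 500 word features and `n_components=100`:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

def build_text_features(n_components):
    # Character n-grams are robust to typos and bank-statement
    # abbreviations; word n-grams capture whole merchant tokens.
    # Both feed a single SVD for dense, low-dimensional features.
    union = FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                 max_features=1000)),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 max_features=500)),
    ])
    return Pipeline([
        ("tfidf", union),
        ("svd", TruncatedSVD(n_components=n_components, random_state=42)),
    ])

texts = ["REWE SAGT DANKE 4401", "AMAZON EU SARL", "SPOTIFY AB",
         "REWE MARKT GMBH", "AMAZON PAYMENTS", "NETFLIX INTERNATIONAL"]
X = build_text_features(n_components=5).fit_transform(texts)
print(X.shape)  # (6, 5)
```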

Hyperparameter tuning

  • Optuna tuning (100 trials, TPE sampler, MedianPruner): Systematic search over 9 LightGBM hyperparameters with precomputed features and early stopping

Infrastructure

  • Configurable auto-approve threshold: Settings UI slider (0.50–0.99) backed by app_settings DB table and GET/PUT /api/ml/settings endpoints
  • Baseline evaluation script: scripts/establish_baseline.py — 5x5 repeated stratified k-fold CV with full metrics suite (macro F1, log loss, Brier score, ECE, risk-coverage, per-category Wilson CIs)
  • Slim production deps: Moved 7 experiment-only packages (torch, sentence-transformers, peft, mlx-lm, optuna, matplotlib, pyarrow) to dependency groups. Production install 121 → 77 packages
  • ML end-to-end tests: 210-line test suite covering training, prediction, serialization, and ensemble
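The 5x5 repeated stratified k-fold protocol from the baseline script can be sketched as follows (synthetic data and a Naive Bayes stand-in estimator; the real script also computes log loss, Brier, ECE, and Wilson CIs):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_classes=4, n_informative=8,
                           random_state=0)

# 5 folds x 5 repeats = 25 scores; repeating with different shuffles
# gives a variance estimate, not just a point estimate.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="f1_macro")
print(f"macro F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

The ± figures in the results table below follow this mean-and-std-over-25-folds convention.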

Results (5x5 CV, 1856 transactions, 26 categories)

| Model | Before | After | Change |
| --- | --- | --- | --- |
| LightGBM | 0.849 ± 0.022 | 0.856 ± 0.018 | +0.7pp |
| Ensemble | 0.855 ± 0.024 | 0.858 ± 0.018 | +0.3pp |
| Naive Bayes | 0.793 ± 0.023 | 0.793 ± 0.023 | unchanged |

Variance is reduced across the board. Ensemble log loss improved by 9.4% from the probability fix.

Test plan

  • All 145 tests pass (uv run pytest)
  • ruff check, ruff format, ty check all pass
  • CI passes
  • 5x5 CV baseline evaluation confirms improvement
  • Settings slider renders, persists, and validates input
  • Production install works without experiment dependencies

🤖 Generated with Claude Code

…hold

Fix merchant mapper rules being dead code: confidence was capped at 0.95
but checks used strict > 0.95, so rules never fired. Raise cap to 0.98
and change comparisons to >=. Add configurable auto-approve threshold
(default 0.95) via Settings UI slider, backed by new app_settings DB
table and GET/PUT /ml/settings API endpoints. Replace all hardcoded 0.95
thresholds in batch-predict, re-predict, and upload flows.

Closes #9 (related: creates issue for auto-retrain after reviews)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 368eb8c846


api/ml.py (outdated)
async def update_ml_settings(payload: dict, db: Session = Depends(get_db_session)) -> dict:
    """Update ML settings."""
    if "auto_approve_threshold" in payload:
        value = float(payload["auto_approve_threshold"])

P2: Handle invalid threshold payloads with a client error

update_ml_settings casts payload["auto_approve_threshold"] with float(...) but does not catch ValueError/TypeError, so requests like {"auto_approve_threshold": "abc"} or null raise an unhandled exception and return 500 instead of a validation 4xx. This makes malformed client input look like a server outage and breaks callers that rely on proper client-error semantics.

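The validation this review asks for can be sketched framework-agnostically. `parse_threshold` is a hypothetical helper for illustration, not the PR's actual code; the real fix lives inside `update_ml_settings`, which would raise a 400 response when an error string is returned:

```python
def parse_threshold(payload: dict):
    """Return (value, None) on success, or (None, error) for a 400."""
    raw = payload.get("auto_approve_threshold")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Catches non-numeric strings ("abc") and null/None payloads,
        # which previously escaped as unhandled 500s.
        return None, "auto_approve_threshold must be numeric"
    if not 0.50 <= value <= 0.99:
        # Matches the Settings UI slider range described in the PR.
        return None, "auto_approve_threshold must be within [0.50, 0.99]"
    return value, None

assert parse_threshold({"auto_approve_threshold": "abc"})[0] is None
assert parse_threshold({"auto_approve_threshold": None})[0] is None
assert parse_threshold({"auto_approve_threshold": 0.9}) == (0.9, None)
```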

Comment on lines +1446 to +1449
.then(r => r.json())
.then(data => {
    window._autoApproveThreshold = parseFloat(data.auto_approve_threshold);
    status.textContent = 'Saved! (' + parseFloat(data.auto_approve_threshold).toFixed(2) + ')';

P2: Treat non-OK save responses as failures in the threshold UI

The save flow always parses JSON and enters the success branch without checking response.ok, so a 4xx/5xx response from /api/ml/settings is still rendered as success. In those cases data.auto_approve_threshold is typically absent, producing NaN and showing a misleading Saved! (NaN) status even though the setting was not persisted.


davidchris and others added 8 commits February 10, 2026 19:15
Fix four P0 bugs in the ML pipeline:
- Calibration was fitted on test set, inflating reported metrics (now uses
  train set with internal 5-fold CV)
- Isotonic calibration replaced with sigmoid (Platt scaling), which is
  stable with small sample sizes (~100-200 txns)
- Ensemble LightGBM was training on full DB instead of the train split,
  leaking validation data into weight optimization (now uses new fit() method)
- min_samples_per_category raised from 3 to 5 to match 5-fold CV requirement

Adds TransactionCategorizer.fit() for pre-split training without DB access.
Adds 7 e2e tests covering train(), fit(), and ensemble pipeline with
@pytest.mark.slow marker for CI gating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5x5 repeated stratified k-fold CV on human-reviewed transactions only.
Evaluates LightGBM, Naive Bayes, and Ensemble with full metrics suite
(Macro F1, Log Loss, Brier, ECE, risk-coverage, Wilson CIs, Nadeau-Bengio).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…construction

Replace the information-destroying fake probability reconstruction (argmax +
uniform residual) with real calibrated probability vectors from LightGBM.
Improves ensemble log loss by 9.4% and auto-approve error rate to 0.69%.
Removes ~100 lines of dead code (LightGBMWrapper, fake proba converters).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SepaFieldParser for extracting creditor_id, iban_bank_prefix, and
mandate_ref from transaction purpose fields. Integrate into FeatureExtractor
to strip SEPA noise from merchant names and text features. Fix creditor ID
regex to match "Creditor ID:" label format used in prod data (was only
matching CRED+ marker and Gläubiger-ID, finding 0% of transactions).

Also fix pre-existing ty type error in ensemble_categorizer.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Experiment testing local LLMs (Phi-4-mini 3.8B, Qwen3-8B) as transaction
classifiers, both standalone and hybrid with the ML ensemble. Key findings:
ensemble (F1=0.867) outperforms best LLM (F1=0.674) at all thresholds.
Decision: skip LLM integration.

Implementation includes KV cache prefilling for mlx-lm (~6.5x speedup)
and Qwen3 thinking mode fix (enable_thinking=False).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tune LightGBM hyperparameters with Optuna (100 trials, TPE sampler,
MedianPruner). 5x5 CV macro F1 improves 0.8466 → 0.8549 (+0.83pp).

Key changes:
- Add char+word TF-IDF with SVD dimensionality reduction to categorizer
- Batch ML inference in predict_with_confidence (avoid per-txn overhead)
- Optuna tuning script with precomputed folds and early stopping
- Update config with tuned params (num_leaves=97, lr=0.043, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move torch, sentence-transformers, peft, mlx-lm, optuna, matplotlib,
and pyarrow out of core dependencies into dependency groups (experiments,
finetune, mlx). Production install drops from 121 to 77 packages.

Run experiment scripts with: uv run --group experiments scripts/tune_lgbm.py

Also removes orphaned duplicate dependencies block under [project.urls].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address Codex review comments on PR #10:
- Return 400 on invalid threshold values (non-numeric, null)
- Check response.ok before showing success in settings UI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@davidchris davidchris changed the title Fix merchant mapper bug + configurable auto-approve threshold Improve ML auto-predictions: bug fixes, feature pipeline, Optuna tuning Feb 16, 2026
@davidchris davidchris merged commit c7854de into main Feb 16, 2026
1 check passed