Improve ML auto-predictions: bug fixes, feature pipeline, Optuna tuning #10
davidchris merged 9 commits into main
Conversation
…hold Fix merchant mapper rules being dead code: confidence was capped at 0.95 but checks used strict > 0.95, so rules never fired. Raise cap to 0.98 and change comparisons to >=. Add configurable auto-approve threshold (default 0.95) via Settings UI slider, backed by new app_settings DB table and GET/PUT /ml/settings API endpoints. Replace all hardcoded 0.95 thresholds in batch-predict, re-predict, and upload flows. Closes #9 (related: creates issue for auto-retrain after reviews) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
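The dead-rule bug can be shown in miniature (the names below are illustrative, not the repo's actual identifiers): capping confidence at the same value the auto-approve check strictly compares against guarantees the check never passes.

```python
# Illustrative sketch of the dead-rule bug: confidence is capped at the same
# value the auto-approve check compares against with a strict '>'.
OLD_CAP = 0.95
THRESHOLD = 0.95

def old_rule_confidence(base: float) -> float:
    return min(OLD_CAP, base)

# No base confidence can ever clear a strict '>' against its own cap.
dead = all(not (old_rule_confidence(b) > THRESHOLD) for b in (0.5, 0.95, 0.99))

# The fix: raise the cap above the threshold and compare inclusively.
NEW_CAP = 0.98

def new_rule_confidence(base: float) -> float:
    return min(NEW_CAP, base)

fires = new_rule_confidence(0.99) >= THRESHOLD
```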
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 368eb8c846
api/ml.py
Outdated
```python
async def update_ml_settings(payload: dict, db: Session = Depends(get_db_session)) -> dict:
    """Update ML settings."""
    if "auto_approve_threshold" in payload:
        value = float(payload["auto_approve_threshold"])
```
Handle invalid threshold payloads with a client error
update_ml_settings casts payload["auto_approve_threshold"] with float(...) but does not catch ValueError/TypeError, so requests like {"auto_approve_threshold": "abc"} or null raise an unhandled exception and return 500 instead of a validation 4xx. This makes malformed client input look like a server outage and breaks callers that rely on proper client-error semantics.
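A minimal sketch of the suggested fix, factored as a plain validator (the helper name and error strings are hypothetical; the real endpoint would map the error case to an HTTP 400 response):

```python
def parse_threshold(raw):
    """Validate an auto_approve_threshold value from a JSON payload.

    Returns (value, None) on success, or (None, error_message) when the
    input should produce a 400 client error instead of a 500.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None, "auto_approve_threshold must be a number"
    if not 0.0 <= value <= 1.0:
        return None, "auto_approve_threshold must be between 0 and 1"
    return value, None
```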
web/pages/settings_page.py
Outdated
```javascript
.then(r => r.json())
.then(data => {
    window._autoApproveThreshold = parseFloat(data.auto_approve_threshold);
    status.textContent = 'Saved! (' + parseFloat(data.auto_approve_threshold).toFixed(2) + ')';
```
Treat non-OK save responses as failures in threshold UI
The save flow always parses JSON and enters the success branch without checking response.ok, so a 4xx/5xx response from /api/ml/settings is still rendered as success. In those cases data.auto_approve_threshold is typically absent, producing NaN and showing a misleading Saved! (NaN) status even though the setting was not persisted.
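One way to sketch the fix is to factor the status-line decision into a pure helper that inspects response.ok and the payload before claiming success (the helper name is hypothetical):

```javascript
// Hypothetical helper: compute the settings status message from the fetch
// Response's ok/status flags and the parsed body, instead of assuming success.
function saveStatusMessage(ok, status, data) {
  if (!ok) {
    return 'Save failed (HTTP ' + status + ')';
  }
  const value = parseFloat(data && data.auto_approve_threshold);
  if (Number.isNaN(value)) {
    return 'Save failed (unexpected response)';
  }
  return 'Saved! (' + value.toFixed(2) + ')';
}
```

The fetch chain would then pass `r.ok` and `r.status` alongside the parsed JSON rather than branching on the body alone.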
Fix four P0 bugs in the ML pipeline:
- Calibration was fitted on test set, inflating reported metrics (now uses train set with internal 5-fold CV)
- Isotonic calibration replaced with sigmoid (Platt scaling), which is stable with small sample sizes (~100-200 txns)
- Ensemble LightGBM was training on full DB instead of the train split, leaking validation data into weight optimization (now uses new fit() method)
- min_samples_per_category raised from 3 to 5 to match 5-fold CV requirement

Adds TransactionCategorizer.fit() for pre-split training without DB access. Adds 7 e2e tests covering train(), fit(), and ensemble pipeline with @pytest.mark.slow marker for CI gating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
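The calibration fix follows the standard sklearn pattern; a minimal sketch on synthetic data (LogisticRegression stands in here for the repo's LightGBM base model):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sigmoid (Platt) calibration fitted with internal 5-fold CV on the TRAIN
# split only; the test split is used for evaluation, never for calibration.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
```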
5x5 repeated stratified k-fold CV on human-reviewed transactions only. Evaluates LightGBM, Naive Bayes, and Ensemble with full metrics suite (Macro F1, Log Loss, Brier, ECE, risk-coverage, Wilson CIs, Nadeau-Bengio). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
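The 5x5 evaluation protocol corresponds to sklearn's RepeatedStratifiedKFold; a sketch of the fold loop with the macro-F1 part of the metrics suite (synthetic data, LogisticRegression as a stand-in model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# 5 splits x 5 repeats = 25 train/test folds, stratified by category.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = []
for train_idx, test_idx in rskf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx]), average="macro"))

mean_f1, std_f1 = np.mean(scores), np.std(scores)
```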
…construction Replace the information-destroying fake probability reconstruction (argmax + uniform residual) with real calibrated probability vectors from LightGBM. Improves ensemble log loss by 9.4% and auto-approve error rate to 0.69%. Removes ~100 lines of dead code (LightGBMWrapper, fake proba converters). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
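Why argmax + uniform residual destroys information: it throws away the ranking among non-top classes, which inflates log loss whenever the second-ranked class turns out to be correct. A toy demonstration of the effect:

```python
import numpy as np
from sklearn.metrics import log_loss

def argmax_uniform(proba: np.ndarray) -> np.ndarray:
    """The old 'fake' reconstruction: keep the top probability and spread
    the residual mass uniformly over the remaining classes."""
    k = proba.shape[1]
    p_top = proba.max(axis=1)
    out = np.repeat(((1.0 - p_top) / (k - 1))[:, None], k, axis=1)
    out[np.arange(len(proba)), proba.argmax(axis=1)] = p_top
    return out

proba = np.array([[0.5, 0.4, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
y = np.array([1, 0, 1])  # the second-ranked class is sometimes the true one

real_ll = log_loss(y, proba, labels=[0, 1, 2])
fake_ll = log_loss(y, argmax_uniform(proba), labels=[0, 1, 2])
# fake_ll > real_ll: the reconstruction is strictly less informative here
```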
Add SepaFieldParser for extracting creditor_id, iban_bank_prefix, and mandate_ref from transaction purpose fields. Integrate into FeatureExtractor to strip SEPA noise from merchant names and text features. Fix creditor ID regex to match "Creditor ID:" label format used in prod data (was only matching CRED+ marker and Gläubiger-ID, finding 0% of transactions). Also fix pre-existing ty type error in ensemble_categorizer.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
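A hedged sketch of the parsing idea; the patterns, function name, and field names below are illustrative, not the repo's actual SepaFieldParser implementation:

```python
import re

# Illustrative patterns: match both the "Creditor ID:" label seen in prod
# data and the older CRED+ / Gläubiger-ID markers.
CREDITOR_ID = re.compile(
    r"(?:Creditor ID:|Gläubiger-ID:?|CRED\+)\s*([A-Z]{2}\d{2}[A-Z0-9]+)",
    re.IGNORECASE,
)
MANDATE_REF = re.compile(r"(?:Mandate Ref:|MREF\+)\s*([A-Za-z0-9./-]+)", re.IGNORECASE)

def parse_sepa_fields(purpose: str) -> tuple[dict, str]:
    """Extract SEPA fields and return them plus the purpose text with the
    SEPA noise stripped, so it doesn't pollute TF-IDF features."""
    fields = {}
    m = CREDITOR_ID.search(purpose)
    if m:
        fields["creditor_id"] = m.group(1)
    m = MANDATE_REF.search(purpose)
    if m:
        fields["mandate_ref"] = m.group(1)
    cleaned = MANDATE_REF.sub(" ", CREDITOR_ID.sub(" ", purpose))
    return fields, " ".join(cleaned.split())
```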
Experiment testing local LLMs (Phi-4-mini 3.8B, Qwen3-8B) as transaction classifiers, both standalone and hybrid with the ML ensemble. Key findings: ensemble (F1=0.867) outperforms best LLM (F1=0.674) at all thresholds. Decision: skip LLM integration. Implementation includes KV cache prefilling for mlx-lm (~6.5x speedup) and Qwen3 thinking mode fix (enable_thinking=False). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tune LightGBM hyperparameters with Optuna (100 trials, TPE sampler, MedianPruner). 5x5 CV macro F1 improves 0.8466 → 0.8549 (+0.83pp).

Key changes:
- Add char+word TF-IDF with SVD dimensionality reduction to categorizer
- Batch ML inference in predict_with_confidence (avoid per-txn overhead)
- Optuna tuning script with precomputed folds and early stopping
- Update config with tuned params (num_leaves=97, lr=0.043, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
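The char+word TF-IDF plus SVD step maps onto a standard sklearn pipeline; a toy-sized sketch (the PR uses n_components=100; the tiny corpus here forces a smaller value):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

text_features = Pipeline([
    ("tfidf", FeatureUnion([
        # Character n-grams catch merchant-name variants; word n-grams
        # catch recurring purpose-text phrases.
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    # Dense, low-dimensional output that gradient-boosted trees handle well.
    ("svd", TruncatedSVD(n_components=4, random_state=0)),
])

docs = [
    "rewe sagt danke",
    "amazon marketplace payment",
    "stadtwerke abschlag strom",
    "rewe markt einkauf",
    "netflix international bv",
    "amazon prime mitgliedschaft",
]
X = text_features.fit_transform(docs)
```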
Move torch, sentence-transformers, peft, mlx-lm, optuna, matplotlib, and pyarrow out of core dependencies into dependency groups (experiments, finetune, mlx). Production install drops from 121 to 77 packages.

Run experiment scripts with: uv run --group experiments scripts/tune_lgbm.py

Also removes an orphaned duplicate dependencies block under [project.urls].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
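The layout corresponds to PEP 735 dependency groups, which uv's --group flag consumes; a sketch with abbreviated package lists (the real ones live in the repo's pyproject.toml):

```toml
[dependency-groups]
experiments = ["optuna", "matplotlib", "pyarrow"]
finetune = ["torch", "sentence-transformers", "peft"]
mlx = ["mlx-lm"]
```

With this layout, `uv run --group experiments scripts/tune_lgbm.py` installs the core dependencies plus the experiments group for that run only.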
Address Codex review comments on PR #10:
- Return 400 on invalid threshold values (non-numeric, null)
- Check response.ok before showing success in settings UI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
End-to-end improvement of the ML categorization pipeline, from critical bug fixes through feature engineering to hyperparameter optimization.
Bug fixes (P0)
- CalibratedClassifierCV was fitted on the test set, inflating reported metrics. Now uses internal CV on training data with sigmoid (Platt) scaling via the new fit() method
- Merchant rule confidence was capped at min(0.95, ...) but checks used strict > 0.95, so all 103 merchant rules never fired. Raised cap to 0.98 and changed comparisons to >=
- Check response.ok in the UI before showing success
Feature pipeline improvements
- Char + word TF-IDF with TruncatedSVD(n_components=100) — replaces the single 500-feature char vectorizer
- SepaFieldParser extracts creditor_id, IBAN prefix, and mandate ref from purpose text. Strips SEPA noise markers from TF-IDF input
- Real calibrated predict_proba() vectors replace the argmax + uniform-residual reconstruction
- predict_with_confidence() batches all transactions instead of making per-transaction model calls
Hyperparameter tuning
- Optuna search (100 trials, TPE sampler, MedianPruner) over precomputed folds with early stopping
- 5x5 CV macro F1 improves 0.8466 → 0.8549 (+0.83pp); tuned params (num_leaves=97, lr=0.043, etc.) committed to config
Infrastructure
- app_settings DB table and GET/PUT /api/ml/settings endpoints
- scripts/establish_baseline.py — 5x5 repeated stratified k-fold CV with full metrics suite (macro F1, log loss, Brier score, ECE, risk-coverage, per-category Wilson CIs)
Results (5x5 CV, 1856 transactions, 26 categories)
Variance reduced across the board. Ensemble log loss improved 9.4% from probability fix.
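The per-category Wilson CIs in the metrics suite are the standard Wilson score interval for a binomial proportion; a self-contained sketch:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a per-category accuracy of successes/n.

    Unlike the normal approximation, it behaves sensibly for the small
    per-category counts a 26-category, 1856-transaction dataset yields.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```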
Test plan
- All tests pass (uv run pytest)
- ruff check, ruff format, ty check all pass
🤖 Generated with Claude Code