From 8ecc14bca9c9ad07c09e106679c192c60080729e Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sun, 19 Apr 2026 22:01:24 +1000 Subject: [PATCH 1/5] feat(databricks-skills): add databricks-mlflow-ml skill for classic ML Fills the gap between databricks-mlflow-evaluation (GenAI agent eval) and databricks-model-serving (real-time endpoints). Covers: - Classic ML model training with MLflow tracking (sklearn / XGBoost / PyTorch) - Experiment creation with UC volume artifact_location (required in UC-enforced workspaces) - Unity Catalog model registration with three-level names - @champion / @challenger alias management - Batch inference via mlflow.pyfunc.load_model (notebook, up to ~10k rows) - Distributed batch via mlflow.pyfunc.spark_udf in Lakeflow SDP pipelines Structure mirrors databricks-mlflow-evaluation: - SKILL.md: workflows + trigger description + quick start - references/GOTCHAS.md: 12 common mistakes with symptoms + fixes - references/CRITICAL-interfaces.md: exact API signatures + models:/ URI format - references/patterns-experiment-setup.md: UC volume artifact_location setup - references/patterns-training.md: logging with signature + input_example - references/patterns-uc-registration.md: register + alias + verify + A/B - references/patterns-batch-inference.md: pyfunc.load_model + spark_udf + ai_query anti-pattern - references/user-journeys.md: 7 end-to-end workflows including debugging Key gotchas covered that other MLflow guides miss: - Experiment creation now requires UC volume artifact_location in UC-enforced workspaces (DBFS root writes are rejected) - mlflow.set_registry_uri('databricks-uc') is required; silent workspace registry fallback is the #1 support question - ai_query does NOT work on custom UC-registered models unless they're deployed to a serving endpoint; use pyfunc.load_model or spark_udf instead - UC aliases (@champion/@challenger) replace deprecated stage transitions (transition_model_version_stage is a no-op on UC models) - 
mlflow.pyfunc.spark_udf must be constructed at module scope in Lakeflow SDP pipelines, not inside the function body Tested against MLflow 2.16+ on Databricks Runtime 15.4 LTS. Content battle- tested in the Coles Vibe Workshop (classic-ML track running in an airgapped environment where online MLflow docs aren't reachable). --- .../databricks-mlflow-ml/SKILL.md | 125 +++++++++ .../references/CRITICAL-interfaces.md | 219 +++++++++++++++ .../references/GOTCHAS.md | 265 ++++++++++++++++++ .../references/patterns-batch-inference.md | 244 ++++++++++++++++ .../references/patterns-experiment-setup.md | 141 ++++++++++ .../references/patterns-training.md | 205 ++++++++++++++ .../references/patterns-uc-registration.md | 232 +++++++++++++++ .../references/user-journeys.md | 195 +++++++++++++ 8 files changed, 1626 insertions(+) create mode 100644 databricks-skills/databricks-mlflow-ml/SKILL.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-training.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/user-journeys.md diff --git a/databricks-skills/databricks-mlflow-ml/SKILL.md b/databricks-skills/databricks-mlflow-ml/SKILL.md new file mode 100644 index 00000000..43d4a2ed --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/SKILL.md @@ -0,0 +1,125 @@ +--- +name: databricks-mlflow-ml +description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. 
Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that." +--- + +# MLflow + Unity Catalog — Classic ML + +## Before Writing Any Code + +1. **Read `GOTCHAS.md`** — 12 common mistakes that cause silent failures or wasted time +2. **Read `CRITICAL-interfaces.md`** — exact API signatures and the `models:/` URI format + +## End-to-End Workflows + +Follow the workflow that matches your goal. Each step indicates which reference files to read. + +### Workflow 1: Train → Register → Batch Score (most common) + +For building a production-shape classic ML model with UC-native lineage. Covers the full path from raw features to predictions in a downstream table. 
+ +| Step | Action | Reference Files | +|------|--------|-----------------| +| 1 | Create experiment with UC volume artifact_location | `patterns-experiment-setup.md` (Pattern 1) | +| 2 | Train model with signature + input_example | `patterns-training.md` (Patterns 1–3) | +| 3 | Register to Unity Catalog with three-level name | `patterns-uc-registration.md` (Patterns 1–2) | +| 4 | Set `@champion` alias | `patterns-uc-registration.md` (Pattern 3) | +| 5 | Verify registration (Navigator check) | `patterns-uc-registration.md` (Pattern 4) + `GOTCHAS.md` #5 | +| 6 | Load + score in notebook (Tier 1) | `patterns-batch-inference.md` (Patterns 1–2) | +| 7 | Optional: Lakeflow SDP batch via `spark_udf` | `patterns-batch-inference.md` (Patterns 3–4) | + +### Workflow 2: Retrain + Promote (A/B pattern) + +For adding a new version of an already-registered model and promoting it without touching downstream loader code. + +| Step | Action | Reference Files | +|------|--------|-----------------| +| 1 | Train new version, log to same UC model name | `patterns-training.md` (Pattern 4) | +| 2 | Register as new version | `patterns-uc-registration.md` (Pattern 2) | +| 3 | Set `@challenger` alias | `patterns-uc-registration.md` (Pattern 3) | +| 4 | Validate `@challenger` predictions vs `@champion` | `patterns-batch-inference.md` (Pattern 5) | +| 5 | Swap aliases (`@challenger` → `@champion`) | `patterns-uc-registration.md` (Pattern 5) | + +Downstream loader code that uses `models:/catalog.schema.model@champion` picks up the new version on next load — no code change needed. + +### Workflow 3: Debugging a Failed Registration or Load + +For the two most common support questions: "why did my model go to workspace registry?" and "why does pyfunc.load_model fail?" 
+
+| Step | Action | Reference Files |
+|------|--------|-----------------|
+| 1 | Verify registry URI is set to `databricks-uc` | `GOTCHAS.md` #1 |
+| 2 | Verify three-level name | `GOTCHAS.md` #2 |
+| 3 | Confirm model appears in Catalog Explorer | `patterns-uc-registration.md` (Pattern 4) |
+| 4 | Check `CREATE MODEL` permissions | `GOTCHAS.md` #7 |
+| 5 | Diagnose load failures | `GOTCHAS.md` #3, #8, #11 |
+
+## Quick Start
+
+The minimum viable path from untrained model to UC-registered, notebook-scored:
+
+```python
+import mlflow
+from mlflow.models import infer_signature
+from mlflow import MlflowClient
+
+# 1. Configure: UC registry + UC volume for artifacts (both required)
+mlflow.set_registry_uri("databricks-uc")
+EXPERIMENT = "/Users/me@company.com/forecasting"
+if mlflow.get_experiment_by_name(EXPERIMENT) is None:
+    # artifact_location is create-time-only: create_experiment, not set_experiment
+    mlflow.create_experiment(
+        EXPERIMENT,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+    )
+mlflow.set_experiment(EXPERIMENT)
+
+# 2. Train + log
+with mlflow.start_run() as run:
+    model.fit(X_train, y_train)
+    signature = infer_signature(X_train, model.predict(X_train[:5]))
+    mlflow.sklearn.log_model(
+        sk_model=model,
+        artifact_path="model",
+        signature=signature,
+        input_example=X_train.iloc[:5],
+    )
+
+# 3. Register + alias
+MODEL_NAME = "my_catalog.my_schema.my_model"
+result = mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
+MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)
+
+# 4. Load + predict (in any notebook, anywhere)
+model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
+predictions = model.predict(X_test)
+```
+
+## Why This Skill Exists
+
+Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. 
The distinction matters because the APIs diverged:
+
+| Skill | Scope | MLflow API Surface |
+|-------|-------|--------------------|
+| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
+| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
+| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |
+
+If you're training a forecasting / classification / regression model, registering it to UC, and scoring it in a notebook or Lakeflow pipeline — this skill. If you're evaluating an LLM agent's output quality — evaluation skill. If you're exposing a model behind an HTTP endpoint — model-serving skill.
+
+## Common Issues
+
+| Issue | Solution |
+|-------|----------|
+| **Model registered but not visible in Catalog Explorer** | Missing `mlflow.set_registry_uri("databricks-uc")`. See `GOTCHAS.md` #1. |
+| **`RestException: INVALID_PARAMETER_VALUE` on `register_model`** | Two-level name used. UC requires `catalog.schema.name`. See `GOTCHAS.md` #2. |
+| **Experiment creation fails with storage errors** | Missing `artifact_location` pointing at a UC volume. See `GOTCHAS.md` #4. |
+| **`PERMISSION_DENIED: CREATE MODEL`** | Pair/user needs `CREATE MODEL ON SCHEMA <schema>`. See `GOTCHAS.md` #7. |
+| **`pyfunc.load_model` returns but `predict()` fails** | Signature wasn't logged; inputs don't coerce. See `GOTCHAS.md` #8. |
+| **Agent proposes `ai_query` for batch inference** | Wrong primitive — that requires a serving endpoint. Use `pyfunc.load_model` or `spark_udf`. See `GOTCHAS.md` #9. 
| 
+
+## Reference Files
+
+- [`GOTCHAS.md`](references/GOTCHAS.md) — 12 common mistakes + fixes
+- [`CRITICAL-interfaces.md`](references/CRITICAL-interfaces.md) — API signatures + `models:/` URI format
+- [`patterns-experiment-setup.md`](references/patterns-experiment-setup.md) — experiment creation with UC volume artifact_location
+- [`patterns-training.md`](references/patterns-training.md) — logging models with signature + input_example + autologging
+- [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion
+- [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger
+- [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points
diff --git a/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md b/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md
new file mode 100644
index 00000000..a40483c4
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md
@@ -0,0 +1,219 @@
+# CRITICAL-interfaces — Exact API signatures
+
+The minimum set of APIs that every classic-ML + UC workflow touches. Copy-pasteable, with the exact arguments that matter.
+
+---
+
+## Registry URI configuration
+
+```python
+mlflow.set_registry_uri("databricks-uc")  # Call at the start of every session
+mlflow.get_registry_uri()                 # Returns "databricks-uc" if set correctly
+```
+
+**Must be called BEFORE** any `register_model` or `load_model` call. Idempotent — safe to repeat.
+
+---
+
+## Experiment creation with UC volume artifact_location
+
+```python
+# artifact_location can only be set at creation time — use create_experiment();
+# set_experiment() cannot set or change it
+mlflow.create_experiment(
+    name="/Users/<you>/<experiment_name>",
+    artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<experiment_name>",
+)
+mlflow.set_experiment("/Users/<you>/<experiment_name>")
+```
+
+**`artifact_location` is required** for UC-enforced workspaces. 
The volume must exist:
+
+```sql
+CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.<volume>;
+```
+
+---
+
+## `models:/` URI format
+
+All load / deploy / spark_udf calls use this URI. **One format to memorize:**
+
+```
+models:/<catalog>.<schema>.<model>@<alias>
+```
+
+Examples:
+```
+models:/my_catalog.my_schema.grocery_forecaster@champion
+models:/my_catalog.my_schema.grocery_forecaster@challenger
+```
+
+**Avoid** these forms (either legacy, or not-UC-native):
+```
+models:/grocery_forecaster/3            # workspace registry, version number
+models:/my_schema.grocery_forecaster/3  # invalid in UC
+```
+
+---
+
+## Model logging (sklearn-flavored)
+
+```python
+mlflow.sklearn.log_model(
+    sk_model=<fitted_model>,
+    artifact_path="model",                   # convention — keep as "model"
+    signature=<signature>,                   # REQUIRED — use infer_signature()
+    input_example=<five_real_rows>,          # REQUIRED — 5 real rows
+    registered_model_name=None,              # leave None; register separately (cleaner)
+    code_paths=<optional_list>,
+    extra_pip_requirements=<optional_list>,  # only if custom deps beyond environment
+)
+```
+
+**Signature inference:**
+```python
+from mlflow.models import infer_signature
+signature = infer_signature(X_train, model.predict(X_train[:5]))
+```
+
+**Other flavors with identical signature:**
+- `mlflow.xgboost.log_model(xgb_model=..., ...)`
+- `mlflow.pytorch.log_model(pytorch_model=..., ...)`
+- `mlflow.tensorflow.log_model(model=..., ...)`
+- `mlflow.pyfunc.log_model(python_model=..., artifact_path=..., ...)` — for custom PythonModel wrappers
+
+---
+
+## Explicit registration
+
+```python
+result = mlflow.register_model(
+    model_uri=f"runs:/{run_id}/model",  # "runs:/<run_id>/<artifact_path>"
+    name="<catalog>.<schema>.<model>",  # three-level, not optional
+    tags=<optional_dict>,
+)
+# result.name: str — fully qualified name
+# result.version: str — newly-created version (e.g., "1", "2")
+```
+
+---
+
+## Alias management
+
+```python
+from mlflow import MlflowClient
+client = MlflowClient()
+
+# Set (creates if missing, moves if exists)
+client.set_registered_model_alias(
+    name="<catalog>.<schema>.<model>",
+    alias="champion",     # or "challenger", or custom
+    version="<version>",  # accepts str or int
+)
+
+# 
Get current alias mapping
+model = client.get_registered_model("<catalog>.<schema>.<model>")
+print(model.aliases)  # {"champion": "3", "challenger": "4"}
+
+# Delete
+client.delete_registered_model_alias(
+    name="<catalog>.<schema>.<model>",
+    alias="challenger",
+)
+```
+
+---
+
+## Loading — notebook / single-node
+
+```python
+model = mlflow.pyfunc.load_model(
+    model_uri="models:/<catalog>.<schema>.<model>@champion",
+)
+
+# Predict on a pandas DataFrame matching the signature
+predictions = model.predict(features_df)
+```
+
+**Returns:** `mlflow.pyfunc.PyFuncModel`, regardless of the original flavor. Inspect `.metadata.signature` for the schema.
+
+---
+
+## Loading — distributed / Lakeflow SDP
+
+```python
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri="models:/<catalog>.<schema>.<model>@champion",
+    result_type="double",  # or e.g. "array<double>" for multi-output
+    env_manager="local",   # "local" | "virtualenv" | "conda"
+)
+
+# Apply to a Spark DataFrame
+df_with_predictions = df.withColumn(
+    "prediction",
+    predict_udf("feature_a", "feature_b", "feature_c"),
+)
+```
+
+**Construct ONCE at module scope** in Lakeflow pipelines. See `GOTCHAS.md` #11.
+
+---
+
+## Model introspection
+
+```python
+from mlflow.models import get_model_info
+
+info = get_model_info("models:/<catalog>.<schema>.<model>@champion")
+info.signature         # ModelSignature with inputs/outputs
+info.flavors           # {"sklearn": {...}, "python_function": {...}}
+info.utc_time_created
+info.model_uuid
+```
+
+Useful when debugging load-vs-predict mismatches.
+
+---
+
+## Run + experiment queries (introspection)
+
+```python
+runs = mlflow.search_runs(
+    experiment_names=["/Users/me@company.com/forecasting"],
+    filter_string="metrics.r2 > 0.8",
+    order_by=["metrics.r2 DESC"],
+    max_results=5,
+)
+# Returns a pandas DataFrame with run_id, metrics, params, etc.
+
+best_run_id = runs.iloc[0]["run_id"]
+```
+
+---
+
+## SQL introspection (UC-native)
+
+```sql
+-- Does the model exist and which aliases are set? 
+DESCRIBE MODEL <catalog>.<schema>.<model>;
+
+-- List all model versions
+SHOW MODEL VERSIONS ON MODEL <catalog>.<schema>.<model>;
+
+-- Check grants
+SHOW GRANTS ON MODEL <catalog>.<schema>.<model>;
+SHOW GRANTS ON SCHEMA <catalog>.<schema>;
+```
+
+---
+
+## What's NOT in this skill
+
+If you see these in code, you're likely in the wrong skill:
+
+| API | Belongs in |
+|-----|------------|
+| `mlflow.genai.evaluate(...)` | `databricks-mlflow-evaluation` |
+| `@scorer` decorator, `GuidelinesJudge`, etc. | `databricks-mlflow-evaluation` |
+| `databricks.sdk.service.serving.EndpointCoreConfigInput` | `databricks-model-serving` |
+| `ai_query('<endpoint>', ...)` | Wrong pattern — use `pyfunc.load_model` or `spark_udf` instead (see `GOTCHAS.md` #9) |
+| `transition_model_version_stage(...)` | Deprecated — use aliases (see `GOTCHAS.md` #6) |
diff --git a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md
new file mode 100644
index 00000000..92615de4
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md
@@ -0,0 +1,265 @@
+# GOTCHAS — Classic ML on MLflow + Unity Catalog
+
+Twelve mistakes that silently waste hours. Read before writing any code.
+
+---
+
+## 1. Missing `mlflow.set_registry_uri("databricks-uc")` → workspace registry
+
+**Symptom:** `register_model` succeeds, but the model doesn't appear in Catalog Explorer. It's in the legacy **workspace registry** (visible under the MLflow icon in the left nav), not Unity Catalog.
+
+**Fix:**
+```python
+import mlflow
+mlflow.set_registry_uri("databricks-uc")  # MUST come before register_model / load_model
+```
+
+**Verification:**
+```python
+assert mlflow.get_registry_uri() == "databricks-uc"
+```
+
+**Why it bites:** defaults still route to the workspace registry for backward compatibility. The only indicator you missed it is a URL that shows `/ml/models/` instead of `/explore/data/models/<catalog>/<schema>/<model>/`.
+
+---
+
+## 2. 
Two-level model names → rejected or wrong registry + +**Symptom:** `RestException: INVALID_PARAMETER_VALUE: Invalid model name`, or the model registers to the workspace registry silently. + +**Fix:** always use three-level names: `catalog.schema.model_name`. + +```python +# WRONG +mlflow.register_model(model_uri, "my_model") +mlflow.register_model(model_uri, "my_schema.my_model") + +# CORRECT +mlflow.register_model(model_uri, "my_catalog.my_schema.my_model") +``` + +**Why it bites:** the error message depends on the registry URI. With UC URI + two-level name → parameter error. With workspace URI + two-level name → registers successfully to workspace (the silently-wrong case). + +--- + +## 3. Loading with version number instead of alias + +**Symptom:** works today, breaks tomorrow when someone registers a new version. You've hard-coded a version number into every downstream consumer. + +**Fix:** load via alias, never version. + +```python +# FRAGILE — every retrain requires updating every loader +model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model/3") + +# STABLE — promote a new version by moving @champion; no loader changes +model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model@champion") +``` + +**Why it bites:** aliases are the UC-native way to decouple loader code from model lifecycle. Version numbers are legacy. New infrastructure (Lakeflow, Genie) assumes alias-based loading. + +--- + +## 4. Experiment creation without UC volume `artifact_location` + +**Symptom:** experiment creates, but any `log_model` call fails with storage / permission errors. Or artifacts land in DBFS root (deprecated) and can't be loaded downstream. + +**Fix:** when you create the experiment, pin it to a UC volume. 
+
+```python
+# Prerequisite: the UC volume must exist
+# CREATE VOLUME my_catalog.my_schema.mlflow_artifacts;
+
+EXPERIMENT = "/Users/me@company.com/forecasting"
+if mlflow.get_experiment_by_name(EXPERIMENT) is None:
+    # artifact_location is create-time-only — set_experiment() cannot set or change it
+    mlflow.create_experiment(
+        EXPERIMENT,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+    )
+mlflow.set_experiment(EXPERIMENT)
+```
+
+**Why it bites:** the default `artifact_location` used to be DBFS root. Unity-Catalog-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors. Pointing at a UC volume makes artifact storage first-class-governed and keeps lineage intact.
+
+**When the experiment already exists without a UC volume:** you can't retroactively change `artifact_location`. Either (a) delete + recreate, or (b) create a new experiment. Don't try to relocate artifacts manually.
+
+---
+
+## 5. Trusting `register_model` success without verifying in UC
+
+**Symptom:** `register_model` returns a `ModelVersion` object. Feels successful. But the model is in workspace registry, or the version number is stale, or an alias wasn't set.
+
+**Fix:** always verify explicitly.
+
+```sql
+-- In a SQL cell or notebook:
+DESCRIBE MODEL my_catalog.my_schema.my_model;
+```
+
+Or via Python:
+```python
+from mlflow import MlflowClient
+model = MlflowClient().get_registered_model("my_catalog.my_schema.my_model")
+assert "champion" in model.aliases, "Missing @champion alias"
+```
+
+Or visually: open Catalog Explorer → `my_catalog` → `my_schema` → **Models** tab. If the model is under MLflow's workspace UI instead, you registered to the wrong place (see #1).
+
+**Why it bites:** `register_model`'s return value only tells you a version was created. It doesn't tell you *where* or *with what aliases*. The Navigator's V-step in pair programming: verify before trusting.
+
+---
+
+## 6. Setting the alias to `"production"` or `"staging"` (legacy MLflow stages)
+
+**Symptom:** you remember MLflow had `stage="Production"` / `"Staging"` transitions. You try the same with aliases and nothing recognizes them. 
+ +**Fix:** UC model aliases are free-form labels. The conventions are `@champion` (current winner) and `@challenger` (under evaluation). MLflow stages are deprecated in the UC registry. + +```python +# WRONG (legacy stage concept) +MlflowClient().set_registered_model_alias(name, "Production", version) + +# CORRECT +MlflowClient().set_registered_model_alias(name, "champion", version) +``` + +**Why it bites:** the old `transition_model_version_stage()` API still exists but is a no-op on UC-registered models. No error, no effect. + +--- + +## 7. Missing `CREATE MODEL ON SCHEMA` permission + +**Symptom:** `RestException: PERMISSION_DENIED: User ... does not have CREATE MODEL permission`. + +**Fix:** grant the permission at the schema level. + +```sql +GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`; +-- Or for a group: +GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `data-science-team`; +``` + +**Why it bites:** workspace admins often assume `USE SCHEMA` covers model registration. It doesn't — `CREATE MODEL` is a separate UC privilege that must be granted explicitly. + +**Verification:** +```sql +SHOW GRANTS ON SCHEMA my_catalog.my_schema; +``` + +--- + +## 8. Logging a model without `signature` or `input_example` + +**Symptom:** `mlflow.pyfunc.load_model(...)` returns an object, but `.predict(spark_df)` raises cryptic coercion errors. Or predictions silently cast (int → float, string → category) and produce wrong numbers. + +**Fix:** always log both. 
+ +```python +from mlflow.models import infer_signature + +signature = infer_signature(X_train, model.predict(X_train[:5])) +mlflow.sklearn.log_model( + sk_model=model, + artifact_path="model", + signature=signature, + input_example=X_train.iloc[:5], # 5 real rows for the pyfunc wrapper to introspect +) +``` + +**Why it bites:** without a signature, the pyfunc wrapper can't coerce inputs — it accepts whatever you pass, then downstream operations (especially `spark_udf`) fail or produce wrong results. `input_example` is what `pyfunc.load_model` reads to build the wrapper's input coercer. + +--- + +## 9. `ai_query` used for batch inference on a custom UC model + +**Symptom:** you want batch inference on your custom-registered model. You see `ai_query()` in Genie docs and assume it works. It doesn't (for custom models) — `ai_query` only invokes **serving endpoints**, and your UC-registered model isn't behind one unless you deployed a serving endpoint for it. + +**Fix:** for batch inference, use `pyfunc.load_model` (notebook) or `pyfunc.spark_udf` (Lakeflow SDP pipeline). + +```python +# WRONG for custom UC models — requires a serving endpoint +spark.sql(f"SELECT ai_query('{MODEL_NAME}', features) FROM silver_features") + +# CORRECT — notebook batch (single node) +model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") +predictions = model.predict(features_pandas_df) + +# CORRECT — Lakeflow SDP batch (distributed) +predict_udf = mlflow.pyfunc.spark_udf(spark, f"models:/{MODEL_NAME}@champion", result_type="double") +silver_features.withColumn("prediction", predict_udf(*feature_cols)) +``` + +**Why it bites:** `ai_query` *is* the right call for Foundation Model API endpoints (`ai_query('databricks-dbrx-instruct', prompt)`). The naming overlap leads to wrong assumptions for custom models. + +--- + +## 10. Trying to delete / re-register a model at the same version number + +**Symptom:** `RestException: ALREADY_EXISTS` when re-registering. 
You can't reuse version numbers. + +**Fix:** UC versions are monotonically-increasing and immutable. To supersede a bad version, register a new version and move `@champion` to it. The old version stays in history for lineage. + +```python +new_result = mlflow.register_model(new_run_uri, MODEL_NAME) +MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", new_result.version) +# Old version is still there; that's correct. Lineage preserved. +``` + +**Why it bites:** habits from the workspace registry (where deletion was forgiving) don't transfer. UC treats model versions as first-class auditable artifacts. + +--- + +## 11. `pyfunc.spark_udf` constructed inside a function call + +**Symptom:** in a Lakeflow SDP `@dp.materialized_view`, the UDF is constructed every time the view evaluates — slow and sometimes fails with serialization errors. + +**Fix:** construct the UDF at module scope, reuse it inside the view. + +```python +import mlflow +import databricks.declarative_pipelines as dp + +# Construct ONCE, at module scope +mlflow.set_registry_uri("databricks-uc") +predict_udf = mlflow.pyfunc.spark_udf( + spark, + f"models:/{MODEL_NAME}@champion", + result_type="double", +) + +@dp.materialized_view +def gold_forecast(): + return spark.read.table("silver_features").withColumn( + "prediction", + predict_udf("feat_a", "feat_b", "feat_c"), + ) +``` + +**Why it bites:** Lakeflow SDP may evaluate the function definition multiple times. Model deserialization is expensive — don't repeat it. + +--- + +## 12. Custom preprocessing not captured in the logged model + +**Symptom:** in the training notebook, predictions are accurate. After `pyfunc.load_model(...)`, predictions are garbage. The pipeline works in training because you're calling `scaler.transform()` manually; at inference time, nobody calls the scaler. + +**Fix:** wrap preprocessing + model in an `sklearn.pipeline.Pipeline` (or a custom `PythonModel` for non-sklearn preprocessing). Log the whole pipeline. 
+
+```python
+import mlflow
+from mlflow.models import infer_signature
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.ensemble import GradientBoostingRegressor
+
+pipeline = Pipeline([
+    ("scaler", StandardScaler()),
+    ("model", GradientBoostingRegressor()),
+])
+pipeline.fit(X_train, y_train)
+
+# Logs both the fitted scaler AND the model as a single artifact
+mlflow.sklearn.log_model(
+    sk_model=pipeline,
+    artifact_path="model",
+    signature=infer_signature(X_train, pipeline.predict(X_train[:5])),
+    input_example=X_train.iloc[:5],
+)
+```
+
+**Why it bites:** the most painful post-registration bug. Training and inference code paths are different files; the divergence is invisible until predictions are obviously wrong.
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md b/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md
new file mode 100644
index 00000000..ed4d86ae
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md
@@ -0,0 +1,244 @@
+# patterns-batch-inference
+
+Loading a UC-registered model and scoring features in batch. Two scales — interactive notebook (Patterns 1–2) and distributed Lakeflow pipeline (Patterns 3–4). Plus A/B validation (Pattern 5) and streaming (Pattern 6).
+
+---
+
+## Pattern 1: Notebook batch inference — pandas path
+
+For interactive exploration, ad-hoc scoring, and sample sizes up to ~10k rows. 
+ +```python +import mlflow + +mlflow.set_registry_uri("databricks-uc") + +model = mlflow.pyfunc.load_model( + "models:/my_catalog.my_schema.grocery_forecaster@champion" +) + +# Load a sample of features (LIMIT in SQL to avoid loading full table) +features = ( + spark.table("my_catalog.my_schema.silver_features") + .orderBy("month_date") + .limit(1000) + .toPandas() +) + +# The model's signature determines which columns it expects +feature_cols = model.metadata.get_input_schema().input_names() + +predictions = model.predict(features[feature_cols]) + +# Attach predictions for display/export +features["prediction"] = predictions +display(spark.createDataFrame(features)) +``` + +--- + +## Pattern 2: Notebook batch inference with chart + +Same pattern, adds a predicted-vs-actual visual. Useful as a demo artifact. + +```python +import matplotlib.pyplot as plt + +# (continuing from Pattern 1) +features_with_pred = features.sort_values("month_date") + +fig, ax = plt.subplots(figsize=(10, 5)) +ax.plot(features_with_pred["month_date"], features_with_pred["actual"], + label="Actual", linewidth=2) +ax.plot(features_with_pred["month_date"], features_with_pred["prediction"], + label="Predicted", linestyle="--", linewidth=2) +ax.set_xlabel("Month") +ax.set_ylabel("Turnover (millions)") +ax.set_title(f"Forecast — {model.metadata.run_id[:8]}") +ax.legend() +plt.xticks(rotation=45) +plt.tight_layout() +display(fig) +``` + +--- + +## Pattern 3: Lakeflow SDP batch via `spark_udf` + +For scheduled batch inference at scale. Distributes across Spark executors — no per-row Python overhead, no serving endpoint. 
+ +```python +# src/gold/gold_forecast.py +import mlflow +import databricks.declarative_pipelines as dp + +# Construct the UDF ONCE at module scope — see GOTCHAS #11 +mlflow.set_registry_uri("databricks-uc") + +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" +predict_udf = mlflow.pyfunc.spark_udf( + spark, + model_uri=f"models:/{MODEL_NAME}@champion", + result_type="double", + env_manager="local", # "local" avoids conda/virtualenv setup overhead +) + +@dp.materialized_view( + comment="Grocery turnover forecast from @champion model", +) +def gold_forecast(): + return ( + spark.read.table("my_catalog.my_schema.silver_features") + .withColumn( + "forecast_turnover_millions", + predict_udf( + "turnover_lag_1", + "turnover_lag_12", + "rolling_3m_avg", + "state_share_of_national", + # ... pass each signature input column in the order the signature declares + ), + ) + ) +``` + +**What this gives you:** +- A `gold_forecast` table that refreshes on every pipeline run +- Distributed scoring (no serving endpoint, no auth token) +- Full UC lineage: `silver_features` → `gold_forecast` via `grocery_forecaster@champion` +- Genie can query it: *"what's the forecast for each state next month?"* + +--- + +## Pattern 4: `spark_udf` with `result_type` for multi-output models + +Multi-output regressors or classifiers need a richer result type. 
+
+```python
+from pyspark.sql.types import ArrayType, DoubleType, StringType, StructType, StructField
+
+# Multi-output regression — model returns 2 predictions per row
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri=f"models:/{MODEL_NAME}@champion",
+    result_type=ArrayType(DoubleType()),
+)
+
+# Classifier with probabilities
+predict_udf = mlflow.pyfunc.spark_udf(
+    spark,
+    model_uri=f"models:/{MODEL_NAME}@champion",
+    result_type=StructType([
+        StructField("class", StringType(), True),
+        StructField("confidence", DoubleType(), True),
+    ]),
+)
+```
+
+---
+
+## Pattern 5: A/B validation — compare `@challenger` vs `@champion`
+
+Run both models on a validation set, compare error metrics, decide whether to promote.
+
+```python
+import mlflow
+from sklearn.metrics import mean_absolute_error, root_mean_squared_error
+
+mlflow.set_registry_uri("databricks-uc")
+MODEL_NAME = "my_catalog.my_schema.grocery_forecaster"
+
+champion = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
+challenger = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@challenger")
+
+# Hold-out validation set (not seen during training)
+validation = spark.table(f"{MODEL_NAME.rsplit('.', 1)[0]}.validation_features").toPandas()
+feature_cols = champion.metadata.get_input_schema().input_names()
+actuals = validation["turnover_millions"]
+
+champion_preds = champion.predict(validation[feature_cols])
+challenger_preds = challenger.predict(validation[feature_cols])
+
+# Compute each RMSE once; reuse in the report and the decision below
+champion_rmse = root_mean_squared_error(actuals, champion_preds)
+challenger_rmse = root_mean_squared_error(actuals, challenger_preds)
+print(f"Champion RMSE:   {champion_rmse:.2f}")
+print(f"Challenger RMSE: {challenger_rmse:.2f}")
+print(f"Champion MAE:    {mean_absolute_error(actuals, champion_preds):.2f}")
+print(f"Challenger MAE:  {mean_absolute_error(actuals, challenger_preds):.2f}")
+
+# Decision logic — promote if challenger beats champion by >2%
+if challenger_rmse < champion_rmse * 0.98:
+    print("→ Promote @challenger. 
See patterns-uc-registration.md Pattern 5.") +else: + print("→ Keep @champion. Delete @challenger.") +``` + +--- + +## Pattern 6: Structured streaming inference + +For models scoring events as they arrive (not batch-scheduled). + +```python +from pyspark.sql.functions import col + +predict_udf = mlflow.pyfunc.spark_udf( + spark, + model_uri=f"models:/{MODEL_NAME}@champion", + result_type="double", +) + +events = ( + spark.readStream + .format("delta") + .table("my_catalog.my_schema.silver_events") +) + +scored = events.withColumn( + "prediction", + predict_udf(*[col(c) for c in feature_cols]), +) + +( + scored.writeStream + .format("delta") + .outputMode("append") + .option("checkpointLocation", "dbfs:/Volumes/my_catalog/my_schema/checkpoints/scoring") + .toTable("my_catalog.my_schema.gold_scored_events") +) +``` + +For most classic-ML batch use cases, Pattern 3 (Lakeflow SDP) is simpler. Use streaming only when event-time scoring matters. + +--- + +## What NOT to do for batch inference + +### Do not use `ai_query` for custom UC models + +`ai_query('', )` requires the model to be deployed as a **Model Serving endpoint**. UC-registered models are NOT automatically behind an endpoint. Use `pyfunc.load_model` (Pattern 1) or `pyfunc.spark_udf` (Pattern 3) instead. + +`ai_query` IS the right call for: +- Foundation Model API endpoints: `ai_query('databricks-dbrx-instruct', prompt)` +- Model Serving endpoints you've explicitly provisioned + +See `GOTCHAS.md` #9. + +### Do not use `mlflow.pyfunc.load_model` for billion-row batches on a single node + +Pattern 1 collects to pandas — fine up to ~10k rows, painful beyond ~100k, impossible for millions. For distributed scale, use Pattern 3 (`spark_udf`). + +### Do not construct `spark_udf` inside the function body + +See `GOTCHAS.md` #11. Construct once at module scope, reuse inside `@dp.materialized_view` / `@dp.table`. 
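The row-count guidance above can be condensed into a rule-of-thumb picker. A minimal sketch (the thresholds are this guide's heuristics, not hard MLflow limits, and the helper name is ours):

```python
def pick_batch_strategy(row_count: int) -> str:
    """Map expected batch size to the loading pattern described above.

    Thresholds are heuristics from this guide, not hard limits.
    """
    if row_count <= 10_000:
        return "pyfunc.load_model"   # Pattern 1: single-node pandas scoring
    return "pyfunc.spark_udf"        # Pattern 3: distributed scoring (Lakeflow SDP)


print(pick_batch_strategy(5_000))       # small notebook batch
print(pick_batch_strategy(50_000_000))  # nightly full-table scoring
```

The only input is the expected batch size; everything else (model URI, alias) is identical between the two patterns, which is what makes swapping them cheap later.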
+
+---
+
+## Troubleshooting batch inference
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `RESOURCE_DOES_NOT_EXIST` on load | Wrong registry URI or two-level name | `GOTCHAS.md` #1, #2 |
+| Predictions are NaN | Input columns in wrong order | Pass columns in the order `model.metadata.get_input_schema().input_names()` declares |
+| `PERMISSION_DENIED: EXECUTE ON MODEL` | No read access to model | `GRANT EXECUTE ON MODEL ... TO <principal>` |
+| `spark_udf` raises `PicklingError` | Model has un-picklable state (e.g., Spark session) | Re-train ensuring the model is pure Python/numpy — don't capture `spark` at training time |
+| Pipeline hangs on `gold_forecast` | Model artifact is large; first load is slow | Normal — subsequent runs are fast (UDF is cached per executor) |
+| Column type mismatch in Spark | UDF expects double; column is int/string | Cast explicitly: `col("feature").cast("double")` |
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md b/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md
new file mode 100644
index 00000000..00c6e2ba
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md
@@ -0,0 +1,141 @@
+# patterns-experiment-setup
+
+Experiments in UC-enforced workspaces need more setup than older MLflow guides show. The critical change: you must pin the experiment's `artifact_location` to a Unity Catalog volume, or `log_model` will fail with storage errors.
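Every example in this file follows the same `dbfs:/Volumes/<catalog>/<schema>/<volume>/<experiment-leaf>` layout. A tiny helper keeps that convention consistent across teams; the naming scheme is this guide's convention, not an MLflow requirement, and the function is illustrative:

```python
def uc_artifact_location(catalog: str, schema: str, experiment_path: str,
                         volume: str = "mlflow_artifacts") -> str:
    """Build a per-experiment artifact folder under a UC volume (convention only)."""
    leaf = experiment_path.rstrip("/").rsplit("/", 1)[-1]
    return f"dbfs:/Volumes/{catalog}/{schema}/{volume}/{leaf}"


print(uc_artifact_location("my_catalog", "my_schema", "/Users/me@company.com/forecasting"))
# → dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting
```

Deriving the leaf from the experiment path means two experiments never share an artifact folder, which keeps cleanup and UC lineage per-experiment.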
+
+---
+
+## Pattern 1: Create experiment with UC volume artifact_location
+
+```python
+import mlflow
+
+mlflow.set_registry_uri("databricks-uc")  # always first
+
+# Prerequisite: the UC volume must exist
+# CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts;
+
+# artifact_location is settable only at creation time; mlflow.set_experiment()
+# does not accept it. Create first, then select:
+EXPERIMENT = "/Users/me@company.com/forecasting"
+if mlflow.get_experiment_by_name(EXPERIMENT) is None:
+    mlflow.create_experiment(
+        name=EXPERIMENT,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+    )
+mlflow.set_experiment(experiment_name=EXPERIMENT)
+```
+
+**Why both are required:**
+- `experiment_name` — the workspace-visible path (browsable from the Experiments UI)
+- `artifact_location` — where logged artifacts (model binaries, plots, datasets) physically live
+
+In older workspaces, `artifact_location` defaulted to DBFS root. UC-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors like:
+
+```
+MlflowException: API request to endpoint /api/2.0/mlflow/runs/log-artifact failed
+with error code 403 != 200. Response body: PERMISSION_DENIED ...
+```
+
+Pointing at a UC volume resolves this AND makes artifacts first-class-governed under UC lineage.
+
+---
+
+## Pattern 2: Create the volume if it doesn't exist (idempotent)
+
+Run once per schema, before any experiment creation:
+
+```python
+spark.sql("""
+    CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts
+    COMMENT 'MLflow experiment artifacts for forecasting models'
+""")
+```
+
+Or via SQL editor:
+
+```sql
+CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts;
+```
+
+**Permissions needed:** `USE SCHEMA` + `CREATE VOLUME`. If missing, request `CREATE VOLUME ON SCHEMA my_catalog.my_schema` from the schema owner.
+
+---
+
+## Pattern 3: Experiment already exists, wrong `artifact_location`
+
+You can't retroactively change `artifact_location`. Three options, in order of preference:
+
+**Option A — New experiment** (cleanest, keeps old runs intact):
+```python
+NEW_EXPERIMENT = "/Users/me@company.com/forecasting_v2"  # v2 suffix
+if mlflow.get_experiment_by_name(NEW_EXPERIMENT) is None:
+    mlflow.create_experiment(
+        name=NEW_EXPERIMENT,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting_v2",
+    )
+mlflow.set_experiment(experiment_name=NEW_EXPERIMENT)
+# New runs land in v2. Old runs stay in v1 (archive them if you like).
+```
+
+**Option B — Delete + recreate** (loses history; use only if no good runs exist):
+```python
+from mlflow import MlflowClient
+client = MlflowClient()
+
+exp = client.get_experiment_by_name("/Users/me@company.com/forecasting")
+client.delete_experiment(exp.experiment_id)
+
+# NOTE: recreating under the same name fails while the deleted experiment is
+# still in the trash; permanently purge it first (UI or API) if so.
+mlflow.create_experiment(
+    name="/Users/me@company.com/forecasting",
+    artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
+)
+mlflow.set_experiment(experiment_name="/Users/me@company.com/forecasting")
+```
+
+**Option C — Manual relocation of DBFS artifacts to UC volume**: do not do this. Storage paths are resolved at log time and encoded in the run's metadata; moving files doesn't update the pointers.
+
+---
+
+## Pattern 4: Verify experiment is correctly configured
+
+After setup, before training:
+
+```python
+exp = mlflow.get_experiment_by_name("/Users/me@company.com/forecasting")
+assert exp is not None, "Experiment not created"
+assert exp.artifact_location.startswith("dbfs:/Volumes/"), (
+    f"artifact_location is not a UC volume: {exp.artifact_location}"
+)
+print(f"Experiment ID: {exp.experiment_id}")
+print(f"Artifact location: {exp.artifact_location}")
+```
+
+If the assert fails, you have an old experiment pointing at DBFS root. Apply Pattern 3.
+
+---
+
+## Pattern 5: Workspace-path vs Repo-path experiments
+
+MLflow accepts two conventions for `experiment_name`:
+
+```python
+# Workspace-path convention (recommended for collaborative experiments)
+mlflow.set_experiment(experiment_name="/Users/me@company.com/forecasting")
+
+# Repo-path convention (only if you're running from a Git folder)
+mlflow.set_experiment(experiment_name="/Repos/me@company.com/my-repo/forecasting")
+```
+
+**Prefer workspace path** for experiments shared across pairs/teams. Repo-path experiments become orphans when the repo is deleted.
+
+**Both need `artifact_location` pointing at a UC volume.** The path convention only affects where the experiment metadata is browsable, not where artifacts live.
+
+---
+
+## Pattern 6: Running from a notebook cell with autoselected experiment
+
+Databricks notebooks auto-associate runs with an experiment matching the notebook's workspace path:
+
+```python
+# In a notebook at /Users/me@company.com/Notebooks/train.py
+# Databricks will auto-set experiment_name to the notebook path
+# BUT the default artifact_location is still DBFS root; create the experiment
+# with a UC volume artifact_location first, then select it:
+
+NOTEBOOK_EXPERIMENT = "/Users/me@company.com/Notebooks/train"
+if mlflow.get_experiment_by_name(NOTEBOOK_EXPERIMENT) is None:
+    mlflow.create_experiment(
+        name=NOTEBOOK_EXPERIMENT,
+        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/train",
+    )
+mlflow.set_experiment(experiment_name=NOTEBOOK_EXPERIMENT)
+```
+
+Or call `set_experiment` explicitly before the first `start_run` — the artifact_location fix must be applied regardless of notebook auto-association.
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-training.md b/databricks-skills/databricks-mlflow-ml/references/patterns-training.md
new file mode 100644
index 00000000..017e3cfb
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-training.md
@@ -0,0 +1,205 @@
+# patterns-training
+
+How to log classic ML models (sklearn / XGBoost / PyTorch) so they register cleanly and load correctly downstream. The two load-bearing decisions: `signature` and `input_example`.
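Before the patterns, a sketch of why the signature is load-bearing: conceptually it records ordered column names and dtypes, which is what lets pyfunc validate and realign caller input. The toy frame below illustrates the idea and is not MLflow's internal representation:

```python
import pandas as pd

X_train = pd.DataFrame({
    "turnover_lag_1": [1.0, 2.0],
    "rolling_3m_avg": [1.5, 1.8],
})

# What infer_signature captures, in spirit: ordered (column, dtype) pairs
recorded = [(c, str(t)) for c, t in X_train.dtypes.items()]
assert recorded == [("turnover_lag_1", "float64"), ("rolling_3m_avg", "float64")]

# At inference, input arriving with columns in a different order can be
# realigned from the recorded schema; without a signature, nothing does this
incoming = pd.DataFrame({"rolling_3m_avg": [2.0], "turnover_lag_1": [3.0]})
aligned = incoming[[c for c, _ in recorded]]
assert list(aligned.columns) == ["turnover_lag_1", "rolling_3m_avg"]
```

Skip the signature and a column-order mismatch at inference goes undetected, which is exactly the "NaN predictions" row in the batch-inference troubleshooting table.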
+
+---
+
+## Pattern 1: Baseline sklearn training loop
+
+```python
+import mlflow
+import mlflow.sklearn
+from sklearn.ensemble import GradientBoostingRegressor
+from sklearn.metrics import root_mean_squared_error, mean_absolute_error
+from sklearn.model_selection import train_test_split
+from mlflow.models import infer_signature
+
+mlflow.set_registry_uri("databricks-uc")
+# Experiment must already exist with a UC volume artifact_location;
+# see patterns-experiment-setup.md Pattern 1 (create_experiment, then set_experiment)
+mlflow.set_experiment(experiment_name="/Users/me@company.com/forecasting")
+
+X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
+
+with mlflow.start_run(run_name="gbr_baseline"):
+    model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
+    model.fit(X_train, y_train)
+
+    # Signature + input_example are both load-bearing
+    signature = infer_signature(X_train, model.predict(X_train[:5]))
+
+    mlflow.sklearn.log_model(
+        sk_model=model,
+        artifact_path="model",
+        signature=signature,
+        input_example=X_train.iloc[:5],
+    )
+
+    # Log everything needed to reproduce
+    mlflow.log_params({"n_estimators": 100, "max_depth": 3})
+    predictions = model.predict(X_test)
+    mlflow.log_metrics({
+        "rmse": root_mean_squared_error(y_test, predictions),
+        "mae": mean_absolute_error(y_test, predictions),
+    })
+```
+
+---
+
+## Pattern 2: Preprocessing + model as a Pipeline
+
+Always log preprocessing alongside the model. See `GOTCHAS.md` #14 — inference-time preprocessing drift is the most painful post-registration bug.
+ +```python +from sklearn.pipeline import Pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.compose import ColumnTransformer + +numeric_features = ["turnover_lag_1", "turnover_lag_12", "rolling_3m_avg"] +categorical_features = ["state", "industry"] + +preprocessor = ColumnTransformer([ + ("num", StandardScaler(), numeric_features), + ("cat", "passthrough", categorical_features), # handle in the model if needed +]) + +pipeline = Pipeline([ + ("preprocessor", preprocessor), + ("model", GradientBoostingRegressor(n_estimators=100)), +]) + +with mlflow.start_run(): + pipeline.fit(X_train, y_train) + + signature = infer_signature(X_train, pipeline.predict(X_train[:5])) + mlflow.sklearn.log_model( + sk_model=pipeline, # logs both preprocessor AND model as one artifact + artifact_path="model", + signature=signature, + input_example=X_train.iloc[:5], + ) +``` + +At inference time, callers never need to know about `StandardScaler` — they pass raw features, `pyfunc.load_model` dispatches through the pipeline. + +--- + +## Pattern 3: XGBoost / PyTorch — same interface, different flavor + +```python +# XGBoost +import mlflow.xgboost +import xgboost as xgb + +model = xgb.XGBRegressor(n_estimators=100, max_depth=3) +model.fit(X_train, y_train) + +with mlflow.start_run(): + mlflow.xgboost.log_model( + xgb_model=model, + artifact_path="model", + signature=infer_signature(X_train, model.predict(X_train[:5])), + input_example=X_train.iloc[:5], + ) + +# PyTorch +import mlflow.pytorch +import torch + +class Forecaster(torch.nn.Module): + ... + +model = Forecaster() +# ... training loop ... 
+ +with mlflow.start_run(): + # For PyTorch, input_example must be a tensor or numpy array + example = X_train.iloc[:5].to_numpy() + mlflow.pytorch.log_model( + pytorch_model=model, + artifact_path="model", + signature=infer_signature(example, model(torch.tensor(example)).detach().numpy()), + input_example=example, + ) +``` + +--- + +## Pattern 4: Retraining — same experiment, new run + +Retraining for an A/B test or a scheduled refresh. Log to the same experiment; register as a new version in Workflow 2. + +```python +with mlflow.start_run(run_name="gbr_v2_with_seasonality") as run: + model = GradientBoostingRegressor(n_estimators=200, max_depth=4) + model.fit(X_train_with_seasonality, y_train) + + mlflow.sklearn.log_model( + sk_model=model, + artifact_path="model", + signature=infer_signature(X_train_with_seasonality, + model.predict(X_train_with_seasonality[:5])), + input_example=X_train_with_seasonality.iloc[:5], + ) + # Remember the run_id for the register step + print(f"New run: {run.info.run_id}") +``` + +--- + +## Pattern 5: Autologging (quick path for iteration) + +Autologging wraps `fit()` and logs params + metrics + model automatically. Convenient during experimentation; less explicit than manual logging. + +```python +mlflow.sklearn.autolog( + log_models=True, + log_input_examples=True, # IMPORTANT — otherwise no input_example is captured + log_model_signatures=True, # IMPORTANT — otherwise no signature is captured + silent=False, +) + +# Any subsequent fit() call auto-logs +model = GradientBoostingRegressor(n_estimators=100) +model.fit(X_train, y_train) +# Autolog handled the MLflow calls +``` + +**Caveat:** autologging infers signature + input_example heuristically. For production runs, prefer manual logging (Pattern 1) — you control what gets captured. 
+
+---
+
+## Pattern 6: Searching runs to pick the best one for registration
+
+Before registering, you typically want the best run from an experiment:
+
+```python
+runs = mlflow.search_runs(
+    experiment_names=["/Users/me@company.com/forecasting"],
+    filter_string="metrics.rmse < 100 AND tags.mlflow.runName LIKE 'gbr_%'",
+    order_by=["metrics.rmse ASC"],
+    max_results=1,
+)
+
+if runs.empty:
+    raise RuntimeError("No runs match criteria")
+
+best_run_id = runs.iloc[0]["run_id"]
+best_rmse = runs.iloc[0]["metrics.rmse"]
+print(f"Best run: {best_run_id} (RMSE={best_rmse:.2f})")
+
+# Now register this run's model — see patterns-uc-registration.md Pattern 1
+```
+
+---
+
+## Common logging mistakes
+
+| Mistake | Effect | Fix |
+|---------|--------|-----|
+| No `signature` | `pyfunc.load_model` works, but `.predict()` may silently mis-coerce inputs | Always call `infer_signature(X_train, y_hat[:5])` |
+| No `input_example` | `pyfunc.load_model` can't introspect input schema | Pass `X_train.iloc[:5]` (or `.to_numpy()[:5]` for non-pandas) |
+| `artifact_path` changes between logs | Same model name → different paths → broken load URIs | Always use `artifact_path="model"` |
+| Log preprocessing separately | Inference callers must reapply preprocessing manually | Wrap in a sklearn `Pipeline` and log the pipeline |
+| Use `pickle.dump` directly | Loses MLflow's flavor dispatch | Always use `mlflow.<flavor>.log_model` |
diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md b/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md
new file mode 100644
index 00000000..4d8929ed
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md
@@ -0,0 +1,232 @@
+# patterns-uc-registration
+
+Register a logged model to Unity Catalog, set aliases, verify, and handle promotion / rollback.
+
+---
+
+## Pattern 1: Explicit register from a specific run
+
+Cleanest workflow.
Train (separate step) → pick best run → register. + +```python +import mlflow +from mlflow import MlflowClient + +mlflow.set_registry_uri("databricks-uc") + +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" + +# run_id from a specific training run (see patterns-training.md Pattern 6) +run_id = "abc123def456" + +result = mlflow.register_model( + model_uri=f"runs:/{run_id}/model", + name=MODEL_NAME, + tags={ + "trained_by": "forecasting_team", + "dataset_version": "2024-Q4", + }, +) +print(f"Registered {MODEL_NAME} version {result.version}") +``` + +`result` is a `ModelVersion` object: +- `result.name` — fully qualified three-level name +- `result.version` — the new version (string, e.g., `"3"`) +- `result.status` — should be `"READY"` by the time this returns + +--- + +## Pattern 2: Log-and-register in one call + +Shorter but couples logging and registration. Use when you *know* the current run is the one worth registering. + +```python +with mlflow.start_run(): + model.fit(X_train, y_train) + mlflow.sklearn.log_model( + sk_model=model, + artifact_path="model", + signature=infer_signature(X_train, model.predict(X_train[:5])), + input_example=X_train.iloc[:5], + registered_model_name="my_catalog.my_schema.grocery_forecaster", + ) + # Model is registered as a new version; you still need to set alias separately. +``` + +**Still need a separate alias call** — `log_model` doesn't set aliases. + +--- + +## Pattern 3: Set aliases (`@champion`, `@challenger`) + +Aliases decouple the loader from the version. Moving `@champion` to a new version silently updates every `models:/...@champion` loader. + +```python +from mlflow import MlflowClient +client = MlflowClient() + +# Set or move an alias +client.set_registered_model_alias( + name="my_catalog.my_schema.grocery_forecaster", + alias="champion", + version=result.version, +) +``` + +**Conventions:** +- `@champion` — the current production winner. Exactly one version at a time. 
- `@challenger` — a candidate under evaluation. Exactly one at a time.
+- Custom aliases — free-form, e.g., `@pair_team_07`, `@nightly`, `@reviewed`.
+
+**Read existing aliases:**
+```python
+model = client.get_registered_model("my_catalog.my_schema.grocery_forecaster")
+print(model.aliases)  # e.g., {"champion": "3", "challenger": "4"}
+```
+
+**Delete an alias:**
+```python
+client.delete_registered_model_alias(
+    name="my_catalog.my_schema.grocery_forecaster",
+    alias="challenger",
+)
+```
+
+---
+
+## Pattern 4: Verify registration (Navigator's V-step)
+
+Don't trust `register_model`'s success message alone. See `GOTCHAS.md` #5.
+
+### Via SQL
+
+```sql
+DESCRIBE MODEL my_catalog.my_schema.grocery_forecaster;
+```
+
+Expected output includes the model metadata and (if set) aliases. If the result is "table or view not found," the model didn't register to UC — check `set_registry_uri` (GOTCHAS #1).
+
+### Via Catalog Explorer UI
+
+1. Open Catalog Explorer
+2. Navigate to `my_catalog` → `my_schema` → **Models** tab
+3. Confirm `grocery_forecaster` appears with an `@champion` badge
+
+If the model appears under the workspace MLflow icon instead (left sidebar, under MLflow), you registered to the workspace registry. See GOTCHAS #1.
+
+### Via Python assertion (scriptable)
+
+```python
+from mlflow import MlflowClient
+client = MlflowClient()
+
+MODEL_NAME = "my_catalog.my_schema.grocery_forecaster"
+
+# get_registered_model raises a RestException if the model doesn't exist in UC
+model = client.get_registered_model(MODEL_NAME)
+
+# NOTE: don't rely on model.latest_versions; it is stage-based and not
+# populated for UC models. Count versions via search_model_versions instead:
+versions = client.search_model_versions(f"name='{MODEL_NAME}'")
+
+# Assertions that should always hold post-registration
+assert len(versions) > 0, "No versions exist"
+assert "champion" in model.aliases, "@champion alias not set"
+print(f"✓ {model.name} v{model.aliases['champion']} is @champion")
+```
+
+---
+
+## Pattern 5: A/B promotion — swap `@challenger` to `@champion`
+
+You've trained a new version, registered it, and validated its predictions against the current champion.
Now promote: + +```python +client = MlflowClient() +MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" + +# Get current state +model = client.get_registered_model(MODEL_NAME) +old_champion = model.aliases.get("champion") +new_champion = model.aliases.get("challenger") + +if new_champion is None: + raise RuntimeError("No @challenger set — nothing to promote") + +# Move the alias (atomic — downstream loaders see the switch on next load) +client.set_registered_model_alias(MODEL_NAME, "champion", new_champion) + +# Optional: archive the old champion version with a custom alias +if old_champion: + client.set_registered_model_alias(MODEL_NAME, f"archived_{old_champion}", old_champion) + +# Remove the @challenger alias +client.delete_registered_model_alias(MODEL_NAME, "challenger") + +print(f"Promoted v{new_champion} from @challenger to @champion (was v{old_champion})") +``` + +**Rollback** is the inverse — move `@champion` back to the previous version. + +--- + +## Pattern 6: List all model versions + +Useful for lineage inspection or cleanup. + +```sql +SHOW MODEL VERSIONS ON MODEL my_catalog.my_schema.grocery_forecaster; +``` + +Or via Python: +```python +from mlflow import MlflowClient +client = MlflowClient() + +versions = client.search_model_versions( + filter_string=f"name='my_catalog.my_schema.grocery_forecaster'", + order_by=["version_number DESC"], +) +for v in versions: + print(f"v{v.version}: run_id={v.run_id}, status={v.status}, aliases={v.aliases}") +``` + +--- + +## Pattern 7: Tags — richer metadata without new versions + +Tags are key-value metadata on the registered model (or a specific version). 
Useful for: +- Team ownership: `set_model_version_tag(name, "1", "team", "forecasting")` +- Dataset provenance: `set_model_version_tag(name, "1", "dataset_version", "2024-Q4")` +- Review status: `set_model_version_tag(name, "1", "reviewed", "true")` + +```python +from mlflow import MlflowClient +client = MlflowClient() + +# Tag on the registered model (applies to all versions) +client.set_registered_model_tag( + name="my_catalog.my_schema.grocery_forecaster", + key="domain", + value="retail", +) + +# Tag on a specific version +client.set_model_version_tag( + name="my_catalog.my_schema.grocery_forecaster", + version="3", + key="reviewed_by", + value="jane@company.com", +) +``` + +Tags are queryable via `search_model_versions(filter_string="tags.reviewed = 'true'")`. + +--- + +## Permission requirements + +| Operation | Permission needed | Granted via | +|-----------|-------------------|-------------| +| `register_model` (first version of a model) | `CREATE MODEL ON SCHEMA ` | `GRANT CREATE MODEL ON SCHEMA ... TO ...` | +| `register_model` (new version of existing) | `EDIT ON MODEL ` | Automatic for model owner; otherwise grant | +| `set_registered_model_alias` | `EDIT ON MODEL ` | Same as above | +| `get_registered_model` / `DESCRIBE MODEL` | `USE CATALOG` + `USE SCHEMA` + `EXECUTE ON MODEL` | Standard read grants | +| `load_model` | `EXECUTE ON MODEL ` | `GRANT EXECUTE ON MODEL ... TO ...` | + +If any of these fail, request the specific grant from the schema owner. See `GOTCHAS.md` #7. diff --git a/databricks-skills/databricks-mlflow-ml/references/user-journeys.md b/databricks-skills/databricks-mlflow-ml/references/user-journeys.md new file mode 100644 index 00000000..a72f9106 --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/references/user-journeys.md @@ -0,0 +1,195 @@ +# user-journeys + +End-to-end workflows with decision points. Read the journey that matches your situation. 
+
+---
+
+## Journey 1: First model (train → register → score) — the 90%-case
+
+Most users arrive here. Goal: a UC-registered model with a `@champion` alias, producing batch predictions.
+
+**Prerequisites:**
+- UC catalog + schema where you have `CREATE MODEL` permission
+- A UC volume for MLflow artifacts (create if missing — `patterns-experiment-setup.md` Pattern 2)
+- Features in a Spark table (Bronze → Silver → Gold already done)
+
+**Steps:**
+
+1. **Set up the experiment** (`patterns-experiment-setup.md` Pattern 1)
+   - `mlflow.set_registry_uri("databricks-uc")`
+   - `mlflow.create_experiment(name=..., artifact_location=...)` if missing, then `mlflow.set_experiment(...)`
+2. **Train + log** (`patterns-training.md` Pattern 1 or 2)
+   - Always include `signature` and `input_example`
+   - If you have preprocessing, wrap in `sklearn.Pipeline` (Pattern 2)
+3. **Register** (`patterns-uc-registration.md` Pattern 1)
+   - `mlflow.register_model(f"runs:/{run_id}/model", "catalog.schema.model")`
+4. **Set alias** (`patterns-uc-registration.md` Pattern 3)
+   - `client.set_registered_model_alias(name, "champion", version)`
+5. **Verify** (`patterns-uc-registration.md` Pattern 4)
+   - `DESCRIBE MODEL catalog.schema.model` OR Catalog Explorer UI
+6. **Load + score** (`patterns-batch-inference.md` Pattern 1 or 2)
+   - `model = mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`
+   - `model.predict(features_df)`
+
+**Done.** You have a UC-registered model with a canonical loading URI that downstream code can depend on.
+
+---
+
+## Journey 2: Retrain + promote (A/B)
+
+You already have `@champion`. You trained a new version and want to decide whether to promote it.
+
+**Prerequisites:**
+- Model exists in UC with `@champion` set (you did Journey 1)
+- New training run logged to the same experiment
+
+**Steps:**
+
+1. **Register new version** (`patterns-uc-registration.md` Pattern 1)
+   - Same `MODEL_NAME` as before — UC auto-increments version
+2. **Set `@challenger`** (`patterns-uc-registration.md` Pattern 3)
+   - `client.set_registered_model_alias(name, "challenger", new_version)`
+3. **A/B validate** (`patterns-batch-inference.md` Pattern 5)
+   - Load both aliases, score validation set, compare metrics
+4. **Decide**:
+   - Challenger wins → **Pattern 5 in `patterns-uc-registration.md`**: swap aliases
+   - Champion wins → delete `@challenger` alias, keep current `@champion`
+5. **Verify** downstream loaders picked up the new version (after swap)
+   - Any code using `models:/...@champion` will see the new version on next load
+
+---
+
+## Journey 3: Lakeflow SDP batch pipeline
+
+You want predictions to land in a scheduled gold table, not an ad-hoc notebook.
+
+**Prerequisites:**
+- Model registered with `@champion` (Journey 1 complete)
+- Lakeflow SDP pipeline defined (one already running is ideal)
+
+**Steps:**
+
+1. **Add a new file** to the pipeline source: `src/gold/gold_forecast.py`
+2. **Construct the UDF at module scope** (`patterns-batch-inference.md` Pattern 3)
+   - `mlflow.set_registry_uri("databricks-uc")`
+   - `predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/...@champion", result_type="double")`
+3. **Define the `@dp.materialized_view`** that reads silver features, applies the UDF
+4. **Deploy + run** the pipeline
+   - `databricks bundle deploy && databricks bundle run <pipeline_name>`
+5. **Verify** the `gold_forecast` table materializes
+   - Row count matches `silver_features`
+   - Query from Genie or SQL editor
+
+**Do NOT use `ai_query`** in this pipeline — see `GOTCHAS.md` #9.
+
+---
+
+## Journey 4: Debug a registration that went to workspace registry
+
+The #1 support question. Symptoms: model doesn't appear in Catalog Explorer; URL contains `/ml/models/` instead of `/explore/data/models/`.
+
+**Steps:**
+
+1. Confirm the diagnosis:
+   - Catalog Explorer → catalog → schema → Models tab: **missing**
+   - MLflow icon (left sidebar) → Models: **present**
+   - That's the workspace registry, not UC
+2. Verify registry URI in the training session
+   - `mlflow.get_registry_uri()` — should return `"databricks-uc"`, not a workspace URI
+3. If the URI was wrong, fix it and re-register:
+   - Add `mlflow.set_registry_uri("databricks-uc")` at the top of the training code
+   - Re-run `mlflow.register_model(...)` — this creates a new entry in UC
+   - The orphaned workspace-registry entry can be deleted via MLflow UI (optional)
+4. Set the `@champion` alias on the new UC version
+5. Verify via `DESCRIBE MODEL` — see `patterns-uc-registration.md` Pattern 4
+
+---
+
+## Journey 5: Debug a `pyfunc.load_model` that fails or predicts wrong
+
+Model loaded successfully, but `.predict()` raises or produces nonsense.
+
+**Steps:**
+
+1. **Check the signature was logged:**
+   ```python
+   from mlflow.models import get_model_info
+   info = get_model_info("models:/...@champion")
+   print(info.signature)
+   ```
+   If `None` — see `GOTCHAS.md` #8. Re-log the model with `signature=infer_signature(...)`.
+
+2. **Check the input column order:**
+   ```python
+   expected = model.metadata.get_input_schema().input_names()
+   print(f"Model expects: {expected}")
+   print(f"You passed: {list(features_df.columns)}")
+   ```
+   If the order differs, pass `features_df[expected]`.
+
+3. **Check preprocessing coverage:**
+   - Does the training notebook call a scaler / encoder / imputer before fitting?
+   - Is that preprocessing in the logged artifact?
+   - If not — see `GOTCHAS.md` #14. Re-train with preprocessing wrapped in `sklearn.Pipeline`.
+
+4. **Check for type coercion:**
+   - Integer column becoming float (or vice versa) — fine for sklearn, sometimes breaks for xgboost/pytorch
+   - Categorical as string vs int — depends on the flavor
+   - Fix: cast `features_df` to match `model.metadata.get_input_schema()` dtypes before predicting
+
+---
+
+## Journey 6: Schema evolution — your features changed since the model was logged
+
+The silver features pipeline added a new column.
Your deployed `@champion` model was trained without it. Predictions still work (extra columns are ignored), but you want to include the new feature. + +**Steps:** + +1. Retrain with the new feature: + ```python + # Same Journey 1 steps, but with expanded feature set + mlflow.sklearn.log_model( + sk_model=new_pipeline, + artifact_path="model", + signature=infer_signature(X_train_expanded, new_pipeline.predict(X_train_expanded[:5])), + input_example=X_train_expanded.iloc[:5], + ) + ``` +2. Register as a new version +3. Validate via A/B (Journey 2) +4. Promote to `@champion` + +Schema changes are always a new version. Never mutate a logged model in place. + +--- + +## Journey 7: "Everything is on fire, I have 10 minutes to demo" + +Someone registered a fallback model. Load it. + +```python +import mlflow +mlflow.set_registry_uri("databricks-uc") +model = mlflow.pyfunc.load_model( + "models:/..@fallback" +) +features = spark.table("..sample_features").limit(500).toPandas() +features["prediction"] = model.predict(features) +display(spark.createDataFrame(features)) +``` + +Every escape-hatch pattern should pre-register a `@fallback` version for exactly this case. + +--- + +## When to use which journey + +| Situation | Journey | +|-----------|---------| +| I'm starting from zero | 1 | +| I have `@champion`, trained something new | 2 | +| I want predictions in a scheduled table | 3 | +| Registered but can't find in Catalog Explorer | 4 | +| `load_model` succeeds but `predict` fails | 5 | +| My features changed | 6 | +| Demo in 10 minutes, nothing works | 7 | From deb6a30a3f850245ae09bea0d7dd3e481cb5cefa Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sun, 19 Apr 2026 23:04:18 +1000 Subject: [PATCH 2/5] docs(mlflow-ml): add two gotchas from real-world test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Field-tested the skill end-to-end from a local Python environment against a live Databricks workspace. 
Surfaced two gotchas not in the original set: #12 mlflow[databricks] extras missing when running outside Databricks: plain `pip install mlflow` omits azure-core / boto3 / google.cloud SDKs that UC registration needs to stage artifacts. Training + log_model work; register_model fails with opaque "No module named 'azure'". Databricks clusters ship the extras pre-installed, so this only bites laptops / CI. #13 artifact_path= deprecated in favour of name= (MLflow 2.16+): emits warning on every log_model call. Non-blocking, but worth flagging since most online tutorials + training courses still use the old param. Both verified against the workshop's test run — skill workflow 1 now completes cleanly with these fixes documented. --- .../references/GOTCHAS.md | 40 ++++++++++++++++++- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md index 92615de2..a2ab11d4 100644 --- a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md +++ b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md @@ -1,6 +1,6 @@ # GOTCHAS — Classic ML on MLflow + Unity Catalog -Twelve mistakes that silently waste hours. Read before writing any code. +Fourteen mistakes that silently waste hours. Read before writing any code. --- @@ -236,7 +236,43 @@ def gold_forecast(): --- -## 12. Custom preprocessing not captured in the logged model +## 12. `mlflow[databricks]` extras missing when running outside Databricks + +**Symptom:** training + logging works; `register_model` fails with `MlflowException: Unable to import necessary dependencies to access model version files in Unity Catalog` — root cause `ModuleNotFoundError: No module named 'azure'` (for Azure-hosted workspaces) or `'boto3'` (AWS) / `'google.cloud'` (GCP). + +**Fix:** install the `databricks` extras, which pull cloud-storage SDKs MLflow needs to stage artifacts into the UC-managed location. 
+ +```bash +pip install 'mlflow[databricks]' +# or, for a lighter install: +pip install 'mlflow-skinny[databricks]' +``` + +**Why it bites:** plain `pip install mlflow` leaves out the cloud-provider SDKs because they're large and most local workflows don't need them. UC registration REQUIRES them because the registry stages artifacts into cloud-managed storage (Azure ADLS / S3 / GCS), and MLflow uses the provider's SDK for the upload. Local `log_model` works fine (artifacts go to the tracking server); registration doesn't. + +**When it most commonly hits:** running training scripts from a laptop, CI runner, or non-Databricks compute — anywhere that isn't a Databricks cluster (which ships the extras pre-installed). + +--- + +## 13. `artifact_path=` parameter is deprecated; new name is `name=` + +**Symptom:** warning in logs: `WARNING mlflow.models.model: `artifact_path` is deprecated. Please use `name` instead.` Still works today; may break in a future MLflow major version. + +**Fix:** use `name=` instead of `artifact_path=` in `log_model` calls. + +```python +# OLD (still works, warns) +mlflow.sklearn.log_model(sk_model=model, artifact_path="model", ...) + +# NEW (preferred, no warning) +mlflow.sklearn.log_model(sk_model=model, name="model", ...) +``` + +**Why it bites:** most online tutorials and training courses still use `artifact_path`. The rename shipped in MLflow 2.16. `name=` semantics are identical — still the within-run artifact folder. It's a rename of the parameter, not of what the parameter represents. + +--- + +## 14. Custom preprocessing not captured in the logged model +**Symptom:** in the training notebook, predictions are accurate. After `pyfunc.load_model(...)`, predictions are garbage. The pipeline works in training because you're calling `scaler.transform()` manually; at inference time, nobody calls the scaler.
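The fix for the preprocessing gotcha above is to log the whole sklearn `Pipeline` object rather than the bare estimator, so the scaler travels with the model. A minimal sketch with synthetic data (all names here are illustrative, not from the skill):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=25.0, size=(200, 3))  # deliberately unscaled features
y = (X[:, 0] > 100.0).astype(int)

# Bundle the scaler INTO the model object: pipeline.predict() scales internally,
# so whoever loads the logged model gets the preprocessing for free.
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X, y)

# Equivalent manual path — this is the step people forget to replay at inference time.
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)
manual = clf.predict(scaler.transform(X))

# Same predictions, but only the Pipeline carries them through serialization.
assert (pipeline.predict(X) == manual).all()

# Then log the Pipeline, not `clf`:
#   mlflow.sklearn.log_model(sk_model=pipeline, name="model", ...)
```

The point is that `mlflow.sklearn.log_model(sk_model=pipeline, ...)` serializes the fitted scaler and estimator together, so `pyfunc.load_model(...).predict(raw_df)` reproduces training-time behaviour.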
From cf211958d34aa090f205345cc244f17941828abd Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sun, 19 Apr 2026 23:10:15 +1000 Subject: [PATCH 3/5] =?UTF-8?q?docs(mlflow-ml):=20runtime=20claim=20?= =?UTF-8?q?=E2=80=94=20MLflow=203.11=20on=20serverless=20compute=20v5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Original SKILL.md didn't state a runtime target. Adds a "Runtime compatibility" section anchored on what the skill was actually tested against — MLflow 3.11 on Lakeflow SDP serverless compute v5 — with a compat note for MLflow 2.16+ (classic DBR 15.4 LTS still ships 2.x). Points at GOTCHAS.md for the 3.x-vs-2.x divergence (artifact_path deprecation, etc.). --- databricks-skills/databricks-mlflow-ml/SKILL.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/databricks-skills/databricks-mlflow-ml/SKILL.md b/databricks-skills/databricks-mlflow-ml/SKILL.md index 43d4a2ed..cb3f7d0b 100644 --- a/databricks-skills/databricks-mlflow-ml/SKILL.md +++ b/databricks-skills/databricks-mlflow-ml/SKILL.md @@ -123,3 +123,7 @@ If you're training a forecasting / classification / regression model, registerin - [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion - [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger - [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points + +## Runtime compatibility + +Patterns verified against **MLflow 3.11** on **Lakeflow SDP serverless compute version 5** (default at time of writing). All APIs used (`set_registry_uri`, `log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf`) are compatible with MLflow 2.16+ as well, so the patterns work on older classic Databricks Runtimes that still ship 2.x. 
Where 3.x behaviour diverges (e.g., `artifact_path` deprecation → use `name=`), GOTCHAS.md calls it out. From 1a4a608d75a87488e7287f4b92ab96b9057dc8e7 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sat, 9 May 2026 15:26:17 +1000 Subject: [PATCH 4/5] docs(mlflow-ml): densify per Quentin's audit (gpt-5.5 in logfood) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Quentin posted a Claude-generated audit on PR #474 specifying the restructure. Ran gpt-5.5 in logfood with the audit as the spec. Changes: 8 files / 1,666 lines → 3 files / 485 lines (71% reduction). Structure: - SKILL.md (91 lines) — frontmatter, 3-skill comparison table, hard rules, Quick Start, decision table for situation→recipe routing, read-order instruction at top, negative list ("don't read X-pattern.md for sklearn 101"). - references/gotchas.md (161 lines) — only Databricks/UC-specific failures: silently-wrong workspace registry, three-level UC names, artifact_location UC volume in UC-enforced workspaces, alias-on-stage no-op, CREATE MODEL ON SCHEMA grant, ai_query vs custom-model batch, spark_udf module-scope in Lakeflow SDP, mlflow[databricks] extras, artifact_path→name deprecation. Each entry: symptom + silent/loud + fix + one-sentence why. - references/recipes.md (233 lines) — UC-specific code shapes only: experiment + UC volume setup, log→register→alias canonical pattern, Lakeflow SDP spark_udf module-scope, A/B alias swap order, verification one-liners. 
Deleted (per Quentin's audit): - references/CRITICAL-interfaces.md (90% plain MLflow API) - references/GOTCHAS.md (replaced by lowercase gotchas.md, dropping the generic entries: alias-not-version, verify-after-register, signature basics, version reuse, Pipeline preprocessing — all generic MLflow / sklearn knowledge) - references/user-journeys.md (pure pointer-shuffling) - references/patterns-experiment-setup.md - references/patterns-training.md - references/patterns-uc-registration.md - references/patterns-batch-inference.md Workflow tables in SKILL.md replaced by a 6-row decision table. Common Issues table consolidated into gotchas.md. Reference Files list dropped — Claude can ls. Co-authored-by: Isaac --- .../databricks-mlflow-ml/SKILL.md | 136 ++++----- .../references/CRITICAL-interfaces.md | 219 --------------- .../references/GOTCHAS.md | 260 ++++-------------- .../references/patterns-batch-inference.md | 244 ---------------- .../references/patterns-experiment-setup.md | 141 ---------- .../references/patterns-training.md | 205 -------------- .../references/patterns-uc-registration.md | 232 ---------------- .../references/recipes.md | 233 ++++++++++++++++ .../references/user-journeys.md | 195 ------------- 9 files changed, 342 insertions(+), 1523 deletions(-) delete mode 100644 databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md delete mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md delete mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md delete mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-training.md delete mode 100644 databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md create mode 100644 databricks-skills/databricks-mlflow-ml/references/recipes.md delete mode 100644 databricks-skills/databricks-mlflow-ml/references/user-journeys.md diff --git a/databricks-skills/databricks-mlflow-ml/SKILL.md 
b/databricks-skills/databricks-mlflow-ml/SKILL.md index cb3f7d0b..26286e8c 100644 --- a/databricks-skills/databricks-mlflow-ml/SKILL.md +++ b/databricks-skills/databricks-mlflow-ml/SKILL.md @@ -5,125 +5,87 @@ description: "Classic ML model lifecycle on Databricks with MLflow and Unity Cat # MLflow + Unity Catalog — Classic ML -## Before Writing Any Code +Read this file fully; consult `references/gotchas.md` before writing UC code; consult `references/recipes.md` only for the alias-swap and `spark_udf` patterns. -1. **Read `GOTCHAS.md`** — 12 common mistakes that cause silent failures or wasted time -2. **Read `CRITICAL-interfaces.md`** — exact API signatures and the `models:/` URI format +If you're tempted to read `patterns-training.md`, `patterns-experiment-setup.md`, `patterns-uc-registration.md`, or `patterns-batch-inference.md` to figure out basic sklearn training, stop — you don't need them. This skill is only about the Databricks / Unity Catalog parts that are easy to miss. -## End-to-End Workflows - -Follow the workflow that matches your goal. Each step indicates which reference files to read. - -### Workflow 1: Train → Register → Batch Score (most common) - -For building a production-shape classic ML model with UC-native lineage. Covers the full path from raw features to predictions in a downstream table. 
- -| Step | Action | Reference Files | -|------|--------|-----------------| -| 1 | Create experiment with UC volume artifact_location | `patterns-experiment-setup.md` (Pattern 1) | -| 2 | Train model with signature + input_example | `patterns-training.md` (Patterns 1–3) | -| 3 | Register to Unity Catalog with three-level name | `patterns-uc-registration.md` (Patterns 1–2) | -| 4 | Set `@champion` alias | `patterns-uc-registration.md` (Pattern 3) | -| 5 | Verify registration (Navigator check) | `patterns-uc-registration.md` (Pattern 4) + `GOTCHAS.md` #5 | -| 6 | Load + score in notebook (Tier 1) | `patterns-batch-inference.md` (Patterns 1–2) | -| 7 | Optional: Lakeflow SDP batch via `spark_udf` | `patterns-batch-inference.md` (Patterns 3–4) | - -### Workflow 2: Retrain + Promote (A/B pattern) - -For adding a new version of an already-registered model and promoting it without touching downstream loader code. +## Why This Skill Exists -| Step | Action | Reference Files | -|------|--------|-----------------| -| 1 | Train new version, log to same UC model name | `patterns-training.md` (Pattern 4) | -| 2 | Register as new version | `patterns-uc-registration.md` (Pattern 2) | -| 3 | Set `@challenger` alias | `patterns-uc-registration.md` (Pattern 3) | -| 4 | Validate `@challenger` predictions vs `@champion` | `patterns-batch-inference.md` (Pattern 5) | -| 5 | Swap aliases (`@challenger` → `@champion`) | `patterns-uc-registration.md` (Pattern 5) | +Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. -Downstream loader code that uses `models:/catalog.schema.model@champion` picks up the new version on next load — no code change needed. 
+| Skill | Scope | MLflow API Surface | +|-------|-------|--------------------| +| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces | +| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` | +| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` | -### Workflow 3: Debugging a Failed Registration or Load +Use this skill when training forecasting / classification / regression models, registering them to Unity Catalog, and scoring them in a notebook or Lakeflow pipeline. Do not use it for GenAI evaluation or Model Serving endpoint management. -For the two most common support questions: "why did my model go to workspace registry?" and "why does pyfunc.load_model fail?" +## Hard Rules -| Step | Action | Reference Files | -|------|--------|-----------------| -| 1 | Verify registry URI is set to `databricks-uc` | `GOTCHAS.md` #1 | -| 2 | Verify three-level name | `GOTCHAS.md` #2 | -| 3 | Confirm model appears in Catalog Explorer | `patterns-uc-registration.md` (Pattern 4) | -| 4 | Check `CREATE MODEL` permissions | `GOTCHAS.md` #7 | -| 5 | Diagnose load failures | `GOTCHAS.md` #3, #8, #11 | +1. Call `mlflow.set_registry_uri("databricks-uc")` before registering or loading UC models. +2. UC model names are always three-level: `catalog.schema.model_name`. +3. Load by alias, not version: `models:/catalog.schema.model@champion`, not `models:/catalog.schema.model/3`. +4. In UC-enforced workspaces, experiments need `artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>"`. +5. `register_model` creates a version; it does **not** set `@champion` or `@challenger`. +6. Use aliases for lifecycle. Legacy stages like `Production` / `Staging` are deprecated for UC models.
## Quick Start -The minimum viable path from untrained model to UC-registered, notebook-scored: +Minimum viable path from trained model object to UC-registered, notebook-scored model: ```python import mlflow -from mlflow.models import infer_signature +import mlflow.sklearn from mlflow import MlflowClient +from mlflow.models import infer_signature + +CATALOG = "my_catalog" +SCHEMA = "my_schema" +MODEL_NAME = f"{CATALOG}.{SCHEMA}.my_model" -# 1. Configure: UC registry + UC volume for artifacts (both required) +# 1. Configure UC registry + UC volume-backed experiment. mlflow.set_registry_uri("databricks-uc") -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", -) +EXPERIMENT = "/Users/me@company.com/forecasting" +# artifact_location can only be set at creation time; mlflow.set_experiment() does not accept it. +if mlflow.get_experiment_by_name(EXPERIMENT) is None: + mlflow.create_experiment( + EXPERIMENT, + artifact_location=f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/mlflow_artifacts/forecasting", + ) +mlflow.set_experiment(EXPERIMENT) -# 2. Train + log +# 2. Train + log. Use name="model" in MLflow 3.x; artifact_path="model" only for older code. with mlflow.start_run() as run: model.fit(X_train, y_train) signature = infer_signature(X_train, model.predict(X_train[:5])) + mlflow.sklearn.log_model( - sk_model=model, - artifact_path="model", + sk_model=model, # log the full Pipeline if preprocessing exists + name="model", signature=signature, input_example=X_train.iloc[:5], ) -# 3. Register + alias -MODEL_NAME = "my_catalog.my_schema.my_model" -result = mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME) +# 3. Register + set alias. register_model returns a ModelVersion; alias is a separate call. +result = mlflow.register_model( + model_uri=f"runs:/{run.info.run_id}/model", + name=MODEL_NAME, +) MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version) -# 4. Load + predict (in any notebook, anywhere) -model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") -predictions = model.predict(X_test) +# 4. Load by alias, never by hard-coded version.
+loaded = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") +predictions = loaded.predict(X_test) ``` -## Why This Skill Exists - -Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. The distinction matters because the APIs diverged: - -| Skill | Scope | MLflow API Surface | -|-------|-------|--------------------| -| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces | -| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` | -| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` | - -If you're training a forecasting / classification / regression model, registering it to UC, and scoring it in a notebook or Lakeflow pipeline — this skill. If you're evaluating an LLM agent's output quality — evaluation skill. If you're exposing a model behind an HTTP endpoint — model-serving skill. - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **Model registered but not visible in Catalog Explorer** | Missing `mlflow.set_registry_uri("databricks-uc")`. See `GOTCHAS.md` #1. | -| **`RestException: INVALID_PARAMETER_VALUE` on `register_model`** | Two-level name used. UC requires `catalog.schema.name`. See `GOTCHAS.md` #2. | -| **Experiment creation fails with storage errors** | Missing `artifact_location` pointing at a UC volume. See `GOTCHAS.md` #4. | -| **`PERMISSION_DENIED: CREATE MODEL`** | Pair/user needs `CREATE MODEL ON SCHEMA `. See `GOTCHAS.md` #7. | -| **`pyfunc.load_model` returns but `predict()` fails** | Signature wasn't logged; inputs don't coerce. See `GOTCHAS.md` #8. | -| **Agent proposes `ai_query` for batch inference** | Wrong primitive — that requires a serving endpoint. 
Use `pyfunc.load_model` or `spark_udf`. See `GOTCHAS.md` #9. | - -## Reference Files +## Decision Table -- [`GOTCHAS.md`](references/GOTCHAS.md) — 12 common mistakes + fixes -- [`CRITICAL-interfaces.md`](references/CRITICAL-interfaces.md) — API signatures + `models:/` URI format -- [`patterns-experiment-setup.md`](references/patterns-experiment-setup.md) — experiment creation with UC volume artifact_location -- [`patterns-training.md`](references/patterns-training.md) — logging models with signature + input_example + autologging -- [`patterns-uc-registration.md`](references/patterns-uc-registration.md) — register + alias + verify + A/B promotion -- [`patterns-batch-inference.md`](references/patterns-batch-inference.md) — notebook (`pyfunc.load_model`) + Lakeflow (`spark_udf`) + champion-vs-challenger -- [`user-journeys.md`](references/user-journeys.md) — end-to-end workflows with decision points +| Situation | Do this | +|-----------|---------| +| Starting a first UC-registered classic ML model | Quick Start, then `recipes.md` §1–2; check `gotchas.md` #1, #2, #4, #7 | +| Model registered but missing from Catalog Explorer | Diagnose `set_registry_uri` and three-level names in `gotchas.md` #1–2 | +| Need notebook batch scoring | Use `mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`; keep the alias rule above | +| Need scheduled / distributed batch scoring in Lakeflow SDP | Use `recipes.md` §3 and `gotchas.md` #11; construct `spark_udf` at module scope | +| Retrained a challenger and need promotion | Use `recipes.md` §4 exactly; delete old `@champion` before setting new `@champion` | +| Load or predict behaves oddly | Use `recipes.md` §5 for `get_model_info` / signature checks, then `gotchas.md` for UC-specific failures | -## Runtime compatibility +## Runtime Compatibility -Patterns verified against **MLflow 3.11** on **Lakeflow SDP serverless compute version 5** (default at time of writing). 
All APIs used (`set_registry_uri`, `log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf`) are compatible with MLflow 2.16+ as well, so the patterns work on older classic Databricks Runtimes that still ship 2.x. Where 3.x behaviour diverges (e.g., `artifact_path` deprecation → use `name=`), GOTCHAS.md calls it out. +MLflow 3.x prefers `name=` in `log_model`; MLflow 2.x examples often use `artifact_path=`, which works but warns in newer versions. UC model stages are deprecated across modern Databricks runtimes; use aliases. diff --git a/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md b/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md deleted file mode 100644 index a40483c5..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/CRITICAL-interfaces.md +++ /dev/null @@ -1,219 +0,0 @@ -# CRITICAL-interfaces — Exact API signatures - -The minimum set of APIs that every classic-ML + UC workflow touches. Copy-pasteable, with the exact arguments that matter. - ---- - -## Registry URI configuration - -```python -mlflow.set_registry_uri("databricks-uc") # Call at the start of every session -mlflow.get_registry_uri() # Returns "databricks-uc" if set correctly -``` - -**Must be called BEFORE** any `register_model` or `load_model` call. Idempotent to repeat. - ---- - -## Experiment creation with UC volume artifact_location - -```python -# artifact_location is a create-time setting; set_experiment() has no such parameter. -mlflow.create_experiment( - name="/Users/<user>/<experiment>", - artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<experiment>", -) -mlflow.set_experiment("/Users/<user>/<experiment>") -``` - -**`artifact_location` is required** for UC-enforced workspaces. The volume must exist: - -```sql -CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.<volume>; -``` - ---- - -## `models:/` URI format - -All load / deploy / spark_udf calls use this URI.
**One format to memorize:** - -``` -models:/<catalog>.<schema>.<model>@<alias> -``` - -Examples: -``` -models:/my_catalog.my_schema.grocery_forecaster@champion -models:/my_catalog.my_schema.grocery_forecaster@challenger -``` - -**Avoid** these forms (either legacy, or not-UC-native): -``` -models:/grocery_forecaster/3 # workspace registry, version number -models:/my_schema.grocery_forecaster/3 # invalid in UC -``` - ---- - -## Model logging (sklearn-flavored) - -```python -mlflow.sklearn.log_model( - sk_model=<fitted estimator or Pipeline>, - artifact_path="model", # convention — keep as "model" - signature=<ModelSignature>, # REQUIRED — use infer_signature() - input_example=<5-row DataFrame>, # REQUIRED — 5 real rows - registered_model_name=None, # leave None; register separately (cleaner) - code_paths=<list of local code dirs>, - extra_pip_requirements=<list>, # only if custom deps beyond environment -) -``` - -**Signature inference:** -```python -from mlflow.models import infer_signature -signature = infer_signature(X_train, model.predict(X_train[:5])) -``` - -**Other flavors with identical signature:** -- `mlflow.xgboost.log_model(xgb_model=..., ...)` -- `mlflow.pytorch.log_model(pytorch_model=..., ...)` -- `mlflow.tensorflow.log_model(model=..., ...)` -- `mlflow.pyfunc.log_model(python_model=..., artifact_path=..., ...)` — for custom PythonModel wrappers - ---- - -## Explicit registration - -```python -result = mlflow.register_model( - model_uri=f"runs:/{run_id}/model", # "runs:/<run_id>/<artifact_path>" - name="<catalog>.<schema>.<model>", # three-level, not optional - tags=<dict>, -) -# result.name: str — fully qualified name -# result.version: str — newly-created version (e.g., "1", "2") -``` - ---- - -## Alias management - -```python -from mlflow import MlflowClient -client = MlflowClient() - -# Set (creates if missing, moves if exists) -client.set_registered_model_alias( - name="<catalog>.<schema>.<model>", - alias="champion", # or "challenger", or custom - version="<version>", # accepts str or int -) - -# Get current alias mapping -model = client.get_registered_model("<catalog>.<schema>.<model>") -print(model.aliases) # {"champion": "3", "challenger": "4"} - -# Delete
-client.delete_registered_model_alias( - name="<catalog>.<schema>.<model>", - alias="challenger", -) -``` - ---- - -## Loading — notebook / single-node - -```python -model = mlflow.pyfunc.load_model( - model_uri="models:/<catalog>.<schema>.<model>@champion", -) - -# Predict on a pandas DataFrame matching the signature -predictions = model.predict(features_df) -``` - -**Returns:** `mlflow.pyfunc.PyFuncModel`, regardless of the original flavor. Inspect `.metadata.signature` for the schema. - ---- - -## Loading — distributed / Lakeflow SDP - -```python -predict_udf = mlflow.pyfunc.spark_udf( - spark, - model_uri="models:/<catalog>.<schema>.<model>@champion", - result_type="double", # or "array<double>" for multi-output - env_manager="local", # "local" | "virtualenv" | "conda" -) - -# Apply to a Spark DataFrame -df_with_predictions = df.withColumn( - "prediction", - predict_udf("feature_a", "feature_b", "feature_c"), -) -``` - -**Construct ONCE at module scope** in Lakeflow pipelines. See `GOTCHAS.md` #11. - ---- - -## Model introspection - -```python -from mlflow.models import get_model_info - -info = get_model_info("models:/<catalog>.<schema>.<model>@champion") -info.signature # ModelSignature with inputs/outputs -info.flavors # {"sklearn": {...}, "python_function": {...}} -info.utc_time_created -info.model_uuid -``` - -Useful when debugging load-vs-predict mismatches. - ---- - -## Run + experiment queries (introspection) - -```python -runs = mlflow.search_runs( - experiment_names=["/Users/me@company.com/forecasting"], - filter_string="metrics.r2 > 0.8", - order_by=["metrics.r2 DESC"], - max_results=5, -) -# Returns a pandas DataFrame with run_id, metrics, params, etc. - -best_run_id = runs.iloc[0]["run_id"] -``` - ---- - -## SQL introspection (UC-native) - -```sql --- Does the model exist and which aliases are set?
-DESCRIBE MODEL <catalog>.<schema>.<model>; - --- List all model versions -SHOW MODEL VERSIONS ON MODEL <catalog>.<schema>.<model>; - --- Check grants -SHOW GRANTS ON MODEL <catalog>.<schema>.<model>; -SHOW GRANTS ON SCHEMA <catalog>.<schema>; -``` - ---- - -## What's NOT in this skill - -If you see these in code, you're likely in the wrong skill: - -| API | Belongs in | -|-----|------------| -| `mlflow.genai.evaluate(...)` | `databricks-mlflow-evaluation` | -| `@scorer` decorator, `GuidelinesJudge`, etc. | `databricks-mlflow-evaluation` | -| `databricks.sdk.service.serving.EndpointCoreConfigInput` | `databricks-model-serving` | -| `ai_query('<endpoint>', ...)` | Wrong pattern — use `pyfunc.load_model` or `spark_udf` instead (see `GOTCHAS.md` #9) | -| `transition_model_version_stage(...)` | Deprecated — use aliases (see `GOTCHAS.md` #6) | diff --git a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md index a2ab11d4..586b8ce6 100644 --- a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md +++ b/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md @@ -1,301 +1,161 @@ -# GOTCHAS — Classic ML on MLflow + Unity Catalog +# Databricks / Unity Catalog Gotchas -Fourteen mistakes that silently waste hours. Read before writing any code. +Only the Databricks + Unity Catalog-specific failures are here. Generic MLflow, sklearn, and modeling advice intentionally lives elsewhere. ---- - -## 1. Missing `mlflow.set_registry_uri("databricks-uc")` → workspace registry - -**Symptom:** `register_model` succeeds, but the model doesn't appear in Catalog Explorer. It's in the legacy **workspace registry** (visible under the MLflow icon in the left nav), not Unity Catalog.
- -**Fix:** -```python -import mlflow -mlflow.set_registry_uri("databricks-uc") # MUST come before register_model / load_model -``` +## Runtime Gotcha Matrix -**Verification:** -```python -assert mlflow.get_registry_uri() == "databricks-uc" -``` - -**Why it bites:** defaults still route to the workspace registry for backward compatibility. The only indicator you missed it is a URL that shows `/ml/models/` instead of `/explore/data/models///`. +| Area | MLflow 2.x | MLflow 3.x / newer Databricks guidance | +|------|------------|-----------------------------------------| +| Model artifact argument | `artifact_path="model"` is common | Prefer `name="model"`; `artifact_path` warns and may disappear later | +| UC lifecycle | Stages already deprecated for UC | Use aliases only: `@champion`, `@challenger`, custom aliases | +| Registry target | Workspace registry remains default unless changed | Still call `mlflow.set_registry_uri("databricks-uc")` explicitly | --- -## 2. Two-level model names → rejected or wrong registry +## 1. Missing `mlflow.set_registry_uri("databricks-uc")` -**Symptom:** `RestException: INVALID_PARAMETER_VALUE: Invalid model name`, or the model registers to the workspace registry silently. +**How it fails:** Silent. `register_model` succeeds, but the model lands in the legacy workspace registry, not Unity Catalog; Catalog Explorer cannot find it. -**Fix:** always use three-level names: `catalog.schema.model_name`. +**Fix:** call this before any register or load: ```python -# WRONG -mlflow.register_model(model_uri, "my_model") -mlflow.register_model(model_uri, "my_schema.my_model") - -# CORRECT -mlflow.register_model(model_uri, "my_catalog.my_schema.my_model") +mlflow.set_registry_uri("databricks-uc") +assert mlflow.get_registry_uri() == "databricks-uc" ``` -**Why it bites:** the error message depends on the registry URI. With UC URI + two-level name → parameter error. 
With workspace URI + two-level name → registers successfully to workspace (the silently-wrong case). +**Why:** MLflow keeps workspace-registry defaults for backward compatibility, so the API call can succeed in the wrong registry. --- -## 3. Loading with version number instead of alias +## 2. Not using a three-level UC model name -**Symptom:** works today, breaks tomorrow when someone registers a new version. You've hard-coded a version number into every downstream consumer. +**How it fails:** Loud with UC registry (`INVALID_PARAMETER_VALUE`), but silent-wrong if you also forgot `set_registry_uri`: two-level names can register to the workspace registry. -**Fix:** load via alias, never version. +**Fix:** always use `catalog.schema.model_name`. ```python -# FRAGILE — every retrain requires updating every loader -model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model/3") +# Wrong +"my_model" +"my_schema.my_model" -# STABLE — promote a new version by moving @champion; no loader changes -model = mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model@champion") +# Correct +"my_catalog.my_schema.my_model" ``` -**Why it bites:** aliases are the UC-native way to decouple loader code from model lifecycle. Version numbers are legacy. New infrastructure (Lakeflow, Genie) assumes alias-based loading. +**Why:** Unity Catalog models are securable objects under a catalog and schema; workspace-registry names are not. --- -## 4. Experiment creation without UC volume `artifact_location` +## 3. Experiment artifact location is not a UC volume -**Symptom:** experiment creates, but any `log_model` call fails with storage / permission errors. Or artifacts land in DBFS root (deprecated) and can't be loaded downstream. +**How it fails:** Usually loud later, not at setup: `log_model` or artifact upload fails with storage / permission errors. In older patterns, artifacts may silently land in DBFS root, which breaks UC governance expectations. 
-**Fix:** when you create the experiment, pin it to a UC volume. +**Fix:** set a UC volume-backed artifact location when creating the experiment. ```python -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", -) +EXPERIMENT = "/Users/me@company.com/forecasting" +# artifact_location is create-time only; mlflow.set_experiment() does not accept it. +if mlflow.get_experiment_by_name(EXPERIMENT) is None: + mlflow.create_experiment( + EXPERIMENT, + artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", + ) +mlflow.set_experiment(EXPERIMENT) ``` -**Why it bites:** the default `artifact_location` used to be DBFS root. Unity-Catalog-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors. Pointing at a UC volume makes artifact storage first-class-governed and keeps lineage intact. - -**When the experiment already exists without a UC volume:** you can't retroactively change `artifact_location`. Either (a) delete + recreate, or (b) create a new experiment. Don't try to relocate artifacts manually. - ---- - -## 5. Trusting `register_model` success without verifying in UC - -**Symptom:** `register_model` returns a `ModelVersion` object. Feels successful. But the model is in workspace registry, or the version number is stale, or an alias wasn't set. - -**Fix:** always verify explicitly. - -```sql --- In a SQL cell or notebook: -DESCRIBE MODEL my_catalog.my_schema.my_model; -``` - -Or via Python: -```python -from mlflow import MlflowClient -model = MlflowClient().get_registered_model("my_catalog.my_schema.my_model") -assert "champion" in model.aliases, "Missing @champion alias" -``` - -Or visually: open Catalog Explorer → `my_catalog` → `my_schema` → **Models** tab. If the model is under MLflow's workspace UI instead, you registered to the wrong place (see #1). - -**Why it bites:** `register_model`'s return value only tells you a version was created. It doesn't tell you *where* or *with what aliases*. The Navigator's V-step in pair programming: verify before trusting.
+**Why:** UC-enforced workspaces reject unmanaged DBFS-root artifact writes; UC volumes keep model artifacts governed and loadable. --- -## 6. Setting the alias to `"production"` or `"staging"` (legacy MLflow stages) +## 4. Using legacy `Production` / `Staging` stages -**Symptom:** you remember MLflow had `stage="Production"` / `"Staging"` transitions. You try the same with aliases and nothing recognizes them. +**How it fails:** Silent or misleading. Stage APIs such as `transition_model_version_stage()` are deprecated / ineffective for UC models; aliases named `"Production"` may exist as labels but are not treated as lifecycle stages. -**Fix:** UC model aliases are free-form labels. The conventions are `@champion` (current winner) and `@challenger` (under evaluation). MLflow stages are deprecated in the UC registry. +**Fix:** use UC aliases by convention: ```python -# WRONG (legacy stage concept) -MlflowClient().set_registered_model_alias(name, "Production", version) - -# CORRECT MlflowClient().set_registered_model_alias(name, "champion", version) +MlflowClient().set_registered_model_alias(name, "challenger", version) ``` -**Why it bites:** the old `transition_model_version_stage()` API still exists but is a no-op on UC-registered models. No error, no effect. +**Why:** Unity Catalog model lifecycle moved from stages to free-form aliases; downstream loaders should use `models:/name@champion`. --- -## 7. Missing `CREATE MODEL ON SCHEMA` permission +## 5. Missing `CREATE MODEL ON SCHEMA` -**Symptom:** `RestException: PERMISSION_DENIED: User ... does not have CREATE MODEL permission`. +**How it fails:** Loud. `register_model` raises `PERMISSION_DENIED: User ... does not have CREATE MODEL permission`. -**Fix:** grant the permission at the schema level. +**Fix:** ask the schema owner for the schema-level model-creation grant. 
```sql GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`; --- Or for a group: -GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `data-science-team`; -``` - -**Why it bites:** workspace admins often assume `USE SCHEMA` covers model registration. It doesn't — `CREATE MODEL` is a separate UC privilege that must be granted explicitly. - -**Verification:** -```sql SHOW GRANTS ON SCHEMA my_catalog.my_schema; ``` ---- - -## 8. Logging a model without `signature` or `input_example` - -**Symptom:** `mlflow.pyfunc.load_model(...)` returns an object, but `.predict(spark_df)` raises cryptic coercion errors. Or predictions silently cast (int → float, string → category) and produce wrong numbers. - -**Fix:** always log both. - -```python -from mlflow.models import infer_signature - -signature = infer_signature(X_train, model.predict(X_train[:5])) -mlflow.sklearn.log_model( - sk_model=model, - artifact_path="model", - signature=signature, - input_example=X_train.iloc[:5], # 5 real rows for the pyfunc wrapper to introspect -) -``` - -**Why it bites:** without a signature, the pyfunc wrapper can't coerce inputs — it accepts whatever you pass, then downstream operations (especially `spark_udf`) fail or produce wrong results. `input_example` is what `pyfunc.load_model` reads to build the wrapper's input coercer. - ---- - -## 9. `ai_query` used for batch inference on a custom UC model - -**Symptom:** you want batch inference on your custom-registered model. You see `ai_query()` in Genie docs and assume it works. It doesn't (for custom models) — `ai_query` only invokes **serving endpoints**, and your UC-registered model isn't behind one unless you deployed a serving endpoint for it. - -**Fix:** for batch inference, use `pyfunc.load_model` (notebook) or `pyfunc.spark_udf` (Lakeflow SDP pipeline). 
- -```python -# WRONG for custom UC models — requires a serving endpoint -spark.sql(f"SELECT ai_query('{MODEL_NAME}', features) FROM silver_features") - -# CORRECT — notebook batch (single node) -model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") -predictions = model.predict(features_pandas_df) - -# CORRECT — Lakeflow SDP batch (distributed) -predict_udf = mlflow.pyfunc.spark_udf(spark, f"models:/{MODEL_NAME}@champion", result_type="double") -silver_features.withColumn("prediction", predict_udf(*feature_cols)) -``` - -**Why it bites:** `ai_query` *is* the right call for Foundation Model API endpoints (`ai_query('databricks-dbrx-instruct', prompt)`). The naming overlap leads to wrong assumptions for custom models. +**Why:** `USE CATALOG` and `USE SCHEMA` are not enough; model creation is a separate UC privilege. --- -## 10. Trying to delete / re-register a model at the same version number +## 6. Assuming `ai_query` is batch inference for custom UC models -**Symptom:** `RestException: ALREADY_EXISTS` when re-registering. You can't reuse version numbers. +**How it fails:** Loud or wrong-primitive. `ai_query` calls serving endpoints; a UC-registered custom model is not automatically a serving endpoint. -**Fix:** UC versions are monotonically-increasing and immutable. To supersede a bad version, register a new version and move `@champion` to it. The old version stays in history for lineage. +**Fix:** for batch inference, use: ```python -new_result = mlflow.register_model(new_run_uri, MODEL_NAME) -MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", new_result.version) -# Old version is still there; that's correct. Lineage preserved. +mlflow.pyfunc.load_model("models:/catalog.schema.model@champion") # notebook / pandas path +mlflow.pyfunc.spark_udf(spark, "models:/catalog.schema.model@champion", result_type="double") ``` -**Why it bites:** habits from the workspace registry (where deletion was forgiving) don't transfer. 
UC treats model versions as first-class auditable artifacts. +**Why:** registration and serving are separate. `ai_query` belongs to Model Serving / Foundation Model endpoint workflows, not ordinary UC batch scoring. --- -## 11. `pyfunc.spark_udf` constructed inside a function call +## 7. Constructing `spark_udf` inside a Lakeflow SDP function -**Symptom:** in a Lakeflow SDP `@dp.materialized_view`, the UDF is constructed every time the view evaluates — slow and sometimes fails with serialization errors. +**How it fails:** Often loud and slow: repeated model deserialization, serialization errors, or pipeline refreshes that hang / retry. Sometimes just silently expensive. -**Fix:** construct the UDF at module scope, reuse it inside the view. +**Fix:** construct the UDF once at module scope and call it inside `@dp.table` / `@dp.materialized_view`. ```python -import mlflow -import databricks.declarative_pipelines as dp - -# Construct ONCE, at module scope mlflow.set_registry_uri("databricks-uc") predict_udf = mlflow.pyfunc.spark_udf( spark, - f"models:/{MODEL_NAME}@champion", + "models:/catalog.schema.model@champion", result_type="double", ) - -@dp.materialized_view -def gold_forecast(): - return spark.read.table("silver_features").withColumn( - "prediction", - predict_udf("feat_a", "feat_b", "feat_c"), - ) ``` -**Why it bites:** Lakeflow SDP may evaluate the function definition multiple times. Model deserialization is expensive — don't repeat it. +**Why:** Lakeflow SDP can evaluate dataset functions repeatedly; model loading belongs at module import time, not inside the dataset function body. --- -## 12. `mlflow[databricks]` extras missing when running outside Databricks +## 8. 
Missing `mlflow[databricks]` extras outside Databricks compute -**Symptom:** training + logging works; `register_model` fails with `MlflowException: Unable to import necessary dependencies to access model version files in Unity Catalog` — root cause `ModuleNotFoundError: No module named 'azure'` (for Azure-hosted workspaces) or `'boto3'` (AWS) / `'google.cloud'` (GCP). +**How it fails:** Loud. Local laptop / CI / non-Databricks jobs may train and log, then fail on UC registration with missing cloud SDK imports such as `azure`, `boto3`, or `google.cloud`. -**Fix:** install the `databricks` extras, which pull cloud-storage SDKs MLflow needs to stage artifacts into the UC-managed location. +**Fix:** ```bash pip install 'mlflow[databricks]' -# or, for a lighter install: +# or pip install 'mlflow-skinny[databricks]' ``` -**Why it bites:** plain `pip install mlflow` leaves out the cloud-provider SDKs because they're large and most local workflows don't need them. UC registration REQUIRES them because the registry stages artifacts into cloud-managed storage (Azure ADLS / S3 / GCS), and MLflow uses the provider's SDK for the upload. Local `log_model` works fine (artifacts go to the tracking server); registration doesn't. - -**When it most commonly hits:** running training scripts from a laptop, CI runner, or non-Databricks compute — anywhere that isn't a Databricks cluster (which ships the extras pre-installed). +**Why:** UC registration stages artifacts through cloud-managed storage; the Databricks extras include the provider SDKs that plain `mlflow` may omit. --- -## 13. `artifact_path=` parameter is deprecated; new name is `name=` +## 9. Using deprecated `artifact_path=` instead of `name=` -**Symptom:** warning in logs: `WARNING mlflow.models.model: `artifact_path` is deprecated. Please use `name` instead.` Still works today; may break in a future MLflow major version. +**How it fails:** Noisy now, possibly loud later. 
Newer MLflow warns that `artifact_path` is deprecated; future major versions may remove it. -**Fix:** use `name=` instead of `artifact_path=` in `log_model` calls. +**Fix:** prefer: ```python -# OLD (still works, warns) -mlflow.sklearn.log_model(sk_model=model, artifact_path="model", ...) - -# NEW (preferred, no warning) -mlflow.sklearn.log_model(sk_model=model, name="model", ...) -``` - -**Why it bites:** most online tutorials and training courses still use `artifact_path`. The rename shipped in MLflow 2.16. `name=` semantics are identical — still the within-run artifact folder. The rename aliases the old parameter to the preferred name; it doesn't change what the parameter represents. - ---- - -## 14. Custom preprocessing not captured in the logged model - -**Symptom:** in the training notebook, predictions are accurate. After `pyfunc.load_model(...)`, predictions are garbage. The pipeline works in training because you're calling `scaler.transform()` manually; at inference time, nobody calls the scaler. - -**Fix:** wrap preprocessing + model in an `sklearn.pipeline.Pipeline` (or a custom `PythonModel` for non-sklearn preprocessing). Log the whole pipeline. - -```python -from sklearn.pipeline import Pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.ensemble import GradientBoostingRegressor - -pipeline = Pipeline([ - ("scaler", StandardScaler()), - ("model", GradientBoostingRegressor()), -]) -pipeline.fit(X_train, y_train) - -# Logs both the fitted scaler AND the model as a single artifact mlflow.sklearn.log_model( - sk_model=pipeline, - artifact_path="model", - signature=infer_signature(X_train, pipeline.predict(X_train[:5])), - input_example=X_train.iloc[:5], + sk_model=model, + name="model", + signature=signature, + input_example=input_example, ) ``` - -**Why it bites:** the most painful post-registration bug. Training and inference code paths are different files; the divergence is invisible until predictions are obviously wrong.
+**Why:** MLflow renamed the within-run model artifact argument; the value still becomes the path used by `runs:/<run_id>/model`. diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md b/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md deleted file mode 100644 index ed4d86ae..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/patterns-batch-inference.md +++ /dev/null @@ -1,244 +0,0 @@ -# patterns-batch-inference - -Loading a UC-registered model and scoring features in batch. Two scales — interactive notebook (Pattern 1–2) and distributed Lakeflow pipeline (Patterns 3–4). Plus A/B validation (Pattern 5). - ---- - -## Pattern 1: Notebook batch inference — pandas path - -For interactive exploration, ad-hoc scoring, and sample sizes up to ~10k rows. - -```python -import mlflow - -mlflow.set_registry_uri("databricks-uc") - -model = mlflow.pyfunc.load_model( - "models:/my_catalog.my_schema.grocery_forecaster@champion" -) - -# Load a sample of features (LIMIT in SQL to avoid loading full table) -features = ( - spark.table("my_catalog.my_schema.silver_features") - .orderBy("month_date") - .limit(1000) - .toPandas() -) - -# The model's signature determines which columns it expects -feature_cols = model.metadata.get_input_schema().input_names() - -predictions = model.predict(features[feature_cols]) - -# Attach predictions for display/export -features["prediction"] = predictions -display(spark.createDataFrame(features)) -``` - ---- - -## Pattern 2: Notebook batch inference with chart - -Same pattern, adds a predicted-vs-actual visual. Useful as a demo artifact.
- -```python -import matplotlib.pyplot as plt - -# (continuing from Pattern 1) -features_with_pred = features.sort_values("month_date") - -fig, ax = plt.subplots(figsize=(10, 5)) -ax.plot(features_with_pred["month_date"], features_with_pred["actual"], - label="Actual", linewidth=2) -ax.plot(features_with_pred["month_date"], features_with_pred["prediction"], - label="Predicted", linestyle="--", linewidth=2) -ax.set_xlabel("Month") -ax.set_ylabel("Turnover (millions)") -ax.set_title(f"Forecast — {model.metadata.run_id[:8]}") -ax.legend() -plt.xticks(rotation=45) -plt.tight_layout() -display(fig) -``` - ---- - -## Pattern 3: Lakeflow SDP batch via `spark_udf` - -For scheduled batch inference at scale. Distributes across Spark executors — no per-row Python overhead, no serving endpoint. - -```python -# src/gold/gold_forecast.py -import mlflow -import databricks.declarative_pipelines as dp - -# Construct the UDF ONCE at module scope — see GOTCHAS #11 -mlflow.set_registry_uri("databricks-uc") - -MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" -predict_udf = mlflow.pyfunc.spark_udf( - spark, - model_uri=f"models:/{MODEL_NAME}@champion", - result_type="double", - env_manager="local", # "local" avoids conda/virtualenv setup overhead -) - -@dp.materialized_view( - comment="Grocery turnover forecast from @champion model", -) -def gold_forecast(): - return ( - spark.read.table("my_catalog.my_schema.silver_features") - .withColumn( - "forecast_turnover_millions", - predict_udf( - "turnover_lag_1", - "turnover_lag_12", - "rolling_3m_avg", - "state_share_of_national", - # ... 
pass each signature input column in the order the signature declares - ), - ) - ) ``` - -**What this gives you:** -- A `gold_forecast` table that refreshes on every pipeline run -- Distributed scoring (no serving endpoint, no auth token) -- Full UC lineage: `silver_features` → `gold_forecast` via `grocery_forecaster@champion` -- Genie can query it: *"what's the forecast for each state next month?"* - ---- - -## Pattern 4: `spark_udf` with `result_type` for multi-output models - -Multi-output regressors or classifiers need a richer result type. - -```python -from pyspark.sql.types import ArrayType, DoubleType, StringType, StructType, StructField - -# Multi-output regression — model returns 2 predictions per row -predict_udf = mlflow.pyfunc.spark_udf( - spark, - model_uri=f"models:/{MODEL_NAME}@champion", - result_type=ArrayType(DoubleType()), -) - -# Classifier with probabilities -predict_udf = mlflow.pyfunc.spark_udf( - spark, - model_uri=f"models:/{MODEL_NAME}@champion", - result_type=StructType([ - StructField("class", StringType(), True), - StructField("confidence", DoubleType(), True), - ]), -) -``` - ---- - -## Pattern 5: A/B validation — compare `@challenger` vs `@champion` - -Run both models on a validation set, compare error metrics, decide whether to promote.
- -```python -import mlflow -from sklearn.metrics import mean_absolute_error, root_mean_squared_error - -mlflow.set_registry_uri("databricks-uc") -MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" - -champion = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion") -challenger = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@challenger") - -# Hold-out validation set (not seen during training) -validation = spark.table(f"{MODEL_NAME.rsplit('.', 1)[0]}.validation_features").toPandas() -feature_cols = champion.metadata.get_input_schema().input_names() -actuals = validation["turnover_millions"] - -champion_preds = champion.predict(validation[feature_cols]) -challenger_preds = challenger.predict(validation[feature_cols]) - -print(f"Champion RMSE: {root_mean_squared_error(actuals, champion_preds):.2f}") -print(f"Challenger RMSE: {root_mean_squared_error(actuals, challenger_preds):.2f}") -print(f"Champion MAE: {mean_absolute_error(actuals, champion_preds):.2f}") -print(f"Challenger MAE: {mean_absolute_error(actuals, challenger_preds):.2f}") - -# Decision logic — promote if challenger beats champion by >2% -if root_mean_squared_error(actuals, challenger_preds) < root_mean_squared_error(actuals, champion_preds) * 0.98: - print("→ Promote @challenger. See patterns-uc-registration.md Pattern 5.") -else: - print("→ Keep @champion. Delete @challenger.") -``` - ---- - -## Pattern 6: Structured streaming inference - -For models scoring events as they arrive (not batch-scheduled). 
- -```python -from pyspark.sql.functions import col - -predict_udf = mlflow.pyfunc.spark_udf( - spark, - model_uri=f"models:/{MODEL_NAME}@champion", - result_type="double", -) - -events = ( - spark.readStream - .format("delta") - .table("my_catalog.my_schema.silver_events") -) - -scored = events.withColumn( - "prediction", - predict_udf(*[col(c) for c in feature_cols]), -) - -( - scored.writeStream - .format("delta") - .outputMode("append") - .option("checkpointLocation", "dbfs:/Volumes/my_catalog/my_schema/checkpoints/scoring") - .toTable("my_catalog.my_schema.gold_scored_events") -) -``` - -For most classic-ML batch use cases, Pattern 3 (Lakeflow SDP) is simpler. Use streaming only when event-time scoring matters. - ---- - -## What NOT to do for batch inference - -### Do not use `ai_query` for custom UC models - -`ai_query('<endpoint_name>', <request>)` requires the model to be deployed as a **Model Serving endpoint**. UC-registered models are NOT automatically behind an endpoint. Use `pyfunc.load_model` (Pattern 1) or `pyfunc.spark_udf` (Pattern 3) instead. - -`ai_query` IS the right call for: -- Foundation Model API endpoints: `ai_query('databricks-dbrx-instruct', prompt)` -- Model Serving endpoints you've explicitly provisioned - -See `GOTCHAS.md` #9. - -### Do not use `mlflow.pyfunc.load_model` for billion-row batches on a single node - -Pattern 1 collects to pandas — fine up to ~10k rows, painful beyond ~100k, impossible for millions. For distributed scale, use Pattern 3 (`spark_udf`). - -### Do not construct `spark_udf` inside the function body - -See `GOTCHAS.md` #11. Construct once at module scope, reuse inside `@dp.materialized_view` / `@dp.table`.
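Several rows in the troubleshooting table that follows trace back to a malformed model URI (two-level name, missing alias). A stdlib-only pre-flight sketch — the helper and its name are hypothetical, not an MLflow API — that validates the `models:/catalog.schema.model@alias` form before `load_model` is ever called:

```python
import re

# Hypothetical pre-flight helper — NOT an MLflow API. Validates the alias-based
# UC model URI form used throughout these patterns before load_model is called.
_UC_MODEL_URI = re.compile(
    r"^models:/"
    r"(?P<catalog>[\w-]+)\.(?P<schema>[\w-]+)\.(?P<model>[\w-]+)"
    r"@(?P<alias>[\w-]+)$"
)

def check_uc_model_uri(uri: str) -> dict:
    """Return the URI parts, or raise ValueError with a pointed message."""
    match = _UC_MODEL_URI.match(uri)
    if match is None:
        if uri.startswith("models:/") and uri.count(".") < 2:
            # The classic two-level-name mistake (see GOTCHAS.md #1/#2)
            raise ValueError(f"{uri!r} is not a three-level catalog.schema.model name")
        raise ValueError(f"{uri!r} does not match models:/catalog.schema.model@alias")
    return match.groupdict()
```

`check_uc_model_uri("models:/my_catalog.my_schema.grocery_forecaster@champion")` returns the four parts; a workspace-registry-style two-level name fails fast here instead of surfacing later as `RESOURCE_DOES_NOT_EXIST` at load time.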
- ---- - -## Troubleshooting batch inference - -| Error | Cause | Fix | -|-------|-------|-----| -| `RESOURCE_DOES_NOT_EXIST` on load | Wrong registry URI or two-level name | `GOTCHAS.md` #1, #2 | -| Predictions are NaN | Input columns in wrong order | Pass columns in the order `model.metadata.get_input_schema().input_names()` declares | -| `PERMISSION_DENIED: EXECUTE ON MODEL` | No read access to model | `GRANT EXECUTE ON MODEL ... TO <principal>` | -| `spark_udf` raises `PicklingError` | Model has un-picklable state (e.g., Spark session) | Re-train ensuring the model is pure Python/numpy — don't capture `spark` at training time | -| Pipeline hangs on `gold_forecast` | Model artifact is large; first load is slow | Normal — subsequent runs are fast (UDF is cached per executor) | -| Column type mismatch in Spark | UDF expects double; column is int/string | Cast explicitly: `col("feature").cast("double")` | diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md b/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md deleted file mode 100644 index 00c6e2ba..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/patterns-experiment-setup.md +++ /dev/null @@ -1,141 +0,0 @@ -# patterns-experiment-setup - -Experiments in UC-enforced workspaces need more setup than older MLflow guides show. The critical change: you must pin the experiment's `artifact_location` to a Unity Catalog volume, or `log_model` will fail with storage errors.
- ---- - -## Pattern 1: Create experiment with UC volume artifact_location - -```python -import mlflow - -mlflow.set_registry_uri("databricks-uc") # always first - -# Prerequisite: the UC volume must exist -# CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts; - -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", -) -``` - -**Why both are required:** -- `experiment_name` — the workspace-visible path (browsable from the Experiments UI) -- `artifact_location` — where logged artifacts (model binaries, plots, datasets) physically live - -In older workspaces, `artifact_location` defaulted to DBFS root. UC-enforced workspaces reject DBFS root writes, so `log_model` fails with opaque errors like: - -``` -MlflowException: API request to endpoint /api/2.0/mlflow/runs/log-artifact failed -with error code 403 != 200. Response body: PERMISSION_DENIED ... -``` - -Pointing at a UC volume resolves this AND makes artifacts first-class-governed under UC lineage. - ---- - -## Pattern 2: Create the volume if it doesn't exist (idempotent) - -Run once per schema, before any experiment creation: - -```python -spark.sql(f""" - CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts - COMMENT 'MLflow experiment artifacts for forecasting models' -""") -``` - -Or via SQL editor: - -```sql -CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts; -``` - -**Permissions needed:** `USE SCHEMA` + `CREATE VOLUME`. If missing, request `CREATE VOLUME ON SCHEMA my_catalog.my_schema` from the schema owner. - ---- - -## Pattern 3: Experiment already exists, wrong `artifact_location` - -You can't retroactively change `artifact_location`. 
Three options, in order of preference: - -**Option A — New experiment** (cleanest, keeps old runs intact): -```python -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting_v2", # v2 suffix - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting_v2", -) -# New runs land in v2. Old runs stay in v1 (archive them if you like). -``` - -**Option B — Delete + recreate** (loses history; use only if no good runs exist): -```python -from mlflow import MlflowClient -client = MlflowClient() - -exp = client.get_experiment_by_name("/Users/me@company.com/forecasting") -client.delete_experiment(exp.experiment_id) - -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", -) -``` - -**Option C — Manual relocation of DBFS artifacts to UC volume**: do not do this. Storage paths are resolved at log time and encoded in the run's metadata; moving files doesn't update the pointers. - ---- - -## Pattern 4: Verify experiment is correctly configured - -After setup, before training: - -```python -exp = mlflow.get_experiment_by_name("/Users/me@company.com/forecasting") -assert exp is not None, "Experiment not created" -assert exp.artifact_location.startswith("dbfs:/Volumes/"), ( - f"artifact_location is not a UC volume: {exp.artifact_location}" -) -print(f"Experiment ID: {exp.experiment_id}") -print(f"Artifact location: {exp.artifact_location}") -``` - -If the assert fails, you have an old experiment pointing at DBFS root. Apply Pattern 3. 
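The Pattern 4 assertion above generalizes to a reusable check that setup scripts can run before any training job. A stdlib-only sketch (the helper name is hypothetical):

```python
# Hypothetical helper — mirrors the Pattern 4 assert. Classifies an experiment's
# artifact_location so setup scripts fail fast on DBFS-root experiments.
def is_uc_volume_location(artifact_location: str) -> bool:
    """True only for UC volume paths: dbfs:/Volumes/<catalog>/<schema>/<volume>/..."""
    prefix = "dbfs:/Volumes/"
    if not artifact_location.startswith(prefix):
        return False
    # Require at least catalog/schema/volume components after the prefix
    parts = artifact_location[len(prefix):].strip("/").split("/")
    return len(parts) >= 3 and all(parts[:3])
```

`is_uc_volume_location(exp.artifact_location)` returns `False` for legacy DBFS-root locations, so Pattern 3 can be applied before any run is logged.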
- ---- - -## Pattern 5: Workspace-path vs Repo-path experiments - -MLflow accepts two conventions for `experiment_name`: - -```python -# Workspace-path convention (recommended for collaborative experiments) -mlflow.set_experiment(experiment_name="/Users/me@company.com/forecasting") - -# Repo-path convention (only if you're running from a Git folder) -mlflow.set_experiment(experiment_name="/Repos/me@company.com/my-repo/forecasting") -``` - -**Prefer workspace path** for experiments shared across pairs/teams. Repo-path experiments become orphans when the repo is deleted. - -**Both need `artifact_location` pointing at a UC volume.** The path convention only affects where the experiment metadata is browsable, not where artifacts live. - ---- - -## Pattern 6: Running from a notebook cell with autoselected experiment - -Databricks notebooks auto-associate runs with an experiment matching the notebook's workspace path: - -```python -# In a notebook at /Users/me@company.com/Notebooks/train.py -# Databricks will auto-set experiment_name to the notebook path -# BUT the default artifact_location is still DBFS root — you still need to override: - -mlflow.set_experiment( - experiment_name="/Users/me@company.com/Notebooks/train", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/train", -) -``` - -Or call `set_experiment` explicitly before the first `start_run` — the artifact_location fix must be applied regardless of notebook auto-association. diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-training.md b/databricks-skills/databricks-mlflow-ml/references/patterns-training.md deleted file mode 100644 index 017e3cfb..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/patterns-training.md +++ /dev/null @@ -1,205 +0,0 @@ -# patterns-training - -How to log classic ML models (sklearn / XGBoost / PyTorch) so they register cleanly and load correctly downstream. The two load-bearing decisions: `signature` and `input_example`. 
- ---- - -## Pattern 1: Baseline sklearn training loop - -```python -import mlflow -import mlflow.sklearn -from sklearn.ensemble import GradientBoostingRegressor -from sklearn.metrics import root_mean_squared_error, mean_absolute_error -from sklearn.model_selection import train_test_split -from mlflow.models import infer_signature - -mlflow.set_registry_uri("databricks-uc") -mlflow.set_experiment( - experiment_name="/Users/me@company.com/forecasting", - artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", -) - -X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2) - -with mlflow.start_run(run_name="gbr_baseline"): - model = GradientBoostingRegressor(n_estimators=100, max_depth=3) - model.fit(X_train, y_train) - - # Signature + input_example are both load-bearing - signature = infer_signature(X_train, model.predict(X_train[:5])) - - mlflow.sklearn.log_model( - sk_model=model, - artifact_path="model", - signature=signature, - input_example=X_train.iloc[:5], - ) - - # Log everything needed to reproduce - mlflow.log_params({"n_estimators": 100, "max_depth": 3}) - predictions = model.predict(X_test) - mlflow.log_metrics({ - "rmse": root_mean_squared_error(y_test, predictions), - "mae": mean_absolute_error(y_test, predictions), - }) -``` - ---- - -## Pattern 2: Preprocessing + model as a Pipeline - -Always log preprocessing alongside the model. See `GOTCHAS.md` #12 — inference-time preprocessing drift is the most painful post-registration bug. 
- -```python -from sklearn.pipeline import Pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.compose import ColumnTransformer - -numeric_features = ["turnover_lag_1", "turnover_lag_12", "rolling_3m_avg"] -categorical_features = ["state", "industry"] - -preprocessor = ColumnTransformer([ - ("num", StandardScaler(), numeric_features), - ("cat", "passthrough", categorical_features), # handle in the model if needed -]) - -pipeline = Pipeline([ - ("preprocessor", preprocessor), - ("model", GradientBoostingRegressor(n_estimators=100)), -]) - -with mlflow.start_run(): - pipeline.fit(X_train, y_train) - - signature = infer_signature(X_train, pipeline.predict(X_train[:5])) - mlflow.sklearn.log_model( - sk_model=pipeline, # logs both preprocessor AND model as one artifact - artifact_path="model", - signature=signature, - input_example=X_train.iloc[:5], - ) -``` - -At inference time, callers never need to know about `StandardScaler` — they pass raw features, `pyfunc.load_model` dispatches through the pipeline. - ---- - -## Pattern 3: XGBoost / PyTorch — same interface, different flavor - -```python -# XGBoost -import mlflow.xgboost -import xgboost as xgb - -model = xgb.XGBRegressor(n_estimators=100, max_depth=3) -model.fit(X_train, y_train) - -with mlflow.start_run(): - mlflow.xgboost.log_model( - xgb_model=model, - artifact_path="model", - signature=infer_signature(X_train, model.predict(X_train[:5])), - input_example=X_train.iloc[:5], - ) - -# PyTorch -import mlflow.pytorch -import torch - -class Forecaster(torch.nn.Module): - ... - -model = Forecaster() -# ... training loop ... 
- -with mlflow.start_run(): - # For PyTorch, input_example must be a tensor or numpy array - example = X_train.iloc[:5].to_numpy() - mlflow.pytorch.log_model( - pytorch_model=model, - artifact_path="model", - signature=infer_signature(example, model(torch.tensor(example, dtype=torch.float32)).detach().numpy()), - input_example=example, - ) -``` - ---- - -## Pattern 4: Retraining — same experiment, new run - -Retraining for an A/B test or a scheduled refresh. Log to the same experiment; register as a new version in Workflow 2. - -```python -with mlflow.start_run(run_name="gbr_v2_with_seasonality") as run: - model = GradientBoostingRegressor(n_estimators=200, max_depth=4) - model.fit(X_train_with_seasonality, y_train) - - mlflow.sklearn.log_model( - sk_model=model, - artifact_path="model", - signature=infer_signature(X_train_with_seasonality, - model.predict(X_train_with_seasonality[:5])), - input_example=X_train_with_seasonality.iloc[:5], - ) - # Remember the run_id for the register step - print(f"New run: {run.info.run_id}") -``` - ---- - -## Pattern 5: Autologging (quick path for iteration) - -Autologging wraps `fit()` and logs params + metrics + model automatically. Convenient during experimentation; less explicit than manual logging. - -```python -mlflow.sklearn.autolog( - log_models=True, - log_input_examples=True, # IMPORTANT — otherwise no input_example is captured - log_model_signatures=True, # IMPORTANT — otherwise no signature is captured - silent=False, -) - -# Any subsequent fit() call auto-logs -model = GradientBoostingRegressor(n_estimators=100) -model.fit(X_train, y_train) -# Autolog handled the MLflow calls -``` - -**Caveat:** autologging infers signature + input_example heuristically. For production runs, prefer manual logging (Pattern 1) — you control what gets captured.
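`search_runs` pushes the filter and ordering to the tracking server; as a mental model, the selection in Pattern 6 below is equivalent to this pure-Python sketch (the rows are hypothetical stand-ins for `search_runs` output):

```python
# Pure-Python equivalent of the Pattern 6 search (illustrative only):
# keep runs under the RMSE threshold whose name matches, then take the best.
runs = [  # hypothetical rows mimicking mlflow.search_runs output
    {"run_id": "abc123", "metrics.rmse": 92.4, "tags.mlflow.runName": "gbr_baseline"},
    {"run_id": "def456", "metrics.rmse": 88.1, "tags.mlflow.runName": "gbr_v2_with_seasonality"},
    {"run_id": "zzz999", "metrics.rmse": 140.0, "tags.mlflow.runName": "gbr_overfit"},
]

candidates = [
    r for r in runs
    if r["metrics.rmse"] < 100 and r["tags.mlflow.runName"].startswith("gbr_")
]
best = min(candidates, key=lambda r: r["metrics.rmse"])
```

Here `best["run_id"]` is `"def456"`; the real call expresses the same predicate via `filter_string` and `order_by`.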
- ---- - -## Pattern 6: Searching runs to pick the best one for registration - -Before registering, you typically want the best run from an experiment: - -```python -runs = mlflow.search_runs( - experiment_names=["/Users/me@company.com/forecasting"], - filter_string="metrics.rmse < 100 AND tags.`mlflow.runName` LIKE 'gbr_%'", - order_by=["metrics.rmse ASC"], - max_results=1, -) - -if runs.empty: - raise RuntimeError("No runs match criteria") - -best_run_id = runs.iloc[0]["run_id"] -best_rmse = runs.iloc[0]["metrics.rmse"] -print(f"Best run: {best_run_id} (RMSE={best_rmse:.2f})") - -# Now register this run's model — see patterns-uc-registration.md Pattern 1 -``` - ---- - -## Common logging mistakes - -| Mistake | Effect | Fix | -|---------|--------|-----| -| No `signature` | `pyfunc.load_model` works, but `.predict()` coerces wrong | Always call `infer_signature(X_train, y_hat[:5])` | -| No `input_example` | `pyfunc.load_model` can't introspect input schema | Pass `X_train.iloc[:5]` (or `.to_numpy()[:5]` for non-pandas) | -| `artifact_path` changes between logs | Same model name → different paths → broken load URIs | Always use `artifact_path="model"` | -| Log preprocessing separately | Inference callers must reapply preprocessing manually | Wrap in a sklearn `Pipeline` and log the pipeline | -| Use `pickle.dump` directly | Loses MLflow's flavor dispatch | Always use `mlflow.<flavor>.log_model` | diff --git a/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md b/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md deleted file mode 100644 index 4d8929ed..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/patterns-uc-registration.md +++ /dev/null @@ -1,232 +0,0 @@ -# patterns-uc-registration - -Register a logged model to Unity Catalog, set aliases, verify, and handle promotion / rollback. - ---- - -## Pattern 1: Explicit register from a specific run - -Cleanest workflow.
Train (separate step) → pick best run → register. - -```python -import mlflow -from mlflow import MlflowClient - -mlflow.set_registry_uri("databricks-uc") - -MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" - -# run_id from a specific training run (see patterns-training.md Pattern 6) -run_id = "abc123def456" - -result = mlflow.register_model( - model_uri=f"runs:/{run_id}/model", - name=MODEL_NAME, - tags={ - "trained_by": "forecasting_team", - "dataset_version": "2024-Q4", - }, -) -print(f"Registered {MODEL_NAME} version {result.version}") -``` - -`result` is a `ModelVersion` object: -- `result.name` — fully qualified three-level name -- `result.version` — the new version (string, e.g., `"3"`) -- `result.status` — should be `"READY"` by the time this returns - ---- - -## Pattern 2: Log-and-register in one call - -Shorter but couples logging and registration. Use when you *know* the current run is the one worth registering. - -```python -with mlflow.start_run(): - model.fit(X_train, y_train) - mlflow.sklearn.log_model( - sk_model=model, - artifact_path="model", - signature=infer_signature(X_train, model.predict(X_train[:5])), - input_example=X_train.iloc[:5], - registered_model_name="my_catalog.my_schema.grocery_forecaster", - ) - # Model is registered as a new version; you still need to set alias separately. -``` - -**Still need a separate alias call** — `log_model` doesn't set aliases. - ---- - -## Pattern 3: Set aliases (`@champion`, `@challenger`) - -Aliases decouple the loader from the version. Moving `@champion` to a new version silently updates every `models:/...@champion` loader. - -```python -from mlflow import MlflowClient -client = MlflowClient() - -# Set or move an alias -client.set_registered_model_alias( - name="my_catalog.my_schema.grocery_forecaster", - alias="champion", - version=result.version, -) -``` - -**Conventions:** -- `@champion` — the current production winner. Exactly one version at a time. 
-- `@challenger` — a candidate under evaluation. Exactly one at a time. -- Custom aliases — free-form, e.g., `@pair_team_07`, `@nightly`, `@reviewed`. - -**Read existing aliases:** -```python -model = client.get_registered_model("my_catalog.my_schema.grocery_forecaster") -print(model.aliases) # e.g., {"champion": "3", "challenger": "4"} -``` - -**Delete an alias:** -```python -client.delete_registered_model_alias( - name="my_catalog.my_schema.grocery_forecaster", - alias="challenger", -) -``` - ---- - -## Pattern 4: Verify registration (Navigator's V-step) - -Don't trust `register_model`'s success message alone. See `GOTCHAS.md` #5. - -### Via SQL - -```sql -DESCRIBE MODEL my_catalog.my_schema.grocery_forecaster; -``` - -Expected output includes the model metadata and (if set) aliases. If the result is "table or view not found," the model didn't register to UC — check `set_registry_uri` (GOTCHAS #1). - -### Via Catalog Explorer UI - -1. Open Catalog Explorer -2. Navigate to `my_catalog` → `my_schema` → **Models** tab -3. Confirm `grocery_forecaster` appears with an `@champion` badge - -If the model appears under the workspace MLflow icon instead (left sidebar, under MLflow), you registered to the workspace registry. See GOTCHAS #1. - -### Via Python assertion (scriptable) - -```python -from mlflow import MlflowClient -client = MlflowClient() - -model = client.get_registered_model("my_catalog.my_schema.grocery_forecaster") - -# Three assertions that should always hold post-registration -assert model is not None, "Model not registered to UC" -assert len(model.latest_versions) > 0, "No versions exist" -assert "champion" in model.aliases, "@champion alias not set" -print(f"✓ {model.name} v{model.aliases['champion']} is @champion") -``` - ---- - -## Pattern 5: A/B promotion — swap `@challenger` to `@champion` - -You've trained a new version, registered it, and validated its predictions against the current champion. 
Now promote: - -```python -client = MlflowClient() -MODEL_NAME = "my_catalog.my_schema.grocery_forecaster" - -# Get current state -model = client.get_registered_model(MODEL_NAME) -old_champion = model.aliases.get("champion") -new_champion = model.aliases.get("challenger") - -if new_champion is None: - raise RuntimeError("No @challenger set — nothing to promote") - -# Move the alias (atomic — downstream loaders see the switch on next load) -client.set_registered_model_alias(MODEL_NAME, "champion", new_champion) - -# Optional: archive the old champion version with a custom alias -if old_champion: - client.set_registered_model_alias(MODEL_NAME, f"archived_{old_champion}", old_champion) - -# Remove the @challenger alias -client.delete_registered_model_alias(MODEL_NAME, "challenger") - -print(f"Promoted v{new_champion} from @challenger to @champion (was v{old_champion})") -``` - -**Rollback** is the inverse — move `@champion` back to the previous version. - ---- - -## Pattern 6: List all model versions - -Useful for lineage inspection or cleanup. - -```sql -SHOW MODEL VERSIONS ON MODEL my_catalog.my_schema.grocery_forecaster; -``` - -Or via Python: -```python -from mlflow import MlflowClient -client = MlflowClient() - -versions = client.search_model_versions( - filter_string=f"name='my_catalog.my_schema.grocery_forecaster'", - order_by=["version_number DESC"], -) -for v in versions: - print(f"v{v.version}: run_id={v.run_id}, status={v.status}, aliases={v.aliases}") -``` - ---- - -## Pattern 7: Tags — richer metadata without new versions - -Tags are key-value metadata on the registered model (or a specific version). 
Useful for: -- Team ownership: `set_model_version_tag(name, "1", "team", "forecasting")` -- Dataset provenance: `set_model_version_tag(name, "1", "dataset_version", "2024-Q4")` -- Review status: `set_model_version_tag(name, "1", "reviewed", "true")` - -```python -from mlflow import MlflowClient -client = MlflowClient() - -# Tag on the registered model (applies to all versions) -client.set_registered_model_tag( - name="my_catalog.my_schema.grocery_forecaster", - key="domain", - value="retail", -) - -# Tag on a specific version -client.set_model_version_tag( - name="my_catalog.my_schema.grocery_forecaster", - version="3", - key="reviewed_by", - value="jane@company.com", -) -``` - -Tags are queryable via `search_model_versions(filter_string="tags.reviewed = 'true'")`. - ---- - -## Permission requirements - -| Operation | Permission needed | Granted via | -|-----------|-------------------|-------------| -| `register_model` (first version of a model) | `CREATE MODEL ON SCHEMA <schema>` | `GRANT CREATE MODEL ON SCHEMA ... TO ...` | -| `register_model` (new version of existing) | `EDIT ON MODEL <model>` | Automatic for model owner; otherwise grant | -| `set_registered_model_alias` | `EDIT ON MODEL <model>` | Same as above | -| `get_registered_model` / `DESCRIBE MODEL` | `USE CATALOG` + `USE SCHEMA` + `EXECUTE ON MODEL` | Standard read grants | -| `load_model` | `EXECUTE ON MODEL <model>` | `GRANT EXECUTE ON MODEL ... TO ...` | - -If any of these fail, request the specific grant from the schema owner. See `GOTCHAS.md` #7. diff --git a/databricks-skills/databricks-mlflow-ml/references/recipes.md b/databricks-skills/databricks-mlflow-ml/references/recipes.md new file mode 100644 index 00000000..db326fad --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/references/recipes.md @@ -0,0 +1,233 @@ +# UC-Specific Recipes + +These are code shapes, not full sklearn implementations. Use them to get Databricks / Unity Catalog arguments and ordering right. + +## 1.
Experiment + UC Volume Setup + +Do this before training if the workspace enforces Unity Catalog storage. + +- Set the registry URI every session: + ```python + mlflow.set_registry_uri("databricks-uc") + ``` +- Create the artifact volume once per schema: + ```sql + CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts; + ``` +- Create / select the experiment with a UC volume artifact location. Note that `artifact_location` can only be set at creation time via `mlflow.create_experiment`; `mlflow.set_experiment` does not accept it: + ```python + exp_name = "/Users/me@company.com/forecasting" + if mlflow.get_experiment_by_name(exp_name) is None: + mlflow.create_experiment( + exp_name, + artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", + ) + mlflow.set_experiment(exp_name) + ``` + +If the experiment already exists with a non-UC artifact location, create a new experiment path. Do not try to move MLflow artifacts manually; run metadata already points at the original location. + +## 2. Log → Register → Alias + +### Logging UC essentials + +When logging the model: + +- Include `signature=infer_signature(X_train, model.predict(X_train[:5]))`. +- Include `input_example=X_train.iloc[:5]` or equivalent real rows. +- Use `name="model"` for MLflow 3.x / newer code; `artifact_path="model"` is the older spelling. +- If preprocessing exists, log the whole pipeline / wrapper, not just the final estimator. + +Shape: + +```python +with mlflow.start_run() as run: + # train your estimator or pipeline here + mlflow.<flavor>.log_model( + <model_arg>=model_or_pipeline, + name="model", + signature=signature, + input_example=input_example, + ) +``` + +### Register + champion alias + +After training: + +```python +result = mlflow.register_model( + f"runs:/{run_id}/model", + "my_catalog.my_schema.my_model", +) +MlflowClient().set_registered_model_alias( + "my_catalog.my_schema.my_model", + "champion", + result.version, +) +``` + +`register_model` returns a `ModelVersion`; `result.version` is a string such as `"1"`. It does **not** set aliases — the alias call is separate and required.
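Both calls hinge on getting the name and URI shapes right, so it can help to build and sanity-check them in one place. A minimal pure-Python sketch — the helper name `uc_model_uris` is hypothetical, not an MLflow API:

```python
def uc_model_uris(run_id: str, model_name: str, alias: str = "champion") -> dict:
    """Build the source and loader URI shapes used above.

    Hypothetical helper: `model_name` must be a three-level UC name
    (catalog.schema.model), which is worth checking before registering.
    """
    if model_name.count(".") != 2:
        raise ValueError(f"not a three-level UC name: {model_name!r}")
    return {
        # passed to mlflow.register_model
        "source": f"runs:/{run_id}/model",
        # used by pyfunc.load_model / spark_udf
        "loader": f"models:/{model_name}@{alias}",
    }

uris = uc_model_uris("abc123def456", "my_catalog.my_schema.my_model")
# uris["source"] == "runs:/abc123def456/model"
# uris["loader"] == "models:/my_catalog.my_schema.my_model@champion"
```

Catching a two-level name here, before calling `register_model`, gives a clearer error than the registry's own failure mode.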
+ +### Tags syntax + +Tags can be set at registration time: + +```python +result = mlflow.register_model( + f"runs:/{run_id}/model", + MODEL_NAME, + tags={"dataset_version": "2024-Q4", "trained_by": "forecasting_team"}, +) +``` + +Or after registration: + +```python +client.set_registered_model_tag(MODEL_NAME, "domain", "retail") +client.set_model_version_tag(MODEL_NAME, result.version, "reviewed", "true") +``` + +### Minimal UC permission checklist + +| Operation | Required UC privilege | +|-----------|-----------------------| +| First registration of a model in a schema | `CREATE MODEL ON SCHEMA catalog.schema` | +| Registering a new version | `EDIT ON MODEL catalog.schema.model` | +| Setting aliases / tags | `EDIT ON MODEL catalog.schema.model` | +| Loading for inference | `EXECUTE ON MODEL catalog.schema.model` plus `USE CATALOG` / `USE SCHEMA` | + +## 3. Lakeflow SDP `spark_udf` Shape + +For Lakeflow SDP, create the UDF at module scope, not inside the decorated dataset function. + +```python +# src/gold/score_model.py +import mlflow +from pyspark import pipelines as dp + +mlflow.set_registry_uri("databricks-uc") + +MODEL_NAME = "my_catalog.my_schema.my_model" + +predict_udf = mlflow.pyfunc.spark_udf( + spark, + model_uri=f"models:/{MODEL_NAME}@champion", + result_type="double", + env_manager="local", +) + +@dp.materialized_view +def gold_predictions(): + return ( + spark.read.table("my_catalog.my_schema.silver_features") + .withColumn( + "prediction", + predict_udf("feature_a", "feature_b", "feature_c"), + ) + ) +``` + +Pass feature columns in the order expected by the model signature.
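Column-order mistakes fail silently here — the UDF simply sees misaligned features — so deriving the order from the logged signature is safer than hard-coding it. A minimal pure-Python sketch; the helper name is hypothetical, and `expected` would come from the logged signature (e.g. `get_model_info(...).signature.inputs.input_names()`):

```python
def signature_ordered_cols(expected, available):
    """Return the model's feature columns in signature order, failing loudly
    if the source table is missing any of them. Extra table columns are
    simply not selected."""
    missing = [c for c in expected if c not in set(available)]
    if missing:
        raise ValueError(f"source table is missing model features: {missing}")
    return list(expected)

cols = signature_ordered_cols(
    ["feature_a", "feature_b", "feature_c"],
    ["feature_c", "feature_a", "feature_b", "extra_col"],
)
# cols == ["feature_a", "feature_b", "feature_c"] — safe to pass as predict_udf(*cols)
```

The point of the loud failure is to surface a schema drift at pipeline start rather than as wrong predictions downstream.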
+ +`result_type` shapes: + +| Model output | `result_type` | +|--------------|---------------| +| Single numeric prediction | `"double"` | +| Integer class id | `"long"` | +| String class label | `"string"` | +| Multi-output numeric vector | `"array<double>"` | +| Named outputs | `StructType([...])` | + +Do not use `ai_query` here unless you have explicitly deployed a Model Serving endpoint. + +## 4. A/B Promotion Alias Swap + +This order is intentional: delete old `@champion` before setting the new one. Otherwise, during a botched sequence or retry, the pre-existing alias can still point consumers at the wrong version. + +```python +from mlflow import MlflowClient + +client = MlflowClient() +MODEL_NAME = "my_catalog.my_schema.my_model" + +model = client.get_registered_model(MODEL_NAME) +old_champion = model.aliases.get("champion") +new_champion = model.aliases.get("challenger") + +if new_champion is None: + raise RuntimeError("No @challenger alias set; nothing to promote") + +# Optional: preserve an explicit rollback handle before moving champion. +if old_champion: + client.set_registered_model_alias( + MODEL_NAME, + f"archived_{old_champion}", + old_champion, + ) + +# Required order: remove old champion, then set new champion. +if old_champion: + client.delete_registered_model_alias(MODEL_NAME, "champion") + +client.set_registered_model_alias(MODEL_NAME, "champion", new_champion) + +# Remove challenger after it has become champion. +client.delete_registered_model_alias(MODEL_NAME, "challenger") +``` + +Downstream code using `models:/my_catalog.my_schema.my_model@champion` picks up the new version on next load. No loader code changes. + +Rollback shape: + +```python +client.delete_registered_model_alias(MODEL_NAME, "champion") +client.set_registered_model_alias(MODEL_NAME, "champion", old_champion) +``` + +## 5.
Verification One-Liners + +### SQL + +```sql +DESCRIBE MODEL my_catalog.my_schema.my_model; +SHOW MODEL VERSIONS ON MODEL my_catalog.my_schema.my_model; +SHOW GRANTS ON MODEL my_catalog.my_schema.my_model; +SHOW GRANTS ON SCHEMA my_catalog.my_schema; +``` + +If `DESCRIBE MODEL` cannot find it but `register_model` succeeded, suspect the workspace-registry trap: missing `mlflow.set_registry_uri("databricks-uc")`. + +### Alias dictionary shape + +```python +model = MlflowClient().get_registered_model("my_catalog.my_schema.my_model") +model.aliases +# Expected shape: {"champion": "3", "challenger": "4"} +``` + +Use this to confirm that `@champion` exists and points at the version you intended. + +### Signature debugging + +```python +from mlflow.models import get_model_info + +info = get_model_info("models:/my_catalog.my_schema.my_model@champion") +info.signature +info.flavors +``` + +If `info.signature` is missing or does not match the DataFrame columns you pass to `predict`, re-log the model with a signature and input example. + +### Load URI sanity check + +```python +mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model@champion") +``` + +Correct URI shape is: + +```text +models:/<catalog>.<schema>.<model>@<alias> +``` + +Avoid version-pinned loaders such as `models:/catalog.schema.model/3` unless you are doing forensic debugging. diff --git a/databricks-skills/databricks-mlflow-ml/references/user-journeys.md b/databricks-skills/databricks-mlflow-ml/references/user-journeys.md deleted file mode 100644 index a72f9106..00000000 --- a/databricks-skills/databricks-mlflow-ml/references/user-journeys.md +++ /dev/null @@ -1,195 +0,0 @@ -# user-journeys - -End-to-end workflows with decision points. Read the journey that matches your situation. - ---- - -## Journey 1: First model (train → register → score) — the 90%-case - -Most users arrive here. Goal: a UC-registered model with a `@champion` alias, producing batch predictions.
- -**Prerequisites:** -- UC catalog + schema where you have `CREATE MODEL` permission -- A UC volume for MLflow artifacts (create if missing — `patterns-experiment-setup.md` Pattern 2) -- Features in a Spark table (Bronze → Silver → Gold already done) - -**Steps:** - -1. **Set up the experiment** (`patterns-experiment-setup.md` Pattern 1) - - `mlflow.set_registry_uri("databricks-uc")` - - `mlflow.create_experiment(name=..., artifact_location=<UC volume path>)` if it doesn't exist, then `mlflow.set_experiment(...)` -2. **Train + log** (`patterns-training.md` Pattern 1 or 2) - - Always include `signature` and `input_example` - - If you have preprocessing, wrap in `sklearn.Pipeline` (Pattern 2) -3. **Register** (`patterns-uc-registration.md` Pattern 1) - - `mlflow.register_model(f"runs:/{run_id}/model", "catalog.schema.model")` -4. **Set alias** (`patterns-uc-registration.md` Pattern 3) - - `client.set_registered_model_alias(name, "champion", version)` -5. **Verify** (`patterns-uc-registration.md` Pattern 4) - - `DESCRIBE MODEL catalog.schema.model` OR Catalog Explorer UI -6. **Load + score** (`patterns-batch-inference.md` Pattern 1 or 2) - - `model = mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")` - - `model.predict(features_df)` - -**Done.** You have a UC-registered model with a canonical loading URI that downstream code can depend on. - ---- - -## Journey 2: Retrain + promote (A/B) - -You already have `@champion`. You trained a new version and want to decide whether to promote it. - -**Prerequisites:** -- Model exists in UC with `@champion` set (you did Journey 1) -- New training run logged to the same experiment - -**Steps:** - -1. **Register new version** (`patterns-uc-registration.md` Pattern 1) - - Same `MODEL_NAME` as before — UC auto-increments version -2. **Set `@challenger`** (`patterns-uc-registration.md` Pattern 3) - - `client.set_registered_model_alias(name, "challenger", new_version)` -3.
**A/B validate** (`patterns-batch-inference.md` Pattern 5) - - Load both aliases, score validation set, compare metrics -4. **Decide**: - - Challenger wins → **Pattern 5 in `patterns-uc-registration.md`**: swap aliases - - Champion wins → delete `@challenger` alias, keep current `@champion` -5. **Verify** downstream loaders picked up the new version (after swap) - - Any code using `models:/<model_name>@champion` will see the new version on next load - ---- - -## Journey 3: Lakeflow SDP batch pipeline - -You want predictions to land in a scheduled gold table, not an ad-hoc notebook. - -**Prerequisites:** -- Model registered with `@champion` (Journey 1 complete) -- Lakeflow SDP pipeline defined (one already running is ideal) - -**Steps:** - -1. **Add a new file** to the pipeline source: `src/gold/gold_forecast.py` -2. **Construct the UDF at module scope** (`patterns-batch-inference.md` Pattern 3) - - `mlflow.set_registry_uri("databricks-uc")` - - `predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/...@champion", result_type="double")` -3. **Define the `@dp.materialized_view`** that reads silver features, applies the UDF -4. **Deploy + run** the pipeline - - `databricks bundle deploy && databricks bundle run <pipeline_name>` -5. **Verify** the `gold_forecast` table materializes - - Row count matches `silver_features` - - Query from Genie or SQL editor - -**Do NOT use `ai_query`** in this pipeline — see `GOTCHAS.md` #9. - ---- - -## Journey 4: Debug a registration that went to workspace registry - -The #1 support question. Symptoms: model doesn't appear in Catalog Explorer; URL contains `/ml/models/` instead of `/explore/data/models/`. - -**Steps:** - -1. Confirm the diagnosis: - - Catalog Explorer → catalog → schema → Models tab: **missing** - - MLflow icon (left sidebar) → Models: **present** - - That's the workspace registry, not UC -2. Verify registry URI in the training session - - `mlflow.get_registry_uri()` — should return `"databricks-uc"`, not a workspace URI -3.
If the URI was wrong, fix it and re-register: - - Add `mlflow.set_registry_uri("databricks-uc")` at the top of the training code - - Re-run `mlflow.register_model(...)` — this creates a new entry in UC - - The orphaned workspace-registry entry can be deleted via MLflow UI (optional) -4. Set the `@champion` alias on the new UC version -5. Verify via `DESCRIBE MODEL` — see `patterns-uc-registration.md` Pattern 4 - ---- - -## Journey 5: Debug a `pyfunc.load_model` that fails or predicts wrong - -Model loaded successfully, but `.predict()` raises or produces nonsense. - -**Steps:** - -1. **Check the signature was logged:** - ```python - from mlflow.models import get_model_info - info = get_model_info("models:/<model_name>@champion") - print(info.signature) - ``` - If `None` — see `GOTCHAS.md` #8. Re-log the model with `signature=infer_signature(...)`. - -2. **Check the input column order:** - ```python - expected = model.metadata.get_input_schema().input_names() - print(f"Model expects: {expected}") - print(f"You passed: {list(features_df.columns)}") - ``` - If the order differs, pass `features_df[expected]`. - -3. **Check preprocessing coverage:** - - Does the training notebook call a scaler / encoder / imputer before fitting? - - Is that preprocessing in the logged artifact? - - If not — see `GOTCHAS.md` #12. Re-train with preprocessing wrapped in `sklearn.Pipeline`. - -4. **Check for type coercion:** - - Integer column becoming float (or vice versa) — fine for sklearn, sometimes breaks for xgboost/pytorch - - Categorical as string vs int — depends on the flavor - - Fix: cast `features_df` to match `model.metadata.get_input_schema()` dtypes before predicting - ---- - -## Journey 6: Schema evolution — your features changed since the model was logged - -The silver features pipeline added a new column. Your deployed `@champion` model was trained without it. Predictions still work (extra columns are ignored), but you want to include the new feature. - -**Steps:** - -1.
Retrain with the new feature: - ```python - # Same Journey 1 steps, but with expanded feature set - mlflow.sklearn.log_model( - sk_model=new_pipeline, - artifact_path="model", - signature=infer_signature(X_train_expanded, new_pipeline.predict(X_train_expanded[:5])), - input_example=X_train_expanded.iloc[:5], - ) - ``` -2. Register as a new version -3. Validate via A/B (Journey 2) -4. Promote to `@champion` - -Schema changes are always a new version. Never mutate a logged model in place. - ---- - -## Journey 7: "Everything is on fire, I have 10 minutes to demo" - -Someone registered a fallback model. Load it. - -```python -import mlflow -mlflow.set_registry_uri("databricks-uc") -model = mlflow.pyfunc.load_model( - "models:/<catalog>.<schema>.<model>@fallback" -) -features = spark.table("<catalog>.<schema>.sample_features").limit(500).toPandas() -features["prediction"] = model.predict(features) -display(spark.createDataFrame(features)) -``` - -Every escape-hatch pattern should pre-register a `@fallback` version for exactly this case. - ---- - -## When to use which journey - -| Situation | Journey | -|-----------|---------| -| I'm starting from zero | 1 | -| I have `@champion`, trained something new | 2 | -| I want predictions in a scheduled table | 3 | -| Registered but can't find in Catalog Explorer | 4 | -| `load_model` succeeds but `predict` fails | 5 | -| My features changed | 6 | -| Demo in 10 minutes, nothing works | 7 | From b424134f23d9e4a81a3f93a3f8beb5c1ef600248 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sat, 9 May 2026 15:26:49 +1000 Subject: [PATCH 5/5] =?UTF-8?q?chore(mlflow-ml):=20rename=20GOTCHAS.md=20?= =?UTF-8?q?=E2=86=92=20gotchas.md=20(case=20fix)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit macOS case-insensitive filesystem hid this from the previous commit. The content was already lowercased in references; this commit makes the git index match.
Co-authored-by: Isaac --- .../databricks-mlflow-ml/references/{GOTCHAS.md => gotchas.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename databricks-skills/databricks-mlflow-ml/references/{GOTCHAS.md => gotchas.md} (100%) diff --git a/databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md b/databricks-skills/databricks-mlflow-ml/references/gotchas.md similarity index 100% rename from databricks-skills/databricks-mlflow-ml/references/GOTCHAS.md rename to databricks-skills/databricks-mlflow-ml/references/gotchas.md