diff --git a/databricks-skills/databricks-mlflow-ml/SKILL.md b/databricks-skills/databricks-mlflow-ml/SKILL.md new file mode 100644 index 00000000..26286e8c --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/SKILL.md @@ -0,0 +1,91 @@ +--- +name: databricks-mlflow-ml +description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that." +--- + +# MLflow + Unity Catalog — Classic ML + +Read this file fully; consult `references/gotchas.md` before writing UC code; consult `references/recipes.md` only for the alias-swap and `spark_udf` patterns. + +If you're tempted to read `patterns-training.md`, `patterns-experiment-setup.md`, `patterns-uc-registration.md`, or `patterns-batch-inference.md` to figure out basic sklearn training, stop — you don't need them. This skill is only about the Databricks / Unity Catalog parts that are easy to miss. + +## Why This Skill Exists + +Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**. 
| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

Use this skill when training forecasting / classification / regression models, registering them to Unity Catalog, and scoring them in a notebook or Lakeflow pipeline. Do not use it for GenAI evaluation or Model Serving endpoint management.

## Hard Rules

1. Call `mlflow.set_registry_uri("databricks-uc")` before registering or loading UC models.
2. UC model names are always three-level: `catalog.schema.model_name`.
3. Load by alias, not version: `models:/catalog.schema.model@champion`, not `models:/catalog.schema.model/3`.
4. In UC-enforced workspaces, experiments need `artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>"`.
5. `register_model` creates a version; it does **not** set `@champion` or `@challenger`.
6. Use aliases for lifecycle. Legacy stages like `Production` / `Staging` are deprecated for UC models.

## Quick Start

Minimum viable path from trained model object to UC-registered, notebook-scored model:

```python
import mlflow
import mlflow.sklearn
from mlflow import MlflowClient
from mlflow.models import infer_signature

CATALOG = "my_catalog"
SCHEMA = "my_schema"
MODEL_NAME = f"{CATALOG}.{SCHEMA}.my_model"

# 1. Configure UC registry + UC volume-backed experiment.
mlflow.set_registry_uri("databricks-uc")
mlflow.set_experiment(
    experiment_name="/Users/me@company.com/forecasting",
    artifact_location=f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/mlflow_artifacts/forecasting",
)

# 2. Train + log. Use name="model" in MLflow 3.x; artifact_path="model" only for older code.
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))

    mlflow.sklearn.log_model(
        sk_model=model,  # log the full Pipeline if preprocessing exists
        name="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + set alias. register_model returns a ModelVersion; alias is a separate call.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=MODEL_NAME,
)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load by alias, never by hard-coded version.
loaded = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = loaded.predict(X_test)
```

## Decision Table

| Situation | Do this |
|-----------|---------|
| Starting a first UC-registered classic ML model | Quick Start, then `recipes.md` §1–2; check `gotchas.md` #1, #2, #4, #7 |
| Model registered but missing from Catalog Explorer | Diagnose `set_registry_uri` and three-level names in `gotchas.md` #1–2 |
| Need notebook batch scoring | Use `mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`; keep the alias rule above |
| Need scheduled / distributed batch scoring in Lakeflow SDP | Use `recipes.md` §3 and `gotchas.md` #7; construct `spark_udf` at module scope |
| Retrained a challenger and need promotion | Use `recipes.md` §4 exactly; delete old `@champion` before setting new `@champion` |
| Load or predict behaves oddly | Use `recipes.md` §5 for `get_model_info` / signature checks, then `gotchas.md` for UC-specific failures |

## Runtime Compatibility

MLflow 3.x prefers `name=` in `log_model`; MLflow 2.x examples often use `artifact_path=`, which works but warns in newer versions. UC model stages are deprecated across modern Databricks runtimes; use aliases.
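The version split above can be absorbed by a tiny shim in shared training code; a minimal sketch (the helper name `log_model_kwargs` is ours, not an MLflow API):

```python
def log_model_kwargs(mlflow_version: str) -> dict:
    """Pick the model-name keyword for log_model from the MLflow major version.

    MLflow 3.x prefers name="model"; 2.x uses artifact_path="model".
    """
    major = int(mlflow_version.split(".")[0])
    return {"name": "model"} if major >= 3 else {"artifact_path": "model"}
```

Call it as `mlflow.sklearn.log_model(sk_model=model, **log_model_kwargs(mlflow.__version__), ...)` so one training script runs cleanly on both runtime families without deprecation warnings.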
diff --git a/databricks-skills/databricks-mlflow-ml/references/gotchas.md b/databricks-skills/databricks-mlflow-ml/references/gotchas.md new file mode 100644 index 00000000..586b8ce6 --- /dev/null +++ b/databricks-skills/databricks-mlflow-ml/references/gotchas.md @@ -0,0 +1,161 @@ +# Databricks / Unity Catalog Gotchas + +Only the Databricks + Unity Catalog-specific failures are here. Generic MLflow, sklearn, and modeling advice intentionally lives elsewhere. + +## Runtime Gotcha Matrix + +| Area | MLflow 2.x | MLflow 3.x / newer Databricks guidance | +|------|------------|-----------------------------------------| +| Model artifact argument | `artifact_path="model"` is common | Prefer `name="model"`; `artifact_path` warns and may disappear later | +| UC lifecycle | Stages already deprecated for UC | Use aliases only: `@champion`, `@challenger`, custom aliases | +| Registry target | Workspace registry remains default unless changed | Still call `mlflow.set_registry_uri("databricks-uc")` explicitly | + +--- + +## 1. Missing `mlflow.set_registry_uri("databricks-uc")` + +**How it fails:** Silent. `register_model` succeeds, but the model lands in the legacy workspace registry, not Unity Catalog; Catalog Explorer cannot find it. + +**Fix:** call this before any register or load: + +```python +mlflow.set_registry_uri("databricks-uc") +assert mlflow.get_registry_uri() == "databricks-uc" +``` + +**Why:** MLflow keeps workspace-registry defaults for backward compatibility, so the API call can succeed in the wrong registry. + +--- + +## 2. Not using a three-level UC model name + +**How it fails:** Loud with UC registry (`INVALID_PARAMETER_VALUE`), but silent-wrong if you also forgot `set_registry_uri`: two-level names can register to the workspace registry. + +**Fix:** always use `catalog.schema.model_name`. 
+ +```python +# Wrong +"my_model" +"my_schema.my_model" + +# Correct +"my_catalog.my_schema.my_model" +``` + +**Why:** Unity Catalog models are securable objects under a catalog and schema; workspace-registry names are not. + +--- + +## 3. Experiment artifact location is not a UC volume + +**How it fails:** Usually loud later, not at setup: `log_model` or artifact upload fails with storage / permission errors. In older patterns, artifacts may silently land in DBFS root, which breaks UC governance expectations. + +**Fix:** set a UC volume-backed artifact location when creating the experiment. + +```python +mlflow.set_experiment( + experiment_name="/Users/me@company.com/forecasting", + artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting", +) +``` + +**Why:** UC-enforced workspaces reject unmanaged DBFS-root artifact writes; UC volumes keep model artifacts governed and loadable. + +--- + +## 4. Using legacy `Production` / `Staging` stages + +**How it fails:** Silent or misleading. Stage APIs such as `transition_model_version_stage()` are deprecated / ineffective for UC models; aliases named `"Production"` may exist as labels but are not treated as lifecycle stages. + +**Fix:** use UC aliases by convention: + +```python +MlflowClient().set_registered_model_alias(name, "champion", version) +MlflowClient().set_registered_model_alias(name, "challenger", version) +``` + +**Why:** Unity Catalog model lifecycle moved from stages to free-form aliases; downstream loaders should use `models:/name@champion`. + +--- + +## 5. Missing `CREATE MODEL ON SCHEMA` + +**How it fails:** Loud. `register_model` raises `PERMISSION_DENIED: User ... does not have CREATE MODEL permission`. + +**Fix:** ask the schema owner for the schema-level model-creation grant. 
+ +```sql +GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`; +SHOW GRANTS ON SCHEMA my_catalog.my_schema; +``` + +**Why:** `USE CATALOG` and `USE SCHEMA` are not enough; model creation is a separate UC privilege. + +--- + +## 6. Assuming `ai_query` is batch inference for custom UC models + +**How it fails:** Loud or wrong-primitive. `ai_query` calls serving endpoints; a UC-registered custom model is not automatically a serving endpoint. + +**Fix:** for batch inference, use: + +```python +mlflow.pyfunc.load_model("models:/catalog.schema.model@champion") # notebook / pandas path +mlflow.pyfunc.spark_udf(spark, "models:/catalog.schema.model@champion", result_type="double") +``` + +**Why:** registration and serving are separate. `ai_query` belongs to Model Serving / Foundation Model endpoint workflows, not ordinary UC batch scoring. + +--- + +## 7. Constructing `spark_udf` inside a Lakeflow SDP function + +**How it fails:** Often loud and slow: repeated model deserialization, serialization errors, or pipeline refreshes that hang / retry. Sometimes just silently expensive. + +**Fix:** construct the UDF once at module scope and call it inside `@dp.table` / `@dp.materialized_view`. + +```python +mlflow.set_registry_uri("databricks-uc") +predict_udf = mlflow.pyfunc.spark_udf( + spark, + "models:/catalog.schema.model@champion", + result_type="double", +) +``` + +**Why:** Lakeflow SDP can evaluate dataset functions repeatedly; model loading belongs at module import time, not inside the dataset function body. + +--- + +## 8. Missing `mlflow[databricks]` extras outside Databricks compute + +**How it fails:** Loud. Local laptop / CI / non-Databricks jobs may train and log, then fail on UC registration with missing cloud SDK imports such as `azure`, `boto3`, or `google.cloud`. 
**Fix:**

```bash
pip install 'mlflow[databricks]'
# or
pip install 'mlflow-skinny[databricks]'
```

**Why:** UC registration stages artifacts through cloud-managed storage; the Databricks extras include the provider SDKs that plain `mlflow` may omit.

---

## 9. Using deprecated `artifact_path=` instead of `name=`

**How it fails:** Noisy now, possibly loud later. Newer MLflow warns that `artifact_path` is deprecated; future major versions may remove it.

**Fix:** prefer:

```python
mlflow.sklearn.log_model(
    sk_model=model,
    name="model",
    signature=signature,
    input_example=input_example,
)
```

**Why:** MLflow renamed the within-run model artifact argument; the value still becomes the path used by `runs:/<run_id>/model`.
diff --git a/databricks-skills/databricks-mlflow-ml/references/recipes.md b/databricks-skills/databricks-mlflow-ml/references/recipes.md
new file mode 100644
index 00000000..db326fad
--- /dev/null
+++ b/databricks-skills/databricks-mlflow-ml/references/recipes.md
@@ -0,0 +1,233 @@
# UC-Specific Recipes

These are code shapes, not full sklearn implementations. Use them to get Databricks / Unity Catalog arguments and ordering right.

## 1. Experiment + UC Volume Setup

Do this before training if the workspace enforces Unity Catalog storage.

- Set the registry URI every session:
  ```python
  mlflow.set_registry_uri("databricks-uc")
  ```
- Create the artifact volume once per schema:
  ```sql
  CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.mlflow_artifacts;
  ```
- Create / select the experiment with a UC volume artifact location:
  ```python
  mlflow.set_experiment(
      experiment_name="/Users/me@company.com/forecasting",
      artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
  )
  ```

If the experiment already exists with a non-UC artifact location, create a new experiment path. Do not try to move MLflow artifacts manually; run metadata already points at the original location.
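The reuse warning above can be made mechanical: before training, check that the experiment's `artifact_location` really is UC-volume-backed. A minimal sketch assuming the `dbfs:/Volumes/<catalog>/<schema>/<volume>/...` path shape (the helper name is ours):

```python
def is_uc_volume_location(artifact_location: str) -> bool:
    """True if an MLflow artifact location is backed by a Unity Catalog volume.

    UC volume paths look like dbfs:/Volumes/<catalog>/<schema>/<volume>/...
    """
    prefix = "dbfs:/Volumes/"
    if not artifact_location.startswith(prefix):
        return False
    # Require at least catalog, schema, and volume segments after the prefix.
    segments = [s for s in artifact_location[len(prefix):].split("/") if s]
    return len(segments) >= 3
```

In a notebook, guard with `assert is_uc_volume_location(mlflow.set_experiment(...).artifact_location)` so a reused legacy experiment fails fast instead of silently writing artifacts to an ungoverned location.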
## 2. Log → Register → Alias

### Logging UC essentials

When logging the model:

- Include `signature=infer_signature(X_train, model.predict(X_train[:5]))`.
- Include `input_example=X_train.iloc[:5]` or equivalent real rows.
- Use `name="model"` for MLflow 3.x / newer code; `artifact_path="model"` is the older spelling.
- If preprocessing exists, log the whole pipeline / wrapper, not just the final estimator.

Shape:

```python
with mlflow.start_run() as run:
    # train your estimator or pipeline here
    mlflow.<flavor>.log_model(
        <flavor_arg>=model_or_pipeline,
        name="model",
        signature=signature,
        input_example=input_example,
    )
```

### Register + champion alias

After training:

```python
result = mlflow.register_model(
    f"runs:/{run_id}/model",
    "my_catalog.my_schema.my_model",
)
MlflowClient().set_registered_model_alias(
    "my_catalog.my_schema.my_model",
    "champion",
    result.version,
)
```

`register_model` returns a `ModelVersion`; `result.version` is a string such as `"1"`. It does **not** set aliases — the alias call is separate and required.

### Tags syntax

Tags can be set at registration time:

```python
result = mlflow.register_model(
    f"runs:/{run_id}/model",
    MODEL_NAME,
    tags={"dataset_version": "2024-Q4", "trained_by": "forecasting_team"},
)
```

Or after registration:

```python
client.set_registered_model_tag(MODEL_NAME, "domain", "retail")
client.set_model_version_tag(MODEL_NAME, result.version, "reviewed", "true")
```

### Minimal UC permission checklist

| Operation | Required UC privilege |
|-----------|-----------------------|
| First registration of a model in a schema | `CREATE MODEL ON SCHEMA catalog.schema` |
| Registering a new version | `EDIT ON MODEL catalog.schema.model` |
| Setting aliases / tags | `EDIT ON MODEL catalog.schema.model` |
| Loading for inference | `EXECUTE ON MODEL catalog.schema.model` plus `USE CATALOG` / `USE SCHEMA` |

## 3. Lakeflow SDP `spark_udf` Shape

For Lakeflow SDP, create the UDF at module scope, not inside the decorated dataset function.

```python
# src/gold/score_model.py
import mlflow
import databricks.declarative_pipelines as dp

mlflow.set_registry_uri("databricks-uc")

MODEL_NAME = "my_catalog.my_schema.my_model"

predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri=f"models:/{MODEL_NAME}@champion",
    result_type="double",
    env_manager="local",
)

@dp.materialized_view
def gold_predictions():
    return (
        spark.read.table("my_catalog.my_schema.silver_features")
        .withColumn(
            "prediction",
            predict_udf("feature_a", "feature_b", "feature_c"),
        )
    )
```

Pass feature columns in the order expected by the model signature.

`result_type` shapes:

| Model output | `result_type` |
|--------------|---------------|
| Single numeric prediction | `"double"` |
| Integer class id | `"long"` |
| String class label | `"string"` |
| Multi-output numeric vector | `"array<double>"` |
| Named outputs | `StructType([...])` |

Do not use `ai_query` here unless you have explicitly deployed a Model Serving endpoint.

## 4. A/B Promotion Alias Swap

This order is intentional: delete old `@champion` before setting the new one. Otherwise, during a botched sequence or retry, the pre-existing alias can still point consumers at the wrong version.

```python
from mlflow import MlflowClient

client = MlflowClient()
MODEL_NAME = "my_catalog.my_schema.my_model"

model = client.get_registered_model(MODEL_NAME)
old_champion = model.aliases.get("champion")
new_champion = model.aliases.get("challenger")

if new_champion is None:
    raise RuntimeError("No @challenger alias set; nothing to promote")

# Optional: preserve an explicit rollback handle before moving champion.
if old_champion:
    client.set_registered_model_alias(
        MODEL_NAME,
        f"archived_{old_champion}",
        old_champion,
    )

# Required order: remove old champion, then set new champion.
if old_champion:
    client.delete_registered_model_alias(MODEL_NAME, "champion")

client.set_registered_model_alias(MODEL_NAME, "champion", new_champion)

# Remove challenger after it has become champion.
client.delete_registered_model_alias(MODEL_NAME, "challenger")
```

Downstream code using `models:/my_catalog.my_schema.my_model@champion` picks up the new version on next load. No loader code changes.

Rollback shape:

```python
client.delete_registered_model_alias(MODEL_NAME, "champion")
client.set_registered_model_alias(MODEL_NAME, "champion", old_champion)
```

## 5. Verification One-Liners

### SQL

```sql
DESCRIBE MODEL my_catalog.my_schema.my_model;
SHOW MODEL VERSIONS ON MODEL my_catalog.my_schema.my_model;
SHOW GRANTS ON MODEL my_catalog.my_schema.my_model;
SHOW GRANTS ON SCHEMA my_catalog.my_schema;
```

If `DESCRIBE MODEL` cannot find it but `register_model` succeeded, suspect the workspace-registry trap: missing `mlflow.set_registry_uri("databricks-uc")`.

### Alias dictionary shape

```python
model = MlflowClient().get_registered_model("my_catalog.my_schema.my_model")
model.aliases
# Expected shape: {"champion": "3", "challenger": "4"}
```

Use this to confirm that `@champion` exists and points at the version you intended.

### Signature debugging

```python
from mlflow.models import get_model_info

info = get_model_info("models:/my_catalog.my_schema.my_model@champion")
info.signature
info.flavors
```

If `info.signature` is missing or does not match the DataFrame columns you pass to `predict`, re-log the model with a signature and input example.

### Load URI sanity check

```python
mlflow.pyfunc.load_model("models:/my_catalog.my_schema.my_model@champion")
```

Correct URI shape is:

```text
models:/<catalog>.<schema>.<model>@<alias>
```

Avoid version-pinned loaders such as `models:/catalog.schema.model/3` unless you are doing forensic debugging.
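The URI shape above can be enforced programmatically before calling `load_model`; a minimal standard-library sketch (the function name and regex are ours, not an MLflow API):

```python
import re

# Three-level UC name plus an alias: models:/<catalog>.<schema>.<model>@<alias>
_UC_ALIAS_URI = re.compile(r"^models:/[\w-]+\.[\w-]+\.[\w-]+@[\w-]+$")

def check_alias_uri(uri: str) -> str:
    """Raise if uri is not an alias-based, three-level UC model URI."""
    if not _UC_ALIAS_URI.match(uri):
        raise ValueError(
            f"Expected models:/<catalog>.<schema>.<model>@<alias>, got: {uri!r}"
        )
    return uri
```

Version-pinned URIs such as `models:/catalog.schema.model/3` and two-level names fail loudly here, before MLflow produces a less specific loading error.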