From f628316a89f6585bd585ee88079362dbccd97673 Mon Sep 17 00:00:00 2001
From: David O'Keeffe <david.okeeffe@databricks.com>
Date: Sun, 3 May 2026 21:55:35 +1000
Subject: [PATCH] feat(skills): add databricks-lakebase-migration skill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures the Provisioned → Autoscaling migration mechanics that aren't in
the public docs as of 2026-05. Covers:

- pg_dump/pg_restore workflow with the --role flag for SP ownership
- Why raw CREATE DATABASE breaks app SP OAuth (missing databricks_auth +
  neon extensions) and how to recover
- databricks_create_role('<client-id>', 'SERVICE_PRINCIPAL') as the proper
  way to register an app SP for OAuth-token resolution
- The Apps API update-mask limitation when re-pointing database resources
  via bundle deploy, and the direct apps update --json workaround
- Synced-table snapshot semantics (UC sync pipelines do not auto-follow)
- Step-by-step runbook plus a common-issues table covering every failure
  mode hit during a real lakemeter migration

Also updates databricks-skills/README.md to list the existing
databricks-lakebase-autoscale skill (was missing) and the new migration
skill, and adds them to install_skills.sh.

Co-authored-by: Isaac
---
 databricks-skills/README.md                |   4 +-
 .../databricks-lakebase-migration/SKILL.md | 447 +++++++++++++++++++++
 databricks-skills/install_skills.sh        |   3 +-
 3 files changed, 452 insertions(+), 2 deletions(-)
 create mode 100644 databricks-skills/databricks-lakebase-migration/SKILL.md

diff --git a/databricks-skills/README.md b/databricks-skills/README.md
index a81730a2..ca08cc1b 100644
--- a/databricks-skills/README.md
+++ b/databricks-skills/README.md
@@ -105,7 +105,9 @@ cp -r ai-dev-kit/databricks-skills/databricks-agent-bricks .claude/skills/
 - **databricks-app-python** - Python web apps (Dash, Streamlit, Flask) with foundation model integration
 - **databricks-python-sdk** - Python SDK, Connect, CLI, REST API
 - **databricks-config** - Profile authentication setup
-- **databricks-lakebase-provisioned** - Managed PostgreSQL for OLTP workloads
+- **databricks-lakebase-provisioned** - Managed PostgreSQL for OLTP workloads (legacy fixed-capacity model)
+- **databricks-lakebase-autoscale** - Next-gen managed PostgreSQL with autoscaling, scale-to-zero, branching
+- **databricks-lakebase-migration** - Migrate from Lakebase Provisioned → Autoscaling via pg_dump/pg_restore
 
 ### 📚 Reference
 - **databricks-docs** - Documentation index via llms.txt

diff --git a/databricks-skills/databricks-lakebase-migration/SKILL.md b/databricks-skills/databricks-lakebase-migration/SKILL.md
new file mode 100644
index 00000000..0433424a
--- /dev/null
+++ b/databricks-skills/databricks-lakebase-migration/SKILL.md
@@ -0,0 +1,447 @@
---
name: databricks-lakebase-migration
description: "Migrate data and apps from Lakebase Provisioned to Lakebase Autoscaling. Use when planning or executing a Provisioned → Autoscaling cutover, dumping/restoring a Lakebase database via pg_dump, registering Service Principal roles on a new instance, re-pointing a Databricks App's database resource binding, or working around the gotchas in raw SQL bootstrap of a destination database."
---

# Lakebase Migration (Provisioned → Autoscaling)

Mechanics and gotchas for migrating an existing Lakebase Provisioned database
to a new Lakebase Autoscaling project via `pg_dump` / `pg_restore`. Direct
in-place migration is **not currently supported** by Databricks; this is the
sanctioned manual path until one-click migration ships.
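
Every `pg_dump` / `pg_restore` / `psql` call on this path authenticates with a
short-lived OAuth token used as the Postgres password. As a quick reference —
this is the same pattern Steps 3 and 5 of the runbook use; `<profile>` is your
Databricks CLI profile for the relevant workspace:

```bash
# Mint a short-lived OAuth token and hand it to psql-family tools as PGPASSWORD.
# Tokens are short-lived: if a later step fails with an auth error, re-export.
export PGPASSWORD="$(databricks auth token -p <profile> | \
  python3 -c 'import sys, json; print(json.load(sys.stdin)["access_token"])')"
psql -h <instance-host> -U "<your-user>" -d <database>
```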

## When to Use

Use this skill when:

- A Lakebase Provisioned instance backs a Databricks App and you want to move
  to Autoscaling for scale-to-zero, branching, or instant-restore.
- You hit `password authentication failed` after pointing an app at a freshly
  bootstrapped Autoscaling database.
- You need to restore a `pg_dump` snapshot into a fresh Lakebase database and
  have it work for app-SP OAuth connections.
- You're updating a Databricks App's `database` resource to point at a new
  instance and the bundle deploy fails on the update mask.

Also see the related skills:
- [databricks-lakebase-provisioned](../databricks-lakebase-provisioned/SKILL.md) — source-side mechanics
- [databricks-lakebase-autoscale](../databricks-lakebase-autoscale/SKILL.md) — destination-side mechanics

## Overview

| Aspect | Details |
|--------|---------|
| **In-place upgrade** | Not supported as of 2026-05. A one-click migration is on the roadmap; timing TBD. |
| **Recommended path** | Snapshot via `pg_dump` (custom format) → restore into a new Autoscaling project. |
| **Downtime** | Roughly 5-15 minutes for a sub-100 MB database; dominated by app-redeploy time. |
| **Reversibility** | High — keep the source database until you've soaked the destination. Rollback = revert one bundle var + one apps update + restart app. |
| **Synced tables** | UC sync pipelines do **not** auto-follow; copied data is a frozen snapshot. Re-wiring is a separate workstream. |

## Pre-flight checklist

- [ ] Local `pg_dump --version` and `pg_restore --version` ≥ 16 (matches Lakebase `PG_VERSION_16`)
- [ ] You have **Database superuser** on the new Autoscaling instance (workspace UI: **Compute → Database Instances → … → Permissions**)
- [ ] You know the target app's **service_principal_client_id** (`databricks apps get <app-name>` → `service_principal_client_id`)
- [ ] No active writes against the source DB during cutover (or accept a small window of orphaned writes)

## The five gotchas you will hit

These are not in the public docs as of 2026-05. They are the difference
between a 30-minute migration and a 3-hour one.

### 1. Don't `CREATE DATABASE` with raw SQL — use the Databricks Database API

Creating a database via `psql -c "CREATE DATABASE foo;"` skips Lakebase's
managed-creation flow and leaves the database **without the
`databricks_auth` and `neon` extensions**. The first symptom you'll see is
the app failing to authenticate with `password authentication failed for
user '<sp-client-id>'` — even though the role exists.

**Fix (preferred):** create the database via the Databricks Database CLI/API,
which provisions extensions automatically:

```bash
databricks database create-database-catalog \
  <catalog-name> <instance-name> <database-name> \
  --create-database-if-not-exists \
  -p <profile>
```

**Fix (recovery, if you already raw-`CREATE DATABASE`'d):**

```sql
CREATE EXTENSION IF NOT EXISTS databricks_auth;
CREATE EXTENSION IF NOT EXISTS neon;
```

You must run these as a superuser. Without `databricks_auth`, OAuth tokens
from app SPs cannot be resolved to Postgres roles.
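
Whichever fix you used, verify the extensions actually landed before moving
on — `pg_extension` is a standard Postgres catalog, so this check is safe on
any instance:

```sql
-- Expect two rows back; a missing row means re-run the recovery fix above.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('databricks_auth', 'neon');
```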

### 2. App SP roles need `databricks_create_role()`, not `CREATE ROLE`

Vanilla `CREATE ROLE "0ad623cd-..." LOGIN INHERIT` produces a role that
*looks* right but lacks the OAuth-token-resolution wiring inside
`databricks_auth`. The fix is to use the Lakebase-provided function:

```sql
SELECT databricks_create_role(
  '0ad623cd-2827-40d9-917e-1b9f824e4c57',  -- service_principal_client_id
  'SERVICE_PRINCIPAL'                      -- identity_type
);
```

Verify the registration:

```sql
SELECT * FROM databricks_list_roles
 WHERE role_name='0ad623cd-2827-40d9-917e-1b9f824e4c57';
-- expect: identity_type='service_principal'
```

If you've already created a vanilla role with the same name, you must
unwind it first:

```sql
-- 1. Park ownership on a real user
REASSIGN OWNED BY "<sp-client-id>" TO "<your-user>";

-- 2. Drop privileges that block role drop
REVOKE ALL ON DATABASE <database> FROM "<sp-client-id>";
REVOKE ALL ON SCHEMA <schema> FROM "<sp-client-id>";
REVOKE ALL ON ALL TABLES IN SCHEMA <schema> FROM "<sp-client-id>";

-- 3. Grant role membership so DROP OWNED works
GRANT "<sp-client-id>" TO "<your-user>";

-- 4. Drop
DROP OWNED BY "<sp-client-id>";
DROP ROLE "<sp-client-id>";

-- 5. Now register properly
SELECT databricks_create_role('<sp-client-id>', 'SERVICE_PRINCIPAL');
```

### 3. `pg_restore` ownership trap

If you restore as your IdP user (e.g. `david.okeeffe@databricks.com`), every
table ends up owned by you, and the app SP cannot run DDL on its own
tables later (e.g. when the app's startup migrations try `ALTER TABLE …
ADD COLUMN`). The fix is the `--role` flag:

```bash
pg_restore -v \
  --no-owner --no-acl \
  --role="0ad623cd-2827-40d9-917e-1b9f824e4c57" \
  -h <instance-name>.database.us-west-2.cloud.databricks.com \
  -U "<your-user>" \
  -d <database> \
  /path/to/dump.bak
```

`--role` causes every restored DDL statement to run via `SET ROLE` to the SP,
so tables end up SP-owned. This requires:

1. The SP role to exist (registered via `databricks_create_role`, gotcha #2).
2. The connecting user to have `GRANT "<sp-client-id>" TO "<your-user>"` membership.

### 4. Bundle deploy can't change `database.instance_name` on an existing app

The Databricks Apps API doesn't accept `resources[*].database.instance_name`
in the update mask. The bundle deploy emits a deep update path and fails:

```
Invalid update mask. Only description, ..., resources, ... are allowed.
Supplied update mask: resources[0].database.instance_name
```

**Fix:** call `apps update` directly with the full `resources` array, which
*is* an allowed top-level update mask:

```bash
databricks apps update <app-name> -p <profile> --json '{
  "name": "<app-name>",
  "description": "<description>",
  "resources": [
    {
      "name": "database",
      "description": "Lakebase database for ...",
      "database": {
        "database_name": "<database>",
        "instance_name": "<new-instance-name>",
        "permission": "CAN_CONNECT_AND_CREATE"
      }
    }
  ]
}'
```

After the resource is updated, re-run `databricks bundle deploy` (or
`make ship`) — it will be a no-op on the resource and will successfully sync
new code.

### 5. Synced tables (`sync_*`) are a frozen snapshot after migration

UC sync pipelines write to a *specific* Lakebase instance. They do not
auto-follow when you re-point an app's database resource. After the
cutover, your `sync_pricing_vm_costs`, `sync_ref_*`, and `sync_salesforce_*`
tables exist on the new instance with the data they had at dump time, but
no further updates flow in.

**Two options:**
- **Path A (snapshot):** accept frozen data, refresh manually when needed.
  Suitable for SE/demo apps where freshness is best-effort.
- **Path B (re-wire):** open a ticket with the team owning the UC sync
  pipeline to re-point it at the new instance. Required for any
  customer-facing tool.

Path B is the same work either way — a direct migration would still need the
re-wiring — so Path A is a no-regret default for v1 migrations.
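
Before starting the runbook, capture the identifiers every step references.
A sketch, assuming the CLI's default JSON output and `jq` on your PATH
(`databricks apps get` is the same call as in the pre-flight checklist):

```bash
# Shell variables reused throughout Steps 1-11 (placeholders; adjust to your project).
APP=<app-name>
PROFILE=<profile>
SP_ID="$(databricks apps get "$APP" -p "$PROFILE" | jq -r '.service_principal_client_id')"
echo "App SP client id: $SP_ID"   # feeds databricks_create_role() and pg_restore --role
```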

## End-to-end migration runbook

This is the full sequence proven on a 99 MB lakemeter database. Adapt the
identifiers (instance name, app name, SP UUID, schema) to your project.

### Step 1: Create the destination Autoscaling project

Use the CLI (auto-provisions the `databricks_auth` and `neon` extensions):

```bash
databricks database create-database-instance \
  <new-instance-name> \
  --capacity CU_1 \
  --enable-pg-native-login \
  -p <profile>
```

Note the returned `read_write_dns` (looks like `ep-...database.<region>.cloud.databricks.com`).

### Step 2: Create the destination database

The cleanest path is to use the Databricks Database CLI to create the
database AND register it under a UC catalog at the same time:

```bash
databricks database create-database-catalog \
  <catalog-name> <new-instance-name> <database-name> \
  --create-database-if-not-exists \
  -p <profile>
```

If you've already created the database via raw SQL, recover with the
extension installs in gotcha #1.

### Step 3: pg_dump from the source

```bash
mkdir -p /tmp/lakebase_migration && cd /tmp/lakebase_migration

PGPASSWORD="$(databricks auth token -p <profile> | python3 -c \
  'import sys,json; print(json.load(sys.stdin)["access_token"])')" \
pg_dump -Fc -v \
  -n <schema> \
  --no-owner --no-acl \
  -h <source-instance-host> \
  -U "<your-user>" \
  -d <database> \
  -p 5432 \
  -f migration.bak
```

**Why these flags:**
- `-Fc` — custom-format archive for parallelisable restore
- `-n <schema>` — only the app schema, not system tables
- `--no-owner --no-acl` — ownership and grants are re-established on the
  destination (Step 4 bootstrap plus the app's startup privilege block)

### Step 4: Bootstrap the destination schema and SP role

```sql
-- Connected as superuser to <database> on the new instance

-- (Only if Step 2 used raw CREATE DATABASE)
CREATE EXTENSION IF NOT EXISTS databricks_auth;
CREATE EXTENSION IF NOT EXISTS neon;

-- Register the app SP for OAuth resolution
SELECT databricks_create_role(
  '<sp-client-id>',
  'SERVICE_PRINCIPAL'
);

-- Grant yourself membership so --role on pg_restore works
GRANT "<sp-client-id>" TO "<your-user>";

-- Create the schema owned by the SP
CREATE SCHEMA IF NOT EXISTS <schema>
  AUTHORIZATION "<sp-client-id>";
```

### Step 5: pg_restore with `--role`

```bash
PGPASSWORD="$(databricks auth token -p <profile> | python3 -c \
  'import sys,json; print(json.load(sys.stdin)["access_token"])')" \
pg_restore -v \
  --no-owner --no-acl \
  --role="<sp-client-id>" \
  -h <new-instance-host> \
  -U "<your-user>" \
  -d <database> \
  /tmp/lakebase_migration/migration.bak
```

A handful of cosmetic warnings are normal:
- `unrecognized configuration parameter "transaction_timeout"` — Neon doesn't honor it
- `permission denied for database <database>` — `COMMENT ON DATABASE`; ignore
- `Databricks SyncedTable` — synced-table metadata didn't transfer (gotcha #5)

### Step 6: Verify row counts match

```sql
SELECT 'users' AS t, COUNT(*) FROM <schema>.users
UNION ALL
SELECT 'estimates', COUNT(*) FROM <schema>.estimates
UNION ALL
SELECT 'sync_pricing_*', COUNT(*) FROM <schema>.sync_pricing_<...>;
```

**STOP if counts diverge.** Don't proceed to cutover.
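
Row counts prove the data arrived; also confirm gotcha #3 didn't bite — every
restored table should be SP-owned. `pg_tables` is standard Postgres:

```sql
-- tableowner should be the SP client id on every row; anything owned by your
-- IdP user will break the app's startup DDL later (gotcha #3).
SELECT tablename, tableowner
FROM pg_tables
WHERE schemaname = '<schema>'
ORDER BY tablename;
```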

### Step 7: Patch the bundle config

In `databricks.yml`, change the `v2` (or whichever) target:

```diff
 v2:
   variables:
     app_name: "<app-name>"
-    lakebase_instance_name: "<old-instance-name>"
+    lakebase_instance_name: "<new-instance-name>"
     lakebase_database_name: "<database>"
```

### Step 8: Update the app's database resource (workaround for gotcha #4)

```bash
databricks apps update <app-name> -p <profile> --json '{
  "name": "<app-name>",
  "description": "<description>",
  "resources": [
    {
      "name": "database",
      "description": "<description>",
      "database": {
        "database_name": "<database>",
        "instance_name": "<new-instance-name>",
        "permission": "CAN_CONNECT_AND_CREATE"
      }
    }
  ]
}'
```

### Step 9: Stop, deploy, start

```bash
databricks apps stop <app-name> -p <profile>
make ship TARGET=<target> PROFILE=<profile>   # or `databricks bundle deploy -t <target>`
databricks apps start <app-name> -p <profile>
```

### Step 10: Verify the app boots and authenticates

Tail logs and look for the `[TokenManager]` line plus a clean Uvicorn
startup with **no** `password authentication failed` errors:

```bash
databricks apps logs <app-name> -p <profile> | tail -50 | \
  grep -iE "error|password|connect|started|uvicorn"
```

Then hit the app URL in a browser. If the app's startup migrations need
to add new columns or GRANT privileges, those should now succeed because
the SP owns the schema.

### Step 11: Soak before decommissioning the source

Keep the old database for at least 7 days. Rollback is fast:

```bash
# Revert the one-line bundle config change
git checkout databricks.yml

# Re-point app via direct apps update (gotcha #4)
databricks apps update <app-name> -p <profile> --json '{ ... old instance ... }'

# Restart
databricks apps stop <app-name> -p <profile>
databricks apps start <app-name> -p <profile>
```

When you're confident, drop the source database:

```sql
-- Connected as superuser to databricks_postgres on the OLD instance
DROP DATABASE <database>;
```

## Common Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| `password authentication failed for user '<sp-client-id>'` | `databricks_auth` extension missing OR SP role not registered via `databricks_create_role` | Gotchas #1, #2 |
| `must be able to SET ROLE "<sp-client-id>"` on `ALTER … OWNER` | Your IdP user lacks role membership | `GRANT "<sp-client-id>" TO "<your-user>"` |
| `permission denied for schema <schema>` during restore | Schema owner mismatch or missing GRANT | Re-create the schema with `AUTHORIZATION "<sp-client-id>"` before restore |
| `permission denied to drop objects` | You revoked role membership before dropping owned objects | Re-grant the role to yourself, then `DROP OWNED BY` first, then `DROP ROLE` |
| `role "<sp-client-id>" cannot be dropped because some objects depend on it` (after `REASSIGN OWNED`) | DB-level privileges weren't revoked | `REVOKE ALL ON DATABASE <database> FROM "<sp-client-id>"` |
| Bundle deploy fails with `Invalid update mask: resources[0].database.instance_name` | Apps API doesn't allow deep paths | Gotcha #4 — direct `apps update --json` |
| Cosmetic `Could not create schema (may already exist)` warning at app boot | App's `_init_schemas()` calls `CREATE SCHEMA IF NOT EXISTS` but the SP doesn't own the database | Harmless; the same warning appears on the source instance |
| `Databricks SyncedTable` warning during restore | Synced-table metadata doesn't transfer | Gotcha #5 — re-wire UC sync pipelines after cutover |
| Cost calc / lookups against `sync_*` tables stale after migration | UC sync wasn't re-pointed at the new instance | Gotcha #5 |
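
For the auth-related rows above, a two-query triage (paired with the extension
check from gotcha #1) localises the failure; `pg_roles` is standard Postgres
and `databricks_list_roles` is the view described in the Notes:

```sql
-- Registered for OAuth resolution? Expect identity_type='service_principal' (gotcha #2).
SELECT * FROM databricks_list_roles WHERE role_name = '<sp-client-id>';

-- Present at the Postgres level at all? pg_roles may list roles that aren't OAuth-resolvable.
SELECT rolname, rolcanlogin FROM pg_roles WHERE rolname = '<sp-client-id>';
```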

## What this migration does NOT cover

- **Live replication.** This is a snapshot migration. For zero-downtime,
  consider [logical replication](https://www.postgresql.org/docs/current/logical-replication.html)
  via a dedicated replication slot — Lakebase Autoscaling supports it, but
  that's a separate workstream.
- **UC catalog re-binding.** The existing UC catalog (e.g.
  `<app-name>_catalog`) was bound to the old instance. Either create a new
  catalog on the new instance (via `databricks database
  create-database-catalog`) or live with the old binding until you
  decommission. Don't delete-and-recreate the existing catalog if
  another target's database lives on the same instance.
- **Onboarding new SPs after migration.** If a new app SP needs to access
  the migrated database later, register it the same way:
  `SELECT databricks_create_role('<new-sp-client-id>', 'SERVICE_PRINCIPAL');`
- **Provisioned-side cleanup.** Decommission only after a soak window. The
  source data is unchanged by the migration, so rollback is non-destructive.

## Mapping reference

When sizing the destination, use the [official mapping table](https://docs.databricks.com/aws/en/oltp/upgrade-to-autoscaling)
between Provisioned capacity units and Autoscaling CU ranges:

| Provisioned (1 CU = 16 GB) | Autoscaling min CU | Autoscaling max CU |
|---|---|---|
| CU_1 (16 GB) | 4 (8 GB) | 8 (16 GB) |
| CU_2 (32 GB) | 8 (16 GB) | 16 (32 GB) |
| CU_4 (64 GB) | 16 (32 GB) | 32 (64 GB) |
| CU_8 (128 GB) | 64 (128 GB) | 64 (128 GB, fixed) |

Note: 1 Provisioned CU = 16 GB RAM, 1 Autoscaling CU = 2 GB RAM. The unit
was redefined; raw CU counts don't compare directly across versions.

## Related Skills

- **[databricks-lakebase-provisioned](../databricks-lakebase-provisioned/SKILL.md)** — source-side patterns for the legacy fixed-capacity model
- **[databricks-lakebase-autoscale](../databricks-lakebase-autoscale/SKILL.md)** — destination-side patterns including projects, branches, computes
- **[databricks-app-python](../databricks-app-python/SKILL.md)** — apps that connect to Lakebase via OAuth tokens
- **[databricks-bundles](../databricks-bundles/SKILL.md)** — Asset Bundle config for the app's `database` resource
- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** — `w.database` (Provisioned) vs `w.postgres` (Autoscaling) clients

## Notes

- **Postgres extensions installed by the Lakebase managed flow:** `databricks_auth` (OAuth bridge), `neon` (engine), `plpgsql` (PL/pgSQL). Raw `CREATE DATABASE` only installs `plpgsql`.
- **`databricks_list_roles`** is a view installed by `databricks_auth` — use it to see *registered* roles. `pg_roles` may show roles that aren't OAuth-resolvable.
- **The cosmetic restore warnings** (`transaction_timeout`, `permission denied for database`, `SyncedTable`) are not fatal but always appear on this path. Don't treat them as failures.
- **Idempotent retries:** all SQL in Step 4 is safe to re-run (it uses `IF NOT EXISTS` / catches duplicates). The `pg_restore` itself is not idempotent — running it twice produces "already exists" errors that are recoverable but noisy.
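
On that last note: if a restore dies partway and you need a clean re-run,
`pg_restore`'s standard `--clean --if-exists` flags drop restored objects
before recreating them. A sketch — weigh it against your tolerance for
dropping destination objects first:

```bash
# Clean re-run of a partially applied restore (standard pg_restore flags).
# --clean drops each target object before recreating it; --if-exists suppresses
# errors for objects the failed first pass never got to.
pg_restore -v --clean --if-exists \
  --no-owner --no-acl \
  --role="<sp-client-id>" \
  -h <new-instance-host> \
  -U "<your-user>" \
  -d <database> \
  /tmp/lakebase_migration/migration.bak
```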
diff --git a/databricks-skills/install_skills.sh b/databricks-skills/install_skills.sh
index 0fc2e1d2..f902747d 100755
--- a/databricks-skills/install_skills.sh
+++ b/databricks-skills/install_skills.sh
@@ -47,7 +47,7 @@ MLFLOW_REPO_RAW_URL="https://raw.githubusercontent.com/mlflow/skills"
 MLFLOW_REPO_REF="main"
 
 # Databricks skills (hosted in this repo)
-DATABRICKS_SKILLS="databricks-agent-bricks databricks-ai-functions databricks-aibi-dashboards databricks-bundles databricks-app-python databricks-config databricks-dbsql databricks-docs databricks-genie databricks-iceberg databricks-jobs databricks-lakebase-autoscale databricks-lakebase-provisioned databricks-metric-views databricks-mlflow-evaluation databricks-model-serving databricks-python-sdk databricks-execution-compute databricks-spark-declarative-pipelines databricks-spark-structured-streaming databricks-synthetic-data-gen databricks-unity-catalog databricks-unstructured-pdf-generation databricks-vector-search databricks-zerobus-ingest spark-python-data-source"
+DATABRICKS_SKILLS="databricks-agent-bricks databricks-ai-functions databricks-aibi-dashboards databricks-bundles databricks-app-python databricks-config databricks-dbsql databricks-docs databricks-genie databricks-iceberg databricks-jobs databricks-lakebase-autoscale databricks-lakebase-migration databricks-lakebase-provisioned databricks-metric-views databricks-mlflow-evaluation databricks-model-serving databricks-python-sdk databricks-execution-compute databricks-spark-declarative-pipelines databricks-spark-structured-streaming databricks-synthetic-data-gen databricks-unity-catalog databricks-unstructured-pdf-generation databricks-vector-search databricks-zerobus-ingest spark-python-data-source"
 
 # MLflow skills (fetched from mlflow/skills repo)
 MLFLOW_SKILLS="agent-evaluation analyze-mlflow-chat-session analyze-mlflow-trace instrumenting-with-mlflow-tracing mlflow-onboarding querying-mlflow-metrics retrieving-mlflow-traces searching-mlflow-docs"
@@ -83,6 +83,7 @@ get_skill_description() {
     "databricks-execution-compute") echo "Execute code and manage compute on Databricks - serverless, clusters, and SQL warehouses" ;;
     "databricks-unity-catalog") echo "System tables for lineage, audit, billing" ;;
     "databricks-lakebase-autoscale") echo "Lakebase Autoscale - managed PostgreSQL with autoscaling" ;;
+    "databricks-lakebase-migration") echo "Lakebase Provisioned → Autoscaling migration via pg_dump/pg_restore" ;;
     "databricks-lakebase-provisioned") echo "Lakebase Provisioned - data connections and reverse ETL" ;;
     "databricks-metric-views") echo "Unity Catalog Metric Views - governed business metrics in YAML" ;;
     "databricks-model-serving") echo "Model Serving - deploy MLflow models and AI agents" ;;