From 8f580499c54be931e9584d1cd825d15b4b523b94 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Thu, 23 Apr 2026 19:19:22 +1000 Subject: [PATCH 1/6] docs(lakebase-autoscale): lead with canonical psycopg_pool + OAuthConnection pattern MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Restructure connection-patterns.md to match the official Databricks tutorial and databricks-ai-bridge reference implementation: - Pattern 1 (canonical, new): psycopg_pool.ConnectionPool + OAuthConnection subclass + max_lifetime=2700. Zero background threads, rotation via pool recycling. This is what docs.databricks.com's Lakebase Apps tutorial uses. - Pattern 2: SQLAlchemy do_connect event (was previously presented as the production pattern — now demoted to "alternative for apps already using SQLAlchemy async", with an explicit note it adds unnecessary complexity). - Pattern 3: Direct psycopg.connect for scripts/notebooks. - Pattern 4: Static URL for local dev. New explicit warnings: - config.token / oauth_token().access_token is WORKSPACE-scoped and will fail at Postgres login. Must use w.postgres.generate_database_credential(). - max_lifetime=3600 (the default) creates a race condition; use 2700 so the pool recycles 15 min before the 1-hour token expiry. - ENDPOINT_NAME env var must be set manually — Databricks auto-injects PGHOST/PGPORT/PGDATABASE/PGUSER/PGSSLMODE but NOT the endpoint path. 
Canonical sources cited: - docs.databricks.com/aws/en/oltp/projects/tutorial-databricks-apps-autoscaling - docs.databricks.com/aws/en/oltp/projects/external-apps-connect - github.com/databricks/databricks-ai-bridge (src/databricks_ai_bridge/lakebase.py) Co-authored-by: Isaac --- .../connection-patterns.md | 350 ++++++++++++------ 1 file changed, 230 insertions(+), 120 deletions(-) diff --git a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md index 398862b3..bd13c2b3 100644 --- a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md +++ b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md @@ -2,81 +2,187 @@ ## Overview -This document covers different connection patterns for Lakebase Autoscaling, from simple scripts to production applications with token refresh. +This document covers the canonical connection patterns for Lakebase Autoscaling, ordered by recommendation: -## Authentication Methods +1. **`psycopg_pool.ConnectionPool` + `OAuthConnection`** — canonical for production Databricks Apps. Used by the [official tutorial](https://docs.databricks.com/aws/en/oltp/projects/tutorial-databricks-apps-autoscaling), the [external app SDK guide](https://docs.databricks.com/aws/en/oltp/projects/external-apps-connect), and [`databricks-ai-bridge`](https://github.com/databricks/databricks-ai-bridge/blob/main/src/databricks_ai_bridge/lakebase.py). Zero background threads — rotation is handled by pool recycling. +2. **SQLAlchemy `do_connect` event + background refresh** — alternative for apps already using SQLAlchemy async. Works but adds a background `asyncio.Task` you don't need. +3. **Direct `psycopg.connect`** — only for one-off scripts / notebooks where the session lives < 1 hour. +4. **Static URL** — local development only. 
+ +## Authentication Lakebase Autoscaling supports two authentication methods: | Method | Token Lifetime | Best For | |--------|---------------|----------| -| **OAuth tokens** | 1 hour (must refresh) | Interactive sessions, workspace-integrated apps | +| **OAuth tokens** (`generate_database_credential`) | 1 hour, enforced at login only | Apps — rotate via pool recycling | | **Native Postgres passwords** | No expiry | Long-running processes, tools without token rotation | +**Critical distinction:** The workspace OAuth token (`w.config.oauth_token().access_token`) is *workspace-scoped* — it will **fail at PG login**. You must call `w.postgres.generate_database_credential(endpoint=...)` to mint a separate *Lakebase-scoped* JWT: + +```python +# ✅ CORRECT — Lakebase-scoped database credential +cred = w.postgres.generate_database_credential(endpoint=endpoint_name) +password = cred.token + +# ❌ WRONG — workspace-scoped token +password = w.config.oauth_token().access_token +``` + **Connection timeouts (both methods):** - **24-hour idle timeout**: Connections with no activity for 24 hours are automatically closed - **3-day maximum connection life**: Connections alive for more than 3 days may be closed Design your applications to handle connection timeouts with retry logic. -## Connection Methods +## 1. `psycopg_pool.ConnectionPool` + `OAuthConnection` (CANONICAL) + +This is the pattern from the official Databricks tutorial, external app guide, and `databricks-ai-bridge`. **Use this for any production Databricks App.** + +### How it works -### 1. Direct psycopg Connection (Simple Scripts) +1. `OAuthConnection.connect()` mints a fresh Lakebase credential every time the pool opens a new physical connection. +2. Lakebase tokens expire at 1 hour, but expiration is enforced **only at login** — already-open connections stay valid. +3. `max_lifetime=2700` (45 min) tells the pool to recycle connections before tokens expire. 
When the pool reopens, `OAuthConnection.connect()` fires and gets a fresh token. +4. The 15-minute buffer (60 min token − 45 min recycle) means you never race against expiry. -For one-off scripts or notebooks: +**Result:** Fully transparent token rotation with zero background tasks, zero timers, zero manual refresh logic. + +> **Why not `max_lifetime=3600` (the default)?** You'd hand out connections with nearly-expired tokens. A connection established at minute 59 with a token that expires at minute 60 will fail a minute later. Always use 2700. + +### `app.yaml` + +```yaml +command: ['flask', '--app', 'app.py', 'run', '--host', '0.0.0.0', '--port', '8000'] +env: + # These 5 are auto-injected when you add a Lakebase (postgres) resource in the UI: + # PGHOST, PGPORT, PGDATABASE, PGUSER, PGSSLMODE + # You MUST manually add ENDPOINT_NAME — it's needed by generate_database_credential(): + - name: ENDPOINT_NAME + value: 'projects//branches//endpoints/' +``` + +### `requirements.txt` + +``` +flask +psycopg[binary,pool]>=3.1.0 +databricks-sdk>=0.81.0 +``` + +### `app.py` (Flask) ```python -import psycopg +import os from databricks.sdk import WorkspaceClient +import psycopg +from psycopg_pool import ConnectionPool +from flask import Flask -def get_connection(project_id: str, branch_id: str = "production", - endpoint_id: str = None, database_name: str = "databricks_postgres"): - """Get a database connection with fresh OAuth token.""" - w = WorkspaceClient() +app = Flask(__name__) - # Get endpoint details to find the host - if endpoint_id: - ep_name = f"projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id}" - else: - # List endpoints and pick the primary R/W one - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{project_id}/branches/{branch_id}" - )) - ep_name = endpoints[0].name +# Inside Databricks Apps, WorkspaceClient() auto-authenticates via SP credentials. 
+w = WorkspaceClient() - endpoint = w.postgres.get_endpoint(name=ep_name) - host = endpoint.status.hosts.host - # Generate OAuth token (valid for 1 hour) - cred = w.postgres.generate_database_credential(endpoint=ep_name) +class OAuthConnection(psycopg.Connection): + """Inject a fresh Lakebase OAuth token on every pool-opened connection. - # Build connection string - conn_string = ( - f"host={host} " - f"dbname={database_name} " - f"user={w.current_user.me().user_name} " - f"password={cred.token} " - f"sslmode=require" - ) + The pool calls OAuthConnection.connect() when: + - Filling min_size on startup + - Recycling a connection (max_lifetime exceeded) + - Creating a new connection under load + - Replacing a connection that failed health-check - return psycopg.connect(conn_string) + No background refresh thread is needed: tokens are always fresh at login + time, and login is where Lakebase enforces expiration. + """ -# Usage -with get_connection("my-app") as conn: - with conn.cursor() as cur: - cur.execute("SELECT NOW()") - print(cur.fetchone()) + @classmethod + def connect(cls, conninfo='', **kwargs): + endpoint_name = os.environ["ENDPOINT_NAME"] + cred = w.postgres.generate_database_credential(endpoint=endpoint_name) + kwargs['password'] = cred.token + return super().connect(conninfo, **kwargs) + + +username = os.environ["PGUSER"] # SP client ID — auto-injected +host = os.environ["PGHOST"] # e.g. ep-restless-pond-e4wvk0yn... — auto-injected +port = os.environ.get("PGPORT", "5432") +database = os.environ["PGDATABASE"] # typically "databricks_postgres" — auto-injected +sslmode = os.environ.get("PGSSLMODE", "require") + +pool = ConnectionPool( + conninfo=( + f"dbname={database} user={username} " + f"host={host} port={port} sslmode={sslmode}" + ), + connection_class=OAuthConnection, + min_size=1, + max_size=10, + # CRITICAL: 2700 (45 min), not the 3600 default. + # Recycles connections 15 min before the 1-hour token expiry. 
+ max_lifetime=2700, + open=True, +) + + +@app.route('/') +def index(): + with pool.connection() as conn: + with conn.cursor() as cur: + cur.execute("SELECT current_user, current_database()") + row = cur.fetchone() + return f"Connected as {row[0]} to {row[1]}" + + +if __name__ == '__main__': + app.run(host="0.0.0.0", port=8000) ``` -### 2. Connection Pool with Token Refresh (Production) +### FastAPI variant -For long-running applications that need connection pooling: +Identical pattern, but use `open=False` with an explicit lifespan so startup failures surface immediately: ```python -import asyncio -import uuid from contextlib import asynccontextmanager +from fastapi import FastAPI + +pool = ConnectionPool( + conninfo=..., + connection_class=OAuthConnection, + min_size=1, max_size=10, + max_lifetime=2700, + open=False, # Opened explicitly in lifespan +) + + +@asynccontextmanager +async def lifespan(app: FastAPI): + pool.open(wait=True, timeout=30.0) # Fail fast if DB unreachable + yield + pool.close() + + +app = FastAPI(lifespan=lifespan) + + +@app.get("/api/data") +def get_data(): # sync def — FastAPI runs in threadpool automatically + with pool.connection() as conn: + with conn.cursor() as cur: + cur.execute("SELECT ...") + return cur.fetchall() +``` + +## 2. SQLAlchemy `do_connect` Event (Alternative) + +**Use only if your app is already SQLAlchemy-async.** Otherwise prefer pattern 1 — this adds a background refresh task you don't need. + +```python +import asyncio from typing import AsyncGenerator, Optional +from contextlib import asynccontextmanager from sqlalchemy import event from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker @@ -84,7 +190,11 @@ from databricks.sdk import WorkspaceClient class LakebaseAutoscaleConnectionManager: - """Manages Lakebase Autoscaling connections with automatic token refresh.""" + """Manages Lakebase Autoscaling connections with background token refresh. 
+ + This pattern works but adds operational complexity (a background asyncio.Task) + that isn't necessary. Prefer psycopg_pool + OAuthConnection (pattern 1). + """ def __init__( self, @@ -93,7 +203,7 @@ class LakebaseAutoscaleConnectionManager: database_name: str = "databricks_postgres", pool_size: int = 5, max_overflow: int = 10, - token_refresh_seconds: int = 3000 # 50 minutes + token_refresh_seconds: int = 3000, # 50 minutes ): self.project_id = project_id self.branch_id = branch_id @@ -107,32 +217,28 @@ class LakebaseAutoscaleConnectionManager: self._engine = None self._session_maker = None - def _generate_token(self) -> str: - """Generate fresh OAuth token.""" + def _endpoint_name(self) -> str: w = WorkspaceClient() - # Get primary endpoint name for token scoping endpoints = list(w.postgres.list_endpoints( parent=f"projects/{self.project_id}/branches/{self.branch_id}" )) - endpoint_name = endpoints[0].name if endpoints else None - cred = w.postgres.generate_database_credential(endpoint=endpoint_name) + if not endpoints: + raise RuntimeError( + f"No endpoints for projects/{self.project_id}/branches/{self.branch_id}" + ) + return endpoints[0].name + + def _generate_token(self) -> str: + w = WorkspaceClient() + cred = w.postgres.generate_database_credential(endpoint=self._endpoint_name()) return cred.token def _get_host(self) -> str: - """Get the connection host from the primary endpoint.""" w = WorkspaceClient() - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{self.project_id}/branches/{self.branch_id}" - )) - if not endpoints: - raise RuntimeError( - f"No endpoints found for projects/{self.project_id}/branches/{self.branch_id}" - ) - endpoint = w.postgres.get_endpoint(name=endpoints[0].name) - return endpoint.status.hosts.host + ep = w.postgres.get_endpoint(name=self._endpoint_name()) + return ep.status.hosts.host async def _refresh_loop(self): - """Background task to refresh token periodically.""" while True: await 
asyncio.sleep(self.token_refresh_seconds) try: @@ -141,48 +247,34 @@ class LakebaseAutoscaleConnectionManager: print(f"Token refresh failed: {e}") def initialize(self): - """Initialize database engine and start token refresh.""" w = WorkspaceClient() - - # Get host info host = self._get_host() username = w.current_user.me().user_name - # Generate initial token self._current_token = self._generate_token() - # Create engine (password injected via event) - url = ( - f"postgresql+psycopg://{username}@" - f"{host}:5432/{self.database_name}" - ) - + url = f"postgresql+psycopg://{username}@{host}:5432/{self.database_name}" self._engine = create_async_engine( url, pool_size=self.pool_size, max_overflow=self.max_overflow, pool_recycle=3600, - connect_args={"sslmode": "require"} + connect_args={"sslmode": "require"}, ) - # Inject token on connect @event.listens_for(self._engine.sync_engine, "do_connect") def inject_token(dialect, conn_rec, cargs, cparams): cparams["password"] = self._current_token self._session_maker = async_sessionmaker( - self._engine, - class_=AsyncSession, - expire_on_commit=False + self._engine, class_=AsyncSession, expire_on_commit=False ) def start_refresh(self): - """Start background token refresh task.""" if not self._refresh_task: self._refresh_task = asyncio.create_task(self._refresh_loop()) async def stop_refresh(self): - """Stop token refresh task.""" if self._refresh_task: self._refresh_task.cancel() try: @@ -193,73 +285,90 @@ class LakebaseAutoscaleConnectionManager: @asynccontextmanager async def session(self) -> AsyncGenerator[AsyncSession, None]: - """Get a database session.""" async with self._session_maker() as session: yield session async def close(self): - """Close all connections.""" await self.stop_refresh() if self._engine: await self._engine.dispose() +``` +## 3. 
Direct `psycopg.connect` (Scripts / Notebooks Only) -# Usage in FastAPI -from fastapi import FastAPI +For one-off scripts or notebooks where the process lives well under an hour: + +```python +import psycopg +from databricks.sdk import WorkspaceClient -app = FastAPI() -db_manager = LakebaseAutoscaleConnectionManager("my-app", "production", "my_database") -@app.on_event("startup") -async def startup(): - db_manager.initialize() - db_manager.start_refresh() +def get_connection(project_id: str, branch_id: str = "production", + endpoint_id: str = None, database_name: str = "databricks_postgres"): + """Get a one-shot database connection with a fresh OAuth token.""" + w = WorkspaceClient() -@app.on_event("shutdown") -async def shutdown(): - await db_manager.close() + if endpoint_id: + ep_name = f"projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id}" + else: + # Pick the first endpoint under the branch + endpoints = list(w.postgres.list_endpoints( + parent=f"projects/{project_id}/branches/{branch_id}" + )) + ep_name = endpoints[0].name -@app.get("/data") -async def get_data(): - async with db_manager.session() as session: - result = await session.execute("SELECT * FROM my_table") - return result.fetchall() -``` + endpoint = w.postgres.get_endpoint(name=ep_name) + host = endpoint.status.hosts.host -### 3. Static URL Mode (Local Development) + cred = w.postgres.generate_database_credential(endpoint=ep_name) + + return psycopg.connect( + host=host, + dbname=database_name, + user=w.current_user.me().user_name, + password=cred.token, + sslmode="require", + ) + + +# Usage +with get_connection("my-app") as conn: + with conn.cursor() as cur: + cur.execute("SELECT NOW()") + print(cur.fetchone()) +``` -For local development, use a static connection URL: +## 4. 
Static URL (Local Development Only) ```python import os from sqlalchemy.ext.asyncio import create_async_engine -# Set environment variable with full connection URL # LAKEBASE_PG_URL=postgresql://user:password@host:5432/database def get_database_url() -> str: - """Get database URL from environment.""" - url = os.environ.get("LAKEBASE_PG_URL") - if url and url.startswith("postgresql://"): - # Convert to psycopg3 async driver + url = os.environ.get("LAKEBASE_PG_URL", "") + if url.startswith("postgresql://"): url = url.replace("postgresql://", "postgresql+psycopg://", 1) return url + engine = create_async_engine( get_database_url(), pool_size=5, - connect_args={"sslmode": "require"} + connect_args={"sslmode": "require"}, ) ``` -### 4. DNS Resolution Workaround (macOS) +## DNS Resolution Workaround (macOS) -Python's `socket.getaddrinfo()` fails with long hostnames on macOS. Use `dig` as fallback: +Python's `socket.getaddrinfo()` can fail with long hostnames on macOS. Fall back to `dig`: ```python import subprocess import socket + def resolve_hostname(hostname: str) -> str: """Resolve hostname using dig command (macOS workaround).""" try: @@ -270,10 +379,9 @@ def resolve_hostname(hostname: str) -> str: try: result = subprocess.run( ["dig", "+short", hostname], - capture_output=True, text=True, timeout=5 + capture_output=True, text=True, timeout=5, ) - ips = result.stdout.strip().split('\n') - for ip in ips: + for ip in result.stdout.strip().split('\n'): if ip and not ip.startswith(';'): return ip except Exception: @@ -281,24 +389,26 @@ def resolve_hostname(hostname: str) -> str: raise RuntimeError(f"Could not resolve hostname: {hostname}") -# Use with psycopg + +# Use with psycopg: set `host` for TLS SNI and `hostaddr` for the actual connection conn_params = { - "host": hostname, # For TLS SNI - "hostaddr": resolve_hostname(hostname), # Actual IP + "host": hostname, + "hostaddr": resolve_hostname(hostname), "dbname": database_name, "user": username, "password": token, - 
"sslmode": "require" + "sslmode": "require", } conn = psycopg.connect(**conn_params) ``` ## Best Practices -1. **Always use SSL**: Set `sslmode=require` in all connections -2. **Implement token refresh**: Tokens expire after 1 hour; refresh at 50 minutes -3. **Use connection pooling**: Avoid creating new connections per request -4. **Handle DNS issues on macOS**: Use the `hostaddr` workaround if needed -5. **Close connections properly**: Use context managers or explicit cleanup -6. **Handle scale-to-zero wake-up**: First connection after idle may take 2-5 seconds -7. **Log token refresh events**: Helps debug authentication issues +1. **Default to pattern 1** (`psycopg_pool.ConnectionPool` + `OAuthConnection`). It's the canonical Databricks App pattern, works out of the box, no background tasks. +2. **Use `max_lifetime=2700`, not 3600.** The default creates a race condition where connections are handed out with nearly-expired tokens. +3. **Always `sslmode=require`** on every connection (it's auto-injected as `PGSSLMODE` in Databricks Apps). +4. **Never use `config.token` / `oauth_token().access_token` as the PG password** — that's a workspace-scoped token. Use `generate_database_credential()` to mint a Lakebase-scoped one. +5. **Handle DNS issues on macOS** using the `hostaddr` workaround if your dev machine can't resolve Lakebase hostnames. +6. **Use context managers** (`with pool.connection() as conn:`) so connections are always returned to the pool. +7. **Expect 2-5 second wake-up latency** on the first query after scale-to-zero — retry with backoff. +8. **Log credential refresh events** in `OAuthConnection.connect()` during early development — makes token-related failures easy to spot. 
From 6f319d6c1236b8e4d86177b0076d108e872a8b94 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Thu, 23 Apr 2026 19:20:16 +1000 Subject: [PATCH 2/6] docs(lakebase-autoscale): add "no separate Lakebase SDK" framing + cross-language table MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The existing overview jumped straight into features. Readers arriving from "how do I use Lakebase from Python?" needed two things made explicit: 1. There is no separate Lakebase SDK for Python. You use databricks-sdk only for minting OAuth credentials; a standard Postgres driver does the actual queries. (This was implicit in the connection patterns doc but not called out up-front.) 2. Node/TypeScript has a convenience wrapper: @databricks/lakebase (re-exported by @databricks/appkit). Autoscaling-only, not Provisioned. Worth mentioning so JS/TS readers know it exists. Also added a cross-language summary table and an explicit "What NOT to do" list — most importantly flagging that WorkspaceClient().config.token is workspace-scoped and will be rejected at Postgres login. This is a trap several of us have fallen into. Co-authored-by: Isaac --- .../databricks-lakebase-autoscale/SKILL.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/databricks-skills/databricks-lakebase-autoscale/SKILL.md b/databricks-skills/databricks-lakebase-autoscale/SKILL.md index f471765c..98a46181 100644 --- a/databricks-skills/databricks-lakebase-autoscale/SKILL.md +++ b/databricks-skills/databricks-lakebase-autoscale/SKILL.md @@ -20,6 +20,25 @@ Use this skill when: Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service for OLTP workloads. It provides autoscaling compute, Git-like branching, scale-to-zero, and instant point-in-time restore. 
+> **There is no separate "Lakebase SDK" for Python.** You use the Databricks SDK (`databricks-sdk`) **only** to mint short-lived OAuth credentials via `WorkspaceClient().postgres.generate_database_credential(...)`, then connect with a standard Postgres driver (`psycopg`, `SQLAlchemy`, JDBC, etc.). For Node/TypeScript, the convenience wrapper [`@databricks/lakebase`](https://github.com/databricks/appkit/blob/main/packages/lakebase/README.md) exists (Autoscaling only — not Provisioned). + +### Cross-language summary + +| Language | Credential SDK | DB Driver | +|----------|----------------|-----------| +| **Python** | `databricks-sdk` (`WorkspaceClient`) | `psycopg[binary,pool]` (canonical) or `SQLAlchemy` | +| **Node/TS** | `@databricks/lakebase` (handles both) | `@databricks/lakebase` wraps `pg` pool | +| **Java/Go** | Databricks SDK for Java/Go | Standard JDBC / `pgx` | + +### What NOT to do + +- ❌ Hardcode a static Postgres password +- ❌ Manually manage long-lived DB credentials +- ❌ Use `WorkspaceClient().config.token` as the Postgres password — that's a **workspace-scoped** token and will fail at Postgres login. You need the Lakebase-scoped token from `generate_database_credential()`. 
+- ❌ Treat Lakebase like a Databricks SQL warehouse connection (it's Postgres, not DBSQL) +- ❌ Bypass the app resource model when running inside a Databricks App + + | Feature | Description | |---------|-------------| | **Autoscaling Compute** | 0.5-112 CU with 2 GB RAM per CU; scales dynamically based on load | From 3795f6bd529653ac00eb7c018a5c47c1cdeebb26 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Mon, 27 Apr 2026 22:56:35 +1000 Subject: [PATCH 3/6] docs(lakebase-autoscale): address PR review feedback - Fix PGAPPNAME omission: 6 env vars auto-injected, not 5; note multi-resource caveat - Add psycopg3 pin comment explaining why psycopg2 won't work (no connection_class hook) - Strengthen open=False rationale: deprecated for AsyncConnectionPool, errors in psycopg 4.0 - Clarify @databricks/lakebase scope in cross-language table (Autoscaling only) Co-authored-by: Isaac --- databricks-skills/databricks-lakebase-autoscale/SKILL.md | 2 +- .../databricks-lakebase-autoscale/connection-patterns.md | 9 +++++---- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/databricks-skills/databricks-lakebase-autoscale/SKILL.md b/databricks-skills/databricks-lakebase-autoscale/SKILL.md index 98a46181..05d46d3b 100644 --- a/databricks-skills/databricks-lakebase-autoscale/SKILL.md +++ b/databricks-skills/databricks-lakebase-autoscale/SKILL.md @@ -27,7 +27,7 @@ Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service f | Language | Credential SDK | DB Driver | |----------|----------------|-----------| | **Python** | `databricks-sdk` (`WorkspaceClient`) | `psycopg[binary,pool]` (canonical) or `SQLAlchemy` | -| **Node/TS** | `@databricks/lakebase` (handles both) | `@databricks/lakebase` wraps `pg` pool | +| **Node/TS** | `@databricks/lakebase` (Autoscaling only) | `@databricks/lakebase` wraps `pg` pool | | **Java/Go** | Databricks SDK for Java/Go | Standard JDBC / `pgx` | ### What NOT to do diff --git 
a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md index bd13c2b3..7c54be8a 100644 --- a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md +++ b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md @@ -55,8 +55,9 @@ This is the pattern from the official Databricks tutorial, external app guide, a ```yaml command: ['flask', '--app', 'app.py', 'run', '--host', '0.0.0.0', '--port', '8000'] env: - # These 5 are auto-injected when you add a Lakebase (postgres) resource in the UI: - # PGHOST, PGPORT, PGDATABASE, PGUSER, PGSSLMODE + # These 6 are auto-injected when you add a Lakebase (postgres) resource in the UI: + # PGAPPNAME, PGHOST, PGPORT, PGDATABASE, PGUSER, PGSSLMODE + # Only the *first* database resource gets auto-injected; extra resources need explicit valueFrom. # You MUST manually add ENDPOINT_NAME — it's needed by generate_database_credential(): - name: ENDPOINT_NAME value: 'projects//branches//endpoints/' @@ -66,7 +67,7 @@ env: ``` flask -psycopg[binary,pool]>=3.1.0 +psycopg[binary,pool]>=3.1.0 # psycopg3 required — psycopg2.pool has no connection_class hook for OAuthConnection databricks-sdk>=0.81.0 ``` @@ -142,7 +143,7 @@ if __name__ == '__main__': ### FastAPI variant -Identical pattern, but use `open=False` with an explicit lifespan so startup failures surface immediately: +Identical pattern, but use `open=False` with an explicit lifespan. Two reasons: (1) startup failures surface immediately via `pool.open(wait=True)`; (2) `open=True` is deprecated for `AsyncConnectionPool` and will raise an error in psycopg 4.0 — using `open=False` + lifespan is the forward-compatible pattern for any FastAPI app regardless of sync/async pool. 
```python from contextlib import asynccontextmanager From 20df098ae82942fed119f40ec90ca73dcad44033 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sat, 9 May 2026 14:57:19 +1000 Subject: [PATCH 4/6] docs(lakebase-autoscale): soften 2700 framing + clarify Pattern 2 scope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address Can Köklü's two soft suggestions from internal Slack feedback: - Soften "Always use 2700" to "Prefer 2700" — note that the official tutorial doesn't set max_lifetime and databricks-ai-bridge uses 2700, so 2700 is a defensive convention rather than a spec requirement. - Retitle Pattern 2 to "SQLAlchemy do_connect Event + Background Refresh Loop (Alternative)" so the demotion clearly targets the homegrown asyncio.Task refresh loop, not do_connect itself. do_connect is the official Databricks SQLAlchemy auth hook. - Add a callout in Pattern 2 distinguishing the official do_connect event from the community asyncio.Task variant, and a one-line alternative path for SQLAlchemy users who don't want a background loop. No structural changes — densification pass to follow. Co-authored-by: Isaac --- .../connection-patterns.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md index 7c54be8a..b789fd8e 100644 --- a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md +++ b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md @@ -48,7 +48,7 @@ This is the pattern from the official Databricks tutorial, external app guide, a **Result:** Fully transparent token rotation with zero background tasks, zero timers, zero manual refresh logic. -> **Why not `max_lifetime=3600` (the default)?** You'd hand out connections with nearly-expired tokens. 
A connection established at minute 59 with a token that expires at minute 60 will fail a minute later. Always use 2700. +> **Why not `max_lifetime=3600` (the default)?** You'd hand out connections with nearly-expired tokens. A connection established at minute 59 with a token that expires at minute 60 will fail a minute later. Prefer 2700 — a 15-minute buffer before the 1-hour expiry. (The official tutorial leaves `max_lifetime` unset and relies on psycopg's defaults; `databricks-ai-bridge` uses 2700. 2700 isn't prescribed by any official spec — it's a defensive convention.) ### `app.yaml` @@ -121,8 +121,8 @@ pool = ConnectionPool( connection_class=OAuthConnection, min_size=1, max_size=10, - # CRITICAL: 2700 (45 min), not the 3600 default. - # Recycles connections 15 min before the 1-hour token expiry. + # 2700 (45 min) recycles connections 15 min before the 1-hour token expiry. + # The official tutorial doesn't set max_lifetime; databricks-ai-bridge uses 2700. max_lifetime=2700, open=True, ) @@ -176,9 +176,11 @@ def get_data(): # sync def — FastAPI runs in threadpool automatically return cur.fetchall() ``` -## 2. SQLAlchemy `do_connect` Event (Alternative) +## 2. SQLAlchemy `do_connect` Event + Background Refresh Loop (Alternative) -**Use only if your app is already SQLAlchemy-async.** Otherwise prefer pattern 1 — this adds a background refresh task you don't need. +**Use only if your app is already SQLAlchemy-async.** Otherwise prefer pattern 1 — the variant below adds a background refresh task you don't need. + +> **What's official vs. what's a community variant.** The `do_connect` event itself is the official Databricks-recommended way to inject credentials into a SQLAlchemy engine, and `databricks-ai-bridge.AsyncLakebaseSQLAlchemy` uses it. What's *not* in any official doc is layering a background `asyncio.Task` on top to pre-warm tokens. That's the part this section demotes. 
If you're already on SQLAlchemy and want to avoid a background loop, the simplest port is to call `engine.dispose()` on a schedule (or rely on `pool_recycle`) and let `do_connect` re-mint the credential on the next checkout — same idea as pattern 1, just routed through SQLAlchemy. ```python import asyncio @@ -406,7 +408,7 @@ conn = psycopg.connect(**conn_params) ## Best Practices 1. **Default to pattern 1** (`psycopg_pool.ConnectionPool` + `OAuthConnection`). It's the canonical Databricks App pattern, works out of the box, no background tasks. -2. **Use `max_lifetime=2700`, not 3600.** The default creates a race condition where connections are handed out with nearly-expired tokens. +2. **Prefer `max_lifetime=2700` over the 3600 default.** A 15-minute buffer before the 1-hour token expiry avoids handing out connections with nearly-expired tokens. Not a hard spec — the official tutorial doesn't set it; `databricks-ai-bridge` uses 2700. 3. **Always `sslmode=require`** on every connection (it's auto-injected as `PGSSLMODE` in Databricks Apps). 4. **Never use `config.token` / `oauth_token().access_token` as the PG password** — that's a workspace-scoped token. Use `generate_database_credential()` to mint a Lakebase-scoped one. 5. **Handle DNS issues on macOS** using the `hostaddr` workaround if your dev machine can't resolve Lakebase hostnames. From 94f9a7f153002563e4b83591798b12b352dcebb9 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sat, 9 May 2026 15:07:29 +1000 Subject: [PATCH 5/6] docs(lakebase-autoscale): densify per Quentin's review (gpt-5.5 in logfood) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ran the same densification prompt Quentin used on the mlflow skill against this skill via gpt-5.5 in logfood. Restructure: 6 files / 1,439 lines → 4 files / 769 lines (47% reduction). Structural changes: - SKILL.md trimmed to dense overview + cross-language framing + resource model + non-obvious facts to preserve. 
Trigger description retained verbatim (the "Use when..." phrasing required by the skill convention). - connection-patterns.md → connections.md. Drops the full Flask/FastAPI app implementations and the LakebaseAutoscaleConnectionManager class; keeps the canonical OAuthConnection skeleton, the do_connect hook, Databricks Apps env-var gotchas, DNS workaround, retry/timeout notes. - projects.md + branches.md + computes.md → operations.md. Drops generic SDK CRUD examples; keeps API names, FieldMask paths, TTL/protected/default/parent-child constraints, CU/RAM/connection-limit table, scale-to-zero defaults, project limits, MCP tool intent. - reverse-etl.md compressed; keeps namespace split (w.database, not w.postgres), CDF requirement, type mapping, limits, and the deletion sequence. Hard constraints preserved through the densification: - Canonical Pattern 1 (psycopg_pool + OAuthConnection + max_lifetime=2700). - The "config.token is workspace-scoped and FAILS at Postgres login — use generate_database_credential() instead" warning. - Cross-language Python/Node-TS/Java-Go table. - "There is no separate Lakebase SDK" framing. - "Prefer 2700" softening (no "Always use 2700") — defensive convention, not a spec requirement. - do_connect is the official Databricks SQLAlchemy auth hook (databricks-ai-bridge uses it); only the homegrown asyncio.Task refresh loop is demoted as a community variant. 
Co-authored-by: Isaac --- .../databricks-lakebase-autoscale/SKILL.md | 398 ++++------------- .../databricks-lakebase-autoscale/branches.md | 212 --------- .../databricks-lakebase-autoscale/computes.md | 208 --------- .../connection-patterns.md | 417 ------------------ .../connections.md | 212 +++++++++ .../operations.md | 297 +++++++++++++ .../databricks-lakebase-autoscale/projects.md | 204 --------- .../reverse-etl.md | 174 +++----- 8 files changed, 660 insertions(+), 1462 deletions(-) delete mode 100644 databricks-skills/databricks-lakebase-autoscale/branches.md delete mode 100644 databricks-skills/databricks-lakebase-autoscale/computes.md delete mode 100644 databricks-skills/databricks-lakebase-autoscale/connection-patterns.md create mode 100644 databricks-skills/databricks-lakebase-autoscale/connections.md create mode 100644 databricks-skills/databricks-lakebase-autoscale/operations.md delete mode 100644 databricks-skills/databricks-lakebase-autoscale/projects.md diff --git a/databricks-skills/databricks-lakebase-autoscale/SKILL.md b/databricks-skills/databricks-lakebase-autoscale/SKILL.md index 05d46d3b..e8a4d61a 100644 --- a/databricks-skills/databricks-lakebase-autoscale/SKILL.md +++ b/databricks-skills/databricks-lakebase-autoscale/SKILL.md @@ -5,349 +5,129 @@ description: "Patterns and best practices for Lakebase Autoscaling (next-gen man # Lakebase Autoscaling -Patterns and best practices for using Lakebase Autoscaling, the next-generation managed PostgreSQL on Databricks with autoscaling compute, branching, scale-to-zero, and instant restore. +Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service for OLTP workloads: autoscaling compute, database branching, scale-to-zero, instant restore, and Delta-to-Postgres synced tables. -## When to Use +Use this skill when creating/managing Lakebase Autoscaling projects, branches, endpoints/computes, credentials, reverse ETL synced tables, or app connections. 
-Use this skill when: -- Building applications that need a PostgreSQL database with autoscaling compute -- Working with database branching for dev/test/staging workflows -- Adding persistent state to applications with scale-to-zero cost savings -- Implementing reverse ETL from Delta Lake to an operational database via synced tables -- Managing Lakebase Autoscaling projects, branches, computes, or credentials +## Core framing -## Overview +> **There is no separate Python “Lakebase SDK.”** Use `databricks-sdk` for management and for minting short-lived database credentials with `WorkspaceClient().postgres.generate_database_credential(...)`; use standard Postgres drivers (`psycopg`, SQLAlchemy, JDBC, `pgx`, etc.) for SQL. -Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service for OLTP workloads. It provides autoscaling compute, Git-like branching, scale-to-zero, and instant point-in-time restore. - -> **There is no separate "Lakebase SDK" for Python.** You use the Databricks SDK (`databricks-sdk`) **only** to mint short-lived OAuth credentials via `WorkspaceClient().postgres.generate_database_credential(...)`, then connect with a standard Postgres driver (`psycopg`, `SQLAlchemy`, JDBC, etc.). For Node/TypeScript, the convenience wrapper [`@databricks/lakebase`](https://github.com/databricks/appkit/blob/main/packages/lakebase/README.md) exists (Autoscaling only — not Provisioned). 
- -### Cross-language summary - -| Language | Credential SDK | DB Driver | -|----------|----------------|-----------| -| **Python** | `databricks-sdk` (`WorkspaceClient`) | `psycopg[binary,pool]` (canonical) or `SQLAlchemy` | -| **Node/TS** | `@databricks/lakebase` (Autoscaling only) | `@databricks/lakebase` wraps `pg` pool | +| Language | Credential / management SDK | DB driver / wrapper | +|---|---|---| +| **Python** | `databricks-sdk` `WorkspaceClient().postgres` | `psycopg[binary,pool]` canonical; SQLAlchemy supported | +| **Node/TS** | `@databricks/lakebase` convenience wrapper, Autoscaling only | Wrapper manages `pg` pool | | **Java/Go** | Databricks SDK for Java/Go | Standard JDBC / `pgx` | -### What NOT to do - -- ❌ Hardcode a static Postgres password -- ❌ Manually manage long-lived DB credentials -- ❌ Use `WorkspaceClient().config.token` as the Postgres password — that's a **workspace-scoped** token and will fail at Postgres login. You need the Lakebase-scoped token from `generate_database_credential()`. -- ❌ Treat Lakebase like a Databricks SQL warehouse connection (it's Postgres, not DBSQL) -- ❌ Bypass the app resource model when running inside a Databricks App +## Lead connection pattern +For production Python apps, start with: -| Feature | Description | -|---------|-------------| -| **Autoscaling Compute** | 0.5-112 CU with 2 GB RAM per CU; scales dynamically based on load | -| **Scale-to-Zero** | Compute suspends after configurable inactivity timeout | -| **Branching** | Create isolated database environments (like Git branches) for dev/test | -| **Instant Restore** | Point-in-time restore from any moment within the configured window (up to 35 days) | -| **OAuth Authentication** | Token-based auth via Databricks SDK (1-hour expiry) | -| **Reverse ETL** | Sync data from Delta tables to PostgreSQL via synced tables | +1. `psycopg_pool.ConnectionPool` +2. 
`connection_class=OAuthConnection`, where `OAuthConnection(psycopg.Connection).connect()` calls `w.postgres.generate_database_credential(endpoint=...)` +3. `max_lifetime=2700` -**Available Regions (AWS):** us-east-1, us-east-2, eu-central-1, eu-west-1, eu-west-2, ap-south-1, ap-southeast-1, ap-southeast-2 +This is the canonical pattern from the official Databricks Apps + Lakebase Autoscaling tutorial lineage and `databricks-ai-bridge`: no background token thread; physical connections get fresh credentials when opened/recycled. -**Available Regions (Azure Beta):** eastus2, westeurope, westus +Prefer `max_lifetime=2700` as a defensive 45-minute recycle before 1-hour token expiry. The official tutorial does not set `max_lifetime`; `databricks-ai-bridge` uses `2700`. -## Project Hierarchy +See `connections.md`. -Understanding the hierarchy is essential for working with Lakebase Autoscaling: - -``` -Project (top-level container) - └── Branch(es) (isolated database environments) - ├── Compute (primary R/W endpoint) - ├── Read Replica(s) (optional, read-only) - ├── Role(s) (Postgres roles) - └── Database(s) (Postgres databases) - └── Schema(s) -``` +## Critical auth warning -| Object | Description | -|--------|-------------| -| **Project** | Top-level container. Created via `w.postgres.create_project()`. | -| **Branch** | Isolated database environment with copy-on-write storage. Default branch is `production`. | -| **Compute** | Postgres server powering a branch. Configurable CU sizing and autoscaling. | -| **Database** | Standard Postgres database within a branch. Default is `databricks_postgres`. | +Do **not** use `WorkspaceClient().config.token`, `w.config.oauth_token().access_token`, or any workspace-scoped OAuth token as the Postgres password. It will fail at Postgres login. 
-## Quick Start - -Create a project and connect: +Use: ```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Project, ProjectSpec - -w = WorkspaceClient() - -# Create a project (long-running operation) -operation = w.postgres.create_project( - project=Project( - spec=ProjectSpec( - display_name="My Application", - pg_version="17" - ) - ), - project_id="my-app" -) -result = operation.wait() -print(f"Created project: {result.name}") +cred = WorkspaceClient().postgres.generate_database_credential(endpoint=endpoint_name) +password = cred.token ``` -## Common Patterns - -### Generate OAuth Token - -```python -from databricks.sdk import WorkspaceClient +That token is Lakebase-scoped and is used as the Postgres password with `sslmode=require`. -w = WorkspaceClient() +## Resource model -# Generate database credential for connecting (optionally scoped to an endpoint) -cred = w.postgres.generate_database_credential( - endpoint="projects/my-app/branches/production/endpoints/ep-primary" -) -token = cred.token # Use as password in connection string -# Token expires after 1 hour +```text +Project + └── Branches + ├── Endpoint/Compute: primary read-write endpoint + ├── Read replicas: optional read-only endpoints + ├── Roles + └── Databases + └── Schemas/Tables ``` -### Connect from Notebook +Canonical names: -```python -import psycopg -from databricks.sdk import WorkspaceClient - -w = WorkspaceClient() - -# Get endpoint details -endpoint = w.postgres.get_endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary" -) -host = endpoint.status.hosts.host - -# Generate token (scoped to endpoint) -cred = w.postgres.generate_database_credential( - endpoint="projects/my-app/branches/production/endpoints/ep-primary" -) - -# Connect using psycopg3 -conn_string = ( - f"host={host} " - f"dbname=databricks_postgres " - f"user={w.current_user.me().user_name} " - f"password={cred.token} " - f"sslmode=require" -) -with 
psycopg.connect(conn_string) as conn: - with conn.cursor() as cur: - cur.execute("SELECT version()") - print(cur.fetchone()) +```text +projects/{project_id} +projects/{project_id}/branches/{branch_id} +projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id} ``` -### Create a Branch for Development +Defaults on project creation: +- default branch: `production` +- default database: `databricks_postgres` +- primary read-write endpoint/compute +- Postgres role for the creator’s Databricks identity -```python -from databricks.sdk.service.postgres import Branch, BranchSpec, Duration - -# Create a dev branch with 7-day expiration -branch = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - ttl=Duration(seconds=604800) # 7 days - ) - ), - branch_id="development" -).wait() -print(f"Branch created: {branch.name}") -``` - -### Resize Compute (Autoscaling) - -```python -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, FieldMask - -# Update compute to autoscale between 2-8 CU -w.postgres.update_endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary", - endpoint=Endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary", - spec=EndpointSpec( - autoscaling_limit_min_cu=2.0, - autoscaling_limit_max_cu=8.0 - ) - ), - update_mask=FieldMask(field_mask=[ - "spec.autoscaling_limit_min_cu", - "spec.autoscaling_limit_max_cu" - ]) -).wait() -``` - -## MCP Tools - -The following MCP tools are available for managing Lakebase infrastructure. Use `type="autoscale"` for Lakebase Autoscaling. 
- -### manage_lakebase_database - Project Management - -| Action | Description | Required Params | -|--------|-------------|-----------------| -| `create_or_update` | Create or update a project | name | -| `get` | Get project details (includes branches/endpoints) | name | -| `list` | List all projects | (none, optional type filter) | -| `delete` | Delete project and all branches/computes/data | name | - -**Example usage:** -```python -# Create an autoscale project -manage_lakebase_database( - action="create_or_update", - name="my-app", - type="autoscale", - display_name="My Application", - pg_version="17" -) - -# Get project with branches -manage_lakebase_database(action="get", name="my-app", type="autoscale") - -# Delete project -manage_lakebase_database(action="delete", name="my-app", type="autoscale") -``` +Key SDK namespace: `WorkspaceClient().postgres`. -### manage_lakebase_branch - Branch Management +Most create/update/delete calls return long-running operations; call `.wait()`. -| Action | Description | Required Params | -|--------|-------------|-----------------| -| `create_or_update` | Create/update branch with compute endpoint | project_name, branch_id | -| `delete` | Delete branch and endpoints | name (full branch name) | - -**Example usage:** -```python -# Create a dev branch with 7-day TTL -manage_lakebase_branch( - action="create_or_update", - project_name="my-app", - branch_id="development", - source_branch="production", - ttl_seconds=604800, # 7 days - autoscaling_limit_min_cu=0.5, - autoscaling_limit_max_cu=4.0, - scale_to_zero_seconds=300 -) - -# Delete branch -manage_lakebase_branch(action="delete", name="projects/my-app/branches/development") -``` - -### generate_lakebase_credential - OAuth Tokens - -Generate OAuth token (~1hr) for PostgreSQL connections. Use as password with `sslmode=require`. 
- -```python -# For autoscale endpoints -generate_lakebase_credential(endpoint="projects/my-app/branches/production/endpoints/ep-primary") -``` - -## Reference Files - -- [projects.md](projects.md) - Project management patterns and settings -- [branches.md](branches.md) - Branching workflows, protection, and expiration -- [computes.md](computes.md) - Compute sizing, autoscaling, and scale-to-zero -- [connection-patterns.md](connection-patterns.md) - Connection patterns for different use cases -- [reverse-etl.md](reverse-etl.md) - Synced tables from Delta Lake to Lakebase - -## CLI Quick Reference - -```bash -# Create a project -databricks postgres create-project \ - --project-id my-app \ - --json '{"spec": {"display_name": "My App", "pg_version": "17"}}' - -# List projects -databricks postgres list-projects - -# Get project details -databricks postgres get-project projects/my-app - -# Create a branch -databricks postgres create-branch projects/my-app development \ - --json '{"spec": {"source_branch": "projects/my-app/branches/production", "no_expiry": true}}' - -# List branches -databricks postgres list-branches projects/my-app - -# Get endpoint details -databricks postgres get-endpoint projects/my-app/branches/production/endpoints/ep-primary - -# Delete a project -databricks postgres delete-project projects/my-app -``` - -## Key Differences from Lakebase Provisioned +## Lakebase Autoscaling vs Provisioned | Aspect | Provisioned | Autoscaling | -|--------|-------------|-------------| +|---|---|---| | SDK module | `w.database` | `w.postgres` | | Top-level resource | Instance | Project | -| Capacity | CU_1, CU_2, CU_4, CU_8 (16 GB/CU) | 0.5-112 CU (2 GB/CU) | -| Branching | Not supported | Full branching support | -| Scale-to-zero | Not supported | Configurable timeout | -| Operations | Synchronous | Long-running operations (LRO) | -| Read replicas | Readable secondaries | Dedicated read-only endpoints | - -## Common Issues - -| Issue | Solution | 
-|-------|----------| -| **Token expired during long query** | Implement token refresh loop; tokens expire after 1 hour | -| **Connection refused after scale-to-zero** | Compute wakes automatically on connection; reactivation takes a few hundred ms; implement retry logic | -| **DNS resolution fails on macOS** | Use `dig` command to resolve hostname, pass `hostaddr` to psycopg | -| **Branch deletion blocked** | Delete child branches first; cannot delete branches with children | -| **Autoscaling range too wide** | Max - min cannot exceed 8 CU (e.g., 8-16 CU is valid, 0.5-32 CU is not) | -| **SSL required error** | Always use `sslmode=require` in connection string | -| **Update mask required** | All update operations require an `update_mask` specifying fields to modify | -| **Connection closed after 24h idle** | All connections have a 24-hour idle timeout and 3-day max lifetime; implement retry logic | - -## Current Limitations - -These features are NOT yet supported in Lakebase Autoscaling: -- High availability with readable secondaries (use read replicas instead) -- Databricks Apps UI integration (Apps can connect manually via credentials) -- Feature Store integration -- Stateful AI agents (LangChain memory) -- Postgres-to-Delta sync (only Delta-to-Postgres reverse ETL) -- Custom billing tags and serverless budget policies -- Direct migration from Lakebase Provisioned (use pg_dump/pg_restore or reverse ETL) - -## SDK Version Requirements - -- **Databricks SDK for Python**: >= 0.81.0 (for `w.postgres` module) -- **psycopg**: 3.x (supports `hostaddr` parameter for DNS workaround) -- **SQLAlchemy**: 2.x with `postgresql+psycopg` driver +| Capacity | fixed CU tiers, ~16 GB/CU | 0.5–112 CU, ~2 GB/CU | +| Branching | no | yes | +| Scale-to-zero | no | yes | +| Operations | mostly synchronous | LROs; use `.wait()` | +| Reverse ETL | synced tables | synced tables | +| Read replicas | readable secondaries | dedicated read-only endpoints | + +## Non-obvious facts to preserve 
+ +- Postgres versions: **16 and 17**. +- AWS regions: `us-east-1`, `us-east-2`, `eu-central-1`, `eu-west-1`, `eu-west-2`, `ap-south-1`, `ap-southeast-1`, `ap-southeast-2`. +- Azure beta regions: `eastus2`, `westeurope`, `westus`. +- Autoscaling computes: 0.5–32 CU with `max - min <= 8`. +- Large fixed computes: 36–112 CU. +- Autoscaling CU ≈ 2 GB RAM. +- `sslmode=require` on all driver connections. +- Endpoint host comes from `w.postgres.get_endpoint(...).status.hosts.host`. +- GET responses often return effective properties under `status`; create/update payloads use `spec`. +- All update calls need a `FieldMask`. +- Scale-to-zero wake-up is automatic but apps should retry. +- Connections can be closed by platform timeouts: 24-hour idle timeout and 3-day max connection lifetime. +- macOS DNS can fail on long Lakebase hostnames; if so, resolve to IP and pass both `host` and `hostaddr` to psycopg. +- Triggered/Continuous synced tables require Delta Change Data Feed. +- Reverse ETL is Delta-to-Postgres only; not Postgres-to-Delta. + +## Task files + +- `connections.md` — app/notebook connection patterns and credential rotation. +- `operations.md` — project, branch, endpoint/compute, scale-to-zero, limits, MCP mapping. +- `reverse-etl.md` — synced tables from Delta Lake to Lakebase. + +## SDK / package versions -```python -%pip install -U "databricks-sdk>=0.81.0" "psycopg[binary]>=3.0" sqlalchemy +```bash +pip install -U "databricks-sdk>=0.81.0" "psycopg[binary,pool]>=3.1" "sqlalchemy>=2" ``` -## Notes - -- **Compute Units** in Autoscaling provide ~2 GB RAM each (vs 16 GB in Provisioned). -- **Resource naming** follows hierarchical paths: `projects/{id}/branches/{id}/endpoints/{id}`. -- All create/update/delete operations are **long-running** -- use `.wait()` in the SDK. -- Tokens are short-lived (1 hour) -- production apps MUST implement token refresh. -- **Postgres versions** 16 and 17 are supported. 
+Use SQLAlchemy URL prefix `postgresql+psycopg://...` for psycopg3. -## Related Skills +## Current limitations -- **[databricks-lakebase-provisioned](../databricks-lakebase-provisioned/SKILL.md)** - fixed-capacity managed PostgreSQL (predecessor) -- **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps that can use Lakebase for persistence -- **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps with Lakebase backend -- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - SDK used for project management and token generation -- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps with Lakebase resources -- **[databricks-jobs](../databricks-jobs/SKILL.md)** - scheduling reverse ETL sync jobs +Not yet supported or not equivalent to Provisioned: +- High availability with readable secondaries; use read replicas instead. +- Databricks Apps UI integration may lag; Apps can connect manually via credentials/resource env vars. +- Feature Store integration. +- Stateful AI-agent memory integrations. +- Postgres-to-Delta sync. +- Custom billing tags / serverless budget policies. +- Direct migration from Lakebase Provisioned; use `pg_dump`/`pg_restore` or reverse ETL patterns where appropriate. diff --git a/databricks-skills/databricks-lakebase-autoscale/branches.md b/databricks-skills/databricks-lakebase-autoscale/branches.md deleted file mode 100644 index f44f7234..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/branches.md +++ /dev/null @@ -1,212 +0,0 @@ -# Lakebase Autoscaling Branches - -## Overview - -Branches in Lakebase Autoscaling are isolated database environments that share storage with their parent through copy-on-write. They enable Git-like workflows for databases: create isolated dev/test environments, test schema changes safely, and recover from mistakes. 
- -## Branch Types - -| Option | Description | Use Case | -|--------|-------------|----------| -| **Current data** | Branch from latest state of parent | Development, testing with current data | -| **Past data** | Branch from a specific point in time | Point-in-time recovery, historical analysis | - -## Creating a Branch - -### With Expiration (TTL) - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Branch, BranchSpec, Duration - -w = WorkspaceClient() - -# Create branch with 7-day expiration -result = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - ttl=Duration(seconds=604800) # 7 days - ) - ), - branch_id="development" -).wait() - -print(f"Branch created: {result.name}") -print(f"Expires: {result.status.expire_time}") -``` - -### Permanent Branch (No Expiration) - -```python -result = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - no_expiry=True - ) - ), - branch_id="staging" -).wait() -``` - -### CLI - -```bash -# With TTL -databricks postgres create-branch projects/my-app development \ - --json '{ - "spec": { - "source_branch": "projects/my-app/branches/production", - "ttl": "604800s" - } - }' - -# Permanent -databricks postgres create-branch projects/my-app staging \ - --json '{ - "spec": { - "source_branch": "projects/my-app/branches/production", - "no_expiry": true - } - }' -``` - -## Getting Branch Details - -```python -branch = w.postgres.get_branch( - name="projects/my-app/branches/development" -) - -print(f"Branch: {branch.name}") -print(f"Protected: {branch.status.is_protected}") -print(f"Default: {branch.status.default}") -print(f"State: {branch.status.current_state}") -print(f"Size: {branch.status.logical_size_bytes} bytes") -``` - -## Listing Branches - -```python -branches = 
list(w.postgres.list_branches( - parent="projects/my-app" -)) - -for branch in branches: - print(f"Branch: {branch.name}") - print(f" Default: {branch.status.default}") - print(f" Protected: {branch.status.is_protected}") -``` - -## Protecting a Branch - -Protected branches cannot be deleted, reset, or archived. - -```python -from databricks.sdk.service.postgres import Branch, BranchSpec, FieldMask - -w.postgres.update_branch( - name="projects/my-app/branches/production", - branch=Branch( - name="projects/my-app/branches/production", - spec=BranchSpec(is_protected=True) - ), - update_mask=FieldMask(field_mask=["spec.is_protected"]) -).wait() -``` - -To remove protection: - -```python -w.postgres.update_branch( - name="projects/my-app/branches/production", - branch=Branch( - name="projects/my-app/branches/production", - spec=BranchSpec(is_protected=False) - ), - update_mask=FieldMask(field_mask=["spec.is_protected"]) -).wait() -``` - -## Updating Branch Expiration - -```python -# Extend to 14 days -w.postgres.update_branch( - name="projects/my-app/branches/development", - branch=Branch( - name="projects/my-app/branches/development", - spec=BranchSpec( - is_protected=False, - ttl=Duration(seconds=1209600) # 14 days - ) - ), - update_mask=FieldMask(field_mask=["spec.is_protected", "spec.expiration"]) -).wait() - -# Remove expiration -w.postgres.update_branch( - name="projects/my-app/branches/development", - branch=Branch( - name="projects/my-app/branches/development", - spec=BranchSpec(no_expiry=True) - ), - update_mask=FieldMask(field_mask=["spec.expiration"]) -).wait() -``` - -## Resetting a Branch from Parent - -Reset completely replaces a branch's data and schema with the latest from its parent. Local changes are lost. 
- -```python -w.postgres.reset_branch( - name="projects/my-app/branches/development" -).wait() -``` - -**Constraints:** -- Root branches (like `production`) cannot be reset (no parent) -- Branches with children cannot be reset (delete children first) -- Connections are temporarily interrupted during reset - -## Deleting a Branch - -```python -w.postgres.delete_branch( - name="projects/my-app/branches/development" -).wait() -``` - -**Constraints:** -- Cannot delete branches with child branches (delete children first) -- Cannot delete protected branches (remove protection first) -- Cannot delete the default branch - -## Branch Expiration - -Branch expiration sets an automatic deletion timestamp. Useful for: -- **CI/CD environments**: 2-4 hours -- **Demos**: 24-48 hours -- **Feature development**: 1-7 days -- **Long-term testing**: up to 30 days - -**Maximum expiration period:** 30 days from current time. - -### Expiration Restrictions - -- Cannot expire protected branches -- Cannot expire default branches -- Cannot expire branches that have children -- When a branch expires, all compute resources are also deleted - -## Best Practices - -1. **Use TTL for ephemeral branches**: Set expiration for dev/test branches to avoid accumulation -2. **Protect production branches**: Prevent accidental deletion or reset -3. **Reset instead of recreate**: Use reset from parent when you need fresh data without new branch overhead -4. **Schema diff before merge**: Compare schemas between branches before applying changes to production -5. 
**Monitor unarchived limit**: Only 10 unarchived branches are allowed per project diff --git a/databricks-skills/databricks-lakebase-autoscale/computes.md b/databricks-skills/databricks-lakebase-autoscale/computes.md deleted file mode 100644 index 0f53d50c..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/computes.md +++ /dev/null @@ -1,208 +0,0 @@ -# Lakebase Autoscaling Computes - -## Overview - -A compute is a virtualized service that runs Postgres for a branch. Each branch has one primary read-write compute and can have optional read replicas. Computes support autoscaling, scale-to-zero, and granular sizing from 0.5 to 112 CU. - -## Compute Sizing - -Each Compute Unit (CU) allocates approximately 2 GB of RAM. - -### Available Sizes - -| Category | Range | Notes | -|----------|-------|-------| -| **Autoscale computes** | 0.5-32 CU | Dynamic scaling within range (max-min <= 8 CU) | -| **Large fixed-size** | 36-112 CU | Fixed size, no autoscaling | - -### Representative Sizes - -| Compute Units | RAM | Max Connections | -|--------------|-----|-----------------| -| 0.5 CU | ~1 GB | 104 | -| 1 CU | ~2 GB | 209 | -| 4 CU | ~8 GB | 839 | -| 8 CU | ~16 GB | 1,678 | -| 16 CU | ~32 GB | 3,357 | -| 32 CU | ~64 GB | 4,000 | -| 64 CU | ~128 GB | 4,000 | -| 112 CU | ~224 GB | 4,000 | - -**Note:** Lakebase Provisioned used ~16 GB per CU. Autoscaling uses ~2 GB per CU for more granular scaling. 
- -## Creating a Compute - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, EndpointType - -w = WorkspaceClient() - -# Create a read-write compute endpoint -result = w.postgres.create_endpoint( - parent="projects/my-app/branches/production", - endpoint=Endpoint( - spec=EndpointSpec( - endpoint_type=EndpointType.ENDPOINT_TYPE_READ_WRITE, - autoscaling_limit_min_cu=0.5, - autoscaling_limit_max_cu=4.0 - ) - ), - endpoint_id="my-compute" -).wait() - -print(f"Endpoint created: {result.name}") -print(f"Host: {result.status.hosts.host}") -``` - -### CLI - -```bash -databricks postgres create-endpoint \ - projects/my-app/branches/production my-compute \ - --json '{ - "spec": { - "endpoint_type": "ENDPOINT_TYPE_READ_WRITE", - "autoscaling_limit_min_cu": 0.5, - "autoscaling_limit_max_cu": 4.0 - } - }' -``` - -**Important:** Each branch can have only one read-write compute. - -## Getting Compute Details - -```python -endpoint = w.postgres.get_endpoint( - name="projects/my-app/branches/production/endpoints/my-compute" -) - -print(f"Endpoint: {endpoint.name}") -print(f"Type: {endpoint.status.endpoint_type}") -print(f"State: {endpoint.status.current_state}") -print(f"Host: {endpoint.status.hosts.host}") -print(f"Min CU: {endpoint.status.autoscaling_limit_min_cu}") -print(f"Max CU: {endpoint.status.autoscaling_limit_max_cu}") -``` - -## Listing Computes - -```python -endpoints = list(w.postgres.list_endpoints( - parent="projects/my-app/branches/production" -)) - -for ep in endpoints: - print(f"Endpoint: {ep.name}") - print(f" Type: {ep.status.endpoint_type}") - print(f" CU Range: {ep.status.autoscaling_limit_min_cu}-{ep.status.autoscaling_limit_max_cu}") -``` - -## Resizing a Compute - -Use `update_mask` to specify which fields to update: - -```python -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, FieldMask - -# Update min and max CU -w.postgres.update_endpoint( - 
name="projects/my-app/branches/production/endpoints/my-compute", - endpoint=Endpoint( - name="projects/my-app/branches/production/endpoints/my-compute", - spec=EndpointSpec( - autoscaling_limit_min_cu=2.0, - autoscaling_limit_max_cu=8.0 - ) - ), - update_mask=FieldMask(field_mask=[ - "spec.autoscaling_limit_min_cu", - "spec.autoscaling_limit_max_cu" - ]) -).wait() -``` - -### CLI - -```bash -# Update single field -databricks postgres update-endpoint \ - projects/my-app/branches/production/endpoints/my-compute \ - spec.autoscaling_limit_max_cu \ - --json '{"spec": {"autoscaling_limit_max_cu": 8.0}}' - -# Update multiple fields -databricks postgres update-endpoint \ - projects/my-app/branches/production/endpoints/my-compute \ - "spec.autoscaling_limit_min_cu,spec.autoscaling_limit_max_cu" \ - --json '{"spec": {"autoscaling_limit_min_cu": 2.0, "autoscaling_limit_max_cu": 8.0}}' -``` - -## Deleting a Compute - -```python -w.postgres.delete_endpoint( - name="projects/my-app/branches/production/endpoints/my-compute" -).wait() -``` - -## Autoscaling - -Autoscaling dynamically adjusts compute resources based on workload demand. - -### Configuration - -- **Range:** 0.5-32 CU -- **Constraint:** Max - Min cannot exceed 8 CU -- **Valid examples:** 4-8 CU, 8-16 CU, 16-24 CU -- **Invalid example:** 0.5-32 CU (range of 31.5 CU) - -### Best Practices - -- Set minimum CU large enough to cache your working set in memory -- Performance may be degraded until compute scales up and caches data -- Connection limits are based on the maximum CU in the range - -## Scale-to-Zero - -Automatically suspends compute after a period of inactivity. 
- -| Setting | Description | -|---------|-------------| -| **Enabled** | Compute suspends after inactivity timeout (saves cost) | -| **Disabled** | Always-active compute (eliminates wake-up latency) | - -**Default behavior:** -- `production` branch: Scale-to-zero **disabled** (always active) -- Other branches: Scale-to-zero can be configured - -**Default inactivity timeout:** 5 minutes -**Minimum inactivity timeout:** 60 seconds - -### Wake-up Behavior - -When a connection arrives on a suspended compute: -1. Compute starts automatically (reactivation takes a few hundred milliseconds) -2. The connection request is handled transparently once active -3. Compute restarts at minimum autoscaling size (if autoscaling enabled) -4. Applications should implement connection retry logic for the brief reactivation period - -### Session Context After Reactivation - -When a compute suspends and reactivates, session context is **reset**: -- In-memory statistics and cache contents are cleared -- Temporary tables and prepared statements are lost -- Session-specific configuration settings reset -- Connection pools and active transactions are terminated - -If your application requires persistent session data, consider disabling scale-to-zero. 
- -## Sizing Guidance - -| Factor | Recommendation | -|--------|---------------| -| Query complexity | Complex analytical queries benefit from larger computes | -| Concurrent connections | More connections need more CPU and memory | -| Data volume | Larger datasets may need more memory for performance | -| Response time | Critical apps may require larger computes | diff --git a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md deleted file mode 100644 index b789fd8e..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md +++ /dev/null @@ -1,417 +0,0 @@ -# Lakebase Autoscaling Connection Patterns - -## Overview - -This document covers the canonical connection patterns for Lakebase Autoscaling, ordered by recommendation: - -1. **`psycopg_pool.ConnectionPool` + `OAuthConnection`** — canonical for production Databricks Apps. Used by the [official tutorial](https://docs.databricks.com/aws/en/oltp/projects/tutorial-databricks-apps-autoscaling), the [external app SDK guide](https://docs.databricks.com/aws/en/oltp/projects/external-apps-connect), and [`databricks-ai-bridge`](https://github.com/databricks/databricks-ai-bridge/blob/main/src/databricks_ai_bridge/lakebase.py). Zero background threads — rotation is handled by pool recycling. -2. **SQLAlchemy `do_connect` event + background refresh** — alternative for apps already using SQLAlchemy async. Works but adds a background `asyncio.Task` you don't need. -3. **Direct `psycopg.connect`** — only for one-off scripts / notebooks where the session lives < 1 hour. -4. **Static URL** — local development only. 
- -## Authentication - -Lakebase Autoscaling supports two authentication methods: - -| Method | Token Lifetime | Best For | -|--------|---------------|----------| -| **OAuth tokens** (`generate_database_credential`) | 1 hour, enforced at login only | Apps — rotate via pool recycling | -| **Native Postgres passwords** | No expiry | Long-running processes, tools without token rotation | - -**Critical distinction:** The workspace OAuth token (`w.config.oauth_token().access_token`) is *workspace-scoped* — it will **fail at PG login**. You must call `w.postgres.generate_database_credential(endpoint=...)` to mint a separate *Lakebase-scoped* JWT: - -```python -# ✅ CORRECT — Lakebase-scoped database credential -cred = w.postgres.generate_database_credential(endpoint=endpoint_name) -password = cred.token - -# ❌ WRONG — workspace-scoped token -password = w.config.oauth_token().access_token -``` - -**Connection timeouts (both methods):** -- **24-hour idle timeout**: Connections with no activity for 24 hours are automatically closed -- **3-day maximum connection life**: Connections alive for more than 3 days may be closed - -Design your applications to handle connection timeouts with retry logic. - -## 1. `psycopg_pool.ConnectionPool` + `OAuthConnection` (CANONICAL) - -This is the pattern from the official Databricks tutorial, external app guide, and `databricks-ai-bridge`. **Use this for any production Databricks App.** - -### How it works - -1. `OAuthConnection.connect()` mints a fresh Lakebase credential every time the pool opens a new physical connection. -2. Lakebase tokens expire at 1 hour, but expiration is enforced **only at login** — already-open connections stay valid. -3. `max_lifetime=2700` (45 min) tells the pool to recycle connections before tokens expire. When the pool reopens, `OAuthConnection.connect()` fires and gets a fresh token. -4. The 15-minute buffer (60 min token − 45 min recycle) means you never race against expiry. 
- -**Result:** Fully transparent token rotation with zero background tasks, zero timers, zero manual refresh logic. - -> **Why not `max_lifetime=3600` (the default)?** You'd hand out connections with nearly-expired tokens. A connection established at minute 59 with a token that expires at minute 60 will fail a minute later. Prefer 2700 — a 15-minute buffer before the 1-hour expiry. (The official tutorial leaves `max_lifetime` unset and relies on psycopg's defaults; `databricks-ai-bridge` uses 2700. 2700 isn't prescribed by any official spec — it's a defensive convention.) - -### `app.yaml` - -```yaml -command: ['flask', '--app', 'app.py', 'run', '--host', '0.0.0.0', '--port', '8000'] -env: - # These 6 are auto-injected when you add a Lakebase (postgres) resource in the UI: - # PGAPPNAME, PGHOST, PGPORT, PGDATABASE, PGUSER, PGSSLMODE - # Only the *first* database resource gets auto-injected; extra resources need explicit valueFrom. - # You MUST manually add ENDPOINT_NAME — it's needed by generate_database_credential(): - - name: ENDPOINT_NAME - value: 'projects//branches//endpoints/' -``` - -### `requirements.txt` - -``` -flask -psycopg[binary,pool]>=3.1.0 # psycopg3 required — psycopg2.pool has no connection_class hook for OAuthConnection -databricks-sdk>=0.81.0 -``` - -### `app.py` (Flask) - -```python -import os -from databricks.sdk import WorkspaceClient -import psycopg -from psycopg_pool import ConnectionPool -from flask import Flask - -app = Flask(__name__) - -# Inside Databricks Apps, WorkspaceClient() auto-authenticates via SP credentials. -w = WorkspaceClient() - - -class OAuthConnection(psycopg.Connection): - """Inject a fresh Lakebase OAuth token on every pool-opened connection. 
- - The pool calls OAuthConnection.connect() when: - - Filling min_size on startup - - Recycling a connection (max_lifetime exceeded) - - Creating a new connection under load - - Replacing a connection that failed health-check - - No background refresh thread is needed: tokens are always fresh at login - time, and login is where Lakebase enforces expiration. - """ - - @classmethod - def connect(cls, conninfo='', **kwargs): - endpoint_name = os.environ["ENDPOINT_NAME"] - cred = w.postgres.generate_database_credential(endpoint=endpoint_name) - kwargs['password'] = cred.token - return super().connect(conninfo, **kwargs) - - -username = os.environ["PGUSER"] # SP client ID — auto-injected -host = os.environ["PGHOST"] # e.g. ep-restless-pond-e4wvk0yn... — auto-injected -port = os.environ.get("PGPORT", "5432") -database = os.environ["PGDATABASE"] # typically "databricks_postgres" — auto-injected -sslmode = os.environ.get("PGSSLMODE", "require") - -pool = ConnectionPool( - conninfo=( - f"dbname={database} user={username} " - f"host={host} port={port} sslmode={sslmode}" - ), - connection_class=OAuthConnection, - min_size=1, - max_size=10, - # 2700 (45 min) recycles connections 15 min before the 1-hour token expiry. - # The official tutorial doesn't set max_lifetime; databricks-ai-bridge uses 2700. - max_lifetime=2700, - open=True, -) - - -@app.route('/') -def index(): - with pool.connection() as conn: - with conn.cursor() as cur: - cur.execute("SELECT current_user, current_database()") - row = cur.fetchone() - return f"Connected as {row[0]} to {row[1]}" - - -if __name__ == '__main__': - app.run(host="0.0.0.0", port=8000) -``` - -### FastAPI variant - -Identical pattern, but use `open=False` with an explicit lifespan. 
Two reasons: (1) startup failures surface immediately via `pool.open(wait=True)`; (2) `open=True` is deprecated for `AsyncConnectionPool` and will raise an error in psycopg 4.0 — using `open=False` + lifespan is the forward-compatible pattern for any FastAPI app regardless of sync/async pool. - -```python -from contextlib import asynccontextmanager -from fastapi import FastAPI - -pool = ConnectionPool( - conninfo=..., - connection_class=OAuthConnection, - min_size=1, max_size=10, - max_lifetime=2700, - open=False, # Opened explicitly in lifespan -) - - -@asynccontextmanager -async def lifespan(app: FastAPI): - pool.open(wait=True, timeout=30.0) # Fail fast if DB unreachable - yield - pool.close() - - -app = FastAPI(lifespan=lifespan) - - -@app.get("/api/data") -def get_data(): # sync def — FastAPI runs in threadpool automatically - with pool.connection() as conn: - with conn.cursor() as cur: - cur.execute("SELECT ...") - return cur.fetchall() -``` - -## 2. SQLAlchemy `do_connect` Event + Background Refresh Loop (Alternative) - -**Use only if your app is already SQLAlchemy-async.** Otherwise prefer pattern 1 — the variant below adds a background refresh task you don't need. - -> **What's official vs. what's a community variant.** The `do_connect` event itself is the official Databricks-recommended way to inject credentials into a SQLAlchemy engine, and `databricks-ai-bridge.AsyncLakebaseSQLAlchemy` uses it. What's *not* in any official doc is layering a background `asyncio.Task` on top to pre-warm tokens. That's the part this section demotes. If you're already on SQLAlchemy and want to avoid a background loop, the simplest port is to call `engine.dispose()` on a schedule (or rely on `pool_recycle`) and let `do_connect` re-mint the credential on the next checkout — same idea as pattern 1, just routed through SQLAlchemy. 
- -```python -import asyncio -from typing import AsyncGenerator, Optional -from contextlib import asynccontextmanager - -from sqlalchemy import event -from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker -from databricks.sdk import WorkspaceClient - - -class LakebaseAutoscaleConnectionManager: - """Manages Lakebase Autoscaling connections with background token refresh. - - This pattern works but adds operational complexity (a background asyncio.Task) - that isn't necessary. Prefer psycopg_pool + OAuthConnection (pattern 1). - """ - - def __init__( - self, - project_id: str, - branch_id: str = "production", - database_name: str = "databricks_postgres", - pool_size: int = 5, - max_overflow: int = 10, - token_refresh_seconds: int = 3000, # 50 minutes - ): - self.project_id = project_id - self.branch_id = branch_id - self.database_name = database_name - self.pool_size = pool_size - self.max_overflow = max_overflow - self.token_refresh_seconds = token_refresh_seconds - - self._current_token: Optional[str] = None - self._refresh_task: Optional[asyncio.Task] = None - self._engine = None - self._session_maker = None - - def _endpoint_name(self) -> str: - w = WorkspaceClient() - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{self.project_id}/branches/{self.branch_id}" - )) - if not endpoints: - raise RuntimeError( - f"No endpoints for projects/{self.project_id}/branches/{self.branch_id}" - ) - return endpoints[0].name - - def _generate_token(self) -> str: - w = WorkspaceClient() - cred = w.postgres.generate_database_credential(endpoint=self._endpoint_name()) - return cred.token - - def _get_host(self) -> str: - w = WorkspaceClient() - ep = w.postgres.get_endpoint(name=self._endpoint_name()) - return ep.status.hosts.host - - async def _refresh_loop(self): - while True: - await asyncio.sleep(self.token_refresh_seconds) - try: - self._current_token = await asyncio.to_thread(self._generate_token) - except Exception as e: - 
print(f"Token refresh failed: {e}") - - def initialize(self): - w = WorkspaceClient() - host = self._get_host() - username = w.current_user.me().user_name - - self._current_token = self._generate_token() - - url = f"postgresql+psycopg://{username}@{host}:5432/{self.database_name}" - self._engine = create_async_engine( - url, - pool_size=self.pool_size, - max_overflow=self.max_overflow, - pool_recycle=3600, - connect_args={"sslmode": "require"}, - ) - - @event.listens_for(self._engine.sync_engine, "do_connect") - def inject_token(dialect, conn_rec, cargs, cparams): - cparams["password"] = self._current_token - - self._session_maker = async_sessionmaker( - self._engine, class_=AsyncSession, expire_on_commit=False - ) - - def start_refresh(self): - if not self._refresh_task: - self._refresh_task = asyncio.create_task(self._refresh_loop()) - - async def stop_refresh(self): - if self._refresh_task: - self._refresh_task.cancel() - try: - await self._refresh_task - except asyncio.CancelledError: - pass - self._refresh_task = None - - @asynccontextmanager - async def session(self) -> AsyncGenerator[AsyncSession, None]: - async with self._session_maker() as session: - yield session - - async def close(self): - await self.stop_refresh() - if self._engine: - await self._engine.dispose() -``` - -## 3. 
Direct `psycopg.connect` (Scripts / Notebooks Only) - -For one-off scripts or notebooks where the process lives well under an hour: - -```python -import psycopg -from databricks.sdk import WorkspaceClient - - -def get_connection(project_id: str, branch_id: str = "production", - endpoint_id: str = None, database_name: str = "databricks_postgres"): - """Get a one-shot database connection with a fresh OAuth token.""" - w = WorkspaceClient() - - if endpoint_id: - ep_name = f"projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id}" - else: - # Pick the first endpoint under the branch - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{project_id}/branches/{branch_id}" - )) - ep_name = endpoints[0].name - - endpoint = w.postgres.get_endpoint(name=ep_name) - host = endpoint.status.hosts.host - - cred = w.postgres.generate_database_credential(endpoint=ep_name) - - return psycopg.connect( - host=host, - dbname=database_name, - user=w.current_user.me().user_name, - password=cred.token, - sslmode="require", - ) - - -# Usage -with get_connection("my-app") as conn: - with conn.cursor() as cur: - cur.execute("SELECT NOW()") - print(cur.fetchone()) -``` - -## 4. Static URL (Local Development Only) - -```python -import os -from sqlalchemy.ext.asyncio import create_async_engine - -# LAKEBASE_PG_URL=postgresql://user:password@host:5432/database - -def get_database_url() -> str: - url = os.environ.get("LAKEBASE_PG_URL", "") - if url.startswith("postgresql://"): - url = url.replace("postgresql://", "postgresql+psycopg://", 1) - return url - - -engine = create_async_engine( - get_database_url(), - pool_size=5, - connect_args={"sslmode": "require"}, -) -``` - -## DNS Resolution Workaround (macOS) - -Python's `socket.getaddrinfo()` can fail with long hostnames on macOS. 
Fall back to `dig`: - -```python -import subprocess -import socket - - -def resolve_hostname(hostname: str) -> str: - """Resolve hostname using dig command (macOS workaround).""" - try: - return socket.gethostbyname(hostname) - except socket.gaierror: - pass - - try: - result = subprocess.run( - ["dig", "+short", hostname], - capture_output=True, text=True, timeout=5, - ) - for ip in result.stdout.strip().split('\n'): - if ip and not ip.startswith(';'): - return ip - except Exception: - pass - - raise RuntimeError(f"Could not resolve hostname: {hostname}") - - -# Use with psycopg: set `host` for TLS SNI and `hostaddr` for the actual connection -conn_params = { - "host": hostname, - "hostaddr": resolve_hostname(hostname), - "dbname": database_name, - "user": username, - "password": token, - "sslmode": "require", -} -conn = psycopg.connect(**conn_params) -``` - -## Best Practices - -1. **Default to pattern 1** (`psycopg_pool.ConnectionPool` + `OAuthConnection`). It's the canonical Databricks App pattern, works out of the box, no background tasks. -2. **Prefer `max_lifetime=2700` over the 3600 default.** A 15-minute buffer before the 1-hour token expiry avoids handing out connections with nearly-expired tokens. Not a hard spec — the official tutorial doesn't set it; `databricks-ai-bridge` uses 2700. -3. **Always `sslmode=require`** on every connection (it's auto-injected as `PGSSLMODE` in Databricks Apps). -4. **Never use `config.token` / `oauth_token().access_token` as the PG password** — that's a workspace-scoped token. Use `generate_database_credential()` to mint a Lakebase-scoped one. -5. **Handle DNS issues on macOS** using the `hostaddr` workaround if your dev machine can't resolve Lakebase hostnames. -6. **Use context managers** (`with pool.connection() as conn:`) so connections are always returned to the pool. -7. **Expect 2-5 second wake-up latency** on the first query after scale-to-zero — retry with backoff. -8. 
**Log credential refresh events** in `OAuthConnection.connect()` during early development — makes token-related failures easy to spot. diff --git a/databricks-skills/databricks-lakebase-autoscale/connections.md b/databricks-skills/databricks-lakebase-autoscale/connections.md new file mode 100644 index 00000000..0831a788 --- /dev/null +++ b/databricks-skills/databricks-lakebase-autoscale/connections.md @@ -0,0 +1,212 @@ +# Lakebase Autoscaling connection patterns + +Order of preference: + +1. **Canonical:** `psycopg_pool.ConnectionPool` + `OAuthConnection` subclass + `max_lifetime=2700`. +2. **SQLAlchemy:** official `do_connect` auth hook; optionally rely on `pool_recycle`/`dispose()` rather than a background token loop. +3. **Direct `psycopg.connect`:** notebooks/one-shot scripts under 1 hour. +4. **Static Postgres URL/native password:** local/dev tools only, or tools unable to rotate OAuth credentials. + +## Authentication facts + +Lakebase OAuth database credentials: +- Mint with `WorkspaceClient().postgres.generate_database_credential(endpoint=...)`. +- Use `cred.token` as the Postgres password. +- Expire after about 1 hour. +- Expiry is enforced at login; already-open connections continue until closed by pool/platform timeouts. + +Critical warning: + +```python +# ✅ Lakebase-scoped credential: works for Postgres login +cred = w.postgres.generate_database_credential(endpoint=endpoint_name) +password = cred.token + +# ❌ Workspace-scoped token: fails at Postgres login +password = w.config.oauth_token().access_token +# also do not use WorkspaceClient().config.token +``` + +Always connect with `sslmode=require`. + +## 1. Canonical: psycopg pool + OAuthConnection + +Use for production Databricks Apps and most Python services. + +Key mechanics: +- The pool calls `OAuthConnection.connect()` whenever it opens a physical connection: initial fill, growth under load, recycle, replacement after failure. 
+- `connect()` mints a fresh Lakebase token just-in-time and injects it as `password`. +- `max_lifetime=2700` recycles physical connections after 45 minutes, before 1-hour token expiry. +- No background refresh thread/task is needed. + +Minimal skeleton: + +```python +import os +import psycopg +from psycopg_pool import ConnectionPool +from databricks.sdk import WorkspaceClient + +w = WorkspaceClient() + +class OAuthConnection(psycopg.Connection): + @classmethod + def connect(cls, conninfo="", **kwargs): + cred = w.postgres.generate_database_credential( + endpoint=os.environ["ENDPOINT_NAME"] + ) + kwargs["password"] = cred.token + return super().connect(conninfo, **kwargs) + +pool = ConnectionPool( + conninfo=( + f"dbname={os.environ['PGDATABASE']} " + f"user={os.environ['PGUSER']} " + f"host={os.environ['PGHOST']} " + f"port={os.environ.get('PGPORT', '5432')} " + f"sslmode={os.environ.get('PGSSLMODE', 'require')}" + ), + connection_class=OAuthConnection, + min_size=1, + max_size=10, + max_lifetime=2700, + open=True, +) +``` + +Prefer `2700`; it is a defensive convention. The official Databricks tutorial leaves `max_lifetime` unset; `databricks-ai-bridge` uses `2700`. + +For FastAPI or explicit startup: +- instantiate with `open=False` +- call `pool.open(wait=True, timeout=30.0)` in lifespan/startup +- call `pool.close()` on shutdown + +This also avoids relying on implicit open behavior. + +## Databricks Apps environment variables + +When adding a Lakebase/Postgres resource to a Databricks App, these are auto-injected for the **first** DB resource: + +```text +PGAPPNAME +PGHOST +PGPORT +PGDATABASE +PGUSER +PGSSLMODE +``` + +Gotchas: +- `PGUSER` is typically the app service principal client ID. +- Only the first database resource is auto-injected; additional resources need explicit `valueFrom`. +- `ENDPOINT_NAME` is **not** auto-injected. 
Add it manually because `generate_database_credential(endpoint=...)` requires the full endpoint path:
+
+```yaml
+env:
+  - name: ENDPOINT_NAME
+    value: "projects/<project-id>/branches/<branch-id>/endpoints/<endpoint-id>"
+```
+
+## 2. SQLAlchemy: official `do_connect` hook
+
+Use when the app is already built around SQLAlchemy.
+
+Important distinction:
+- `do_connect` is the official Databricks-recommended SQLAlchemy credential injection hook and is used by `databricks-ai-bridge`.
+- The community/extra-complexity variant is adding a background `asyncio.Task` token-refresh loop. Demote that loop, not `do_connect`.
+
+Recommended hook shape:
+
+```python
+from sqlalchemy import event
+from sqlalchemy.ext.asyncio import create_async_engine
+from databricks.sdk import WorkspaceClient
+
+w = WorkspaceClient()
+endpoint_name = "projects/my-app/branches/production/endpoints/ep-primary"
+host = w.postgres.get_endpoint(name=endpoint_name).status.hosts.host
+user = w.current_user.me().user_name
+
+engine = create_async_engine(
+    f"postgresql+psycopg://{user}@{host}:5432/databricks_postgres",
+    connect_args={"sslmode": "require"},
+    pool_recycle=2700,
+)
+
+@event.listens_for(engine.sync_engine, "do_connect")
+def inject_lakebase_token(dialect, conn_rec, cargs, cparams):
+    cred = w.postgres.generate_database_credential(endpoint=endpoint_name)
+    cparams["password"] = cred.token
+```
+
+Notes:
+- `do_connect` fires when SQLAlchemy opens a new DBAPI connection.
+- `pool_recycle=2700` approximates the psycopg-pool pattern.
+- If you need deterministic refresh, prefer scheduled `engine.dispose()` and let the next checkout re-open with `do_connect`.
+- A background token cache/refresh task is optional complexity and can create stale-token races if implemented poorly.
+
+## 3. Direct psycopg for notebooks/scripts
+
+Only for short-lived sessions where connections are opened and used immediately.
+
+Recipe:
+1. Build endpoint path.
+2. `get_endpoint(...).status.hosts.host`.
+3.
`generate_database_credential(endpoint=endpoint_name)`.
+4. `psycopg.connect(host=host, dbname="databricks_postgres", user=<user>, password=cred.token, sslmode="require")`.
+
+Use `w.current_user.me().user_name` for user in notebooks/manual scripts. In Databricks Apps, prefer `PGUSER`.
+
+## 4. Static URL / native password
+
+Use only for local development, legacy tools, or clients that cannot rotate OAuth database credentials. For SQLAlchemy + psycopg3, normalize:
+
+```text
+postgresql://... -> postgresql+psycopg://...
+```
+
+Still set `sslmode=require`.
+
+## Endpoint discovery
+
+Avoid hardcoding host if you can hardcode the endpoint name instead:
+
+```python
+ep = w.postgres.get_endpoint(
+    name="projects/my-app/branches/production/endpoints/ep-primary"
+)
+host = ep.status.hosts.host
+```
+
+If no endpoint ID is known, list under branch and choose deliberately:
+
+```python
+endpoints = list(w.postgres.list_endpoints(
+    parent="projects/my-app/branches/production"
+))
+```
+
+Do not assume the first endpoint is the primary if read replicas exist; check endpoint type/status.
+
+## DNS workaround for macOS
+
+Some macOS/Python resolver combinations fail on long Lakebase hostnames.
+
+Workaround:
+- Resolve the hostname externally, commonly with `dig +short <hostname>`.
+- Pass both:
+  - `host=<hostname>` for TLS/SNI/certificate validation.
+  - `hostaddr=<ip-address>` for the actual TCP connection.
+
+psycopg3 supports `hostaddr`.
+
+## Timeouts, scale-to-zero, and retries
+
+Plan for:
+- 1-hour Lakebase OAuth token lifetime at login.
+- 24-hour idle connection timeout.
+- 3-day maximum connection lifetime.
+- Scale-to-zero wake-up latency; first connection/query after suspension may need retry/backoff.
+- After suspension/reactivation: session context is reset, temp tables/prepared statements are gone, active transactions/connections are terminated.
+
+Use context managers so pooled connections return promptly.
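The retry/backoff advice above can be sketched as a small wrapper around any connection factory. This is a generic sketch — the helper name and defaults are mine, not part of any Databricks SDK; in real code you would pass `retry_on=(psycopg.OperationalError,)`:

```python
import time


def with_wake_up_retry(connect_fn, attempts=4, base_delay=0.5,
                       retry_on=(Exception,)):
    """Call connect_fn until it succeeds, backing off exponentially.

    Covers the brief window where a suspended (scale-to-zero) compute is
    reactivating and the first connection attempt is refused.
    connect_fn is any zero-argument callable returning an open connection,
    e.g. lambda: psycopg.connect(conninfo).
    """
    for attempt in range(attempts):
        try:
            return connect_fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base_delay * (2 ** attempt))
```

Keep the attempt budget small (a few seconds total) so genuine outages still fail fast.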
diff --git a/databricks-skills/databricks-lakebase-autoscale/operations.md b/databricks-skills/databricks-lakebase-autoscale/operations.md new file mode 100644 index 00000000..dc02e885 --- /dev/null +++ b/databricks-skills/databricks-lakebase-autoscale/operations.md @@ -0,0 +1,297 @@ +# Lakebase Autoscaling operations + +Use `WorkspaceClient().postgres` for Autoscaling projects, branches, endpoints, roles, and credentials. Most create/update/delete methods return long-running operations; call `.wait()`. + +```python +from databricks.sdk import WorkspaceClient +w = WorkspaceClient() +``` + +## Resource names + +```text +Project: projects/{project_id} +Branch: projects/{project_id}/branches/{branch_id} +Endpoint: projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id} +``` + +Project ID rules: +- 1–63 chars +- lowercase letters, digits, hyphens +- cannot start/end with hyphen +- immutable after creation + +Default database: `databricks_postgres`. + +## Projects + +Create: + +```python +from databricks.sdk.service.postgres import Project, ProjectSpec + +project = w.postgres.create_project( + project=Project(spec=ProjectSpec(display_name="My App", pg_version="17")), + project_id="my-app", +).wait() +``` + +Project defaults: +- `production` branch +- primary read-write endpoint +- `databricks_postgres` database +- role for creator’s Databricks identity +- production scale-to-zero disabled by default + +GET gotcha: effective properties are typically in `project.status`, not `project.spec`. 
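The project ID rules above are easy to check client-side before calling `create_project`. A minimal sketch — the function and regex are my own, not part of the SDK:

```python
import re

# 1-63 chars, lowercase letters/digits/hyphens, no leading/trailing hyphen
_PROJECT_ID_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if project_id satisfies the documented naming rules."""
    return _PROJECT_ID_RE.fullmatch(project_id) is not None
```

Validating locally avoids a round-trip failure on an immutable identifier.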
+ +Update requires `FieldMask`: + +```python +from databricks.sdk.service.postgres import FieldMask + +w.postgres.update_project( + name="projects/my-app", + project=Project( + name="projects/my-app", + spec=ProjectSpec(display_name="New Name"), + ), + update_mask=FieldMask(field_mask=["spec.display_name"]), +).wait() +``` + +Delete is destructive and permanent; delete dependent Unity Catalog catalogs/synced tables first where applicable: + +```python +w.postgres.delete_project(name="projects/my-app").wait() +``` + +## Branches + +Branches are copy-on-write isolated database environments. Use them for dev/test/staging, schema-change validation, point-in-time recovery workflows, and ephemeral CI. + +Create branch from current parent: + +```python +from databricks.sdk.service.postgres import Branch, BranchSpec, Duration + +branch = w.postgres.create_branch( + parent="projects/my-app", + branch=Branch(spec=BranchSpec( + source_branch="projects/my-app/branches/production", + ttl=Duration(seconds=604800), # or no_expiry=True + )), + branch_id="development", +).wait() +``` + +Keep: +- `ttl=Duration(seconds=...)` for ephemeral branches. +- `no_expiry=True` for permanent branches. +- Max expiration period: 30 days from current time. +- Only 10 unarchived branches per project. +- Protected branches cannot be deleted, reset, archived, or expired. +- Default branch cannot be deleted or expired. +- Branches with children cannot be deleted, reset, or expired; delete children first. +- Reset replaces branch data/schema with latest parent and interrupts connections. 
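The TTL constraint above (expiration at most 30 days from now) can be checked before constructing the `Duration`. A sketch — `branch_ttl_seconds` is my own helper, not an SDK function:

```python
MAX_BRANCH_TTL_SECONDS = 30 * 86400  # branches may expire at most 30 days out


def branch_ttl_seconds(days: float) -> int:
    """Convert a day count to seconds, enforcing the 30-day maximum."""
    seconds = int(days * 86400)
    if not 0 < seconds <= MAX_BRANCH_TTL_SECONDS:
        raise ValueError("branch TTL must be positive and at most 30 days")
    return seconds
```

Used as `ttl=Duration(seconds=branch_ttl_seconds(7))`, this yields the 604800 shown above.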
+ +Protect production: + +```python +w.postgres.update_branch( + name="projects/my-app/branches/production", + branch=Branch( + name="projects/my-app/branches/production", + spec=BranchSpec(is_protected=True), + ), + update_mask=FieldMask(field_mask=["spec.is_protected"]), +).wait() +``` + +Reset/delete: + +```python +w.postgres.reset_branch(name="projects/my-app/branches/development").wait() +w.postgres.delete_branch(name="projects/my-app/branches/development").wait() +``` + +Branch status fields worth inspecting: +- `status.default` +- `status.is_protected` +- `status.current_state` +- `status.logical_size_bytes` +- `status.expire_time` + +## Endpoints / computes + +A compute endpoint runs Postgres for a branch. Each branch has at most one primary read-write endpoint and may have read-only replica endpoints. + +Create endpoint: + +```python +from databricks.sdk.service.postgres import Endpoint, EndpointSpec, EndpointType + +ep = w.postgres.create_endpoint( + parent="projects/my-app/branches/production", + endpoint=Endpoint(spec=EndpointSpec( + endpoint_type=EndpointType.ENDPOINT_TYPE_READ_WRITE, + autoscaling_limit_min_cu=0.5, + autoscaling_limit_max_cu=4.0, + )), + endpoint_id="ep-primary", +).wait() +``` + +Get host: + +```python +host = w.postgres.get_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary" +).status.hosts.host +``` + +Resize with update mask: + +```python +w.postgres.update_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary", + endpoint=Endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary", + spec=EndpointSpec( + autoscaling_limit_min_cu=2.0, + autoscaling_limit_max_cu=8.0, + ), + ), + update_mask=FieldMask(field_mask=[ + "spec.autoscaling_limit_min_cu", + "spec.autoscaling_limit_max_cu", + ]), +).wait() +``` + +Delete: + +```python +w.postgres.delete_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary" +).wait() +``` + +## Compute sizing + +Autoscaling 
uses ~2 GB RAM per CU. + +| CU | Approx RAM | Max connections | +|---:|---:|---:| +| 0.5 | ~1 GB | 104 | +| 1 | ~2 GB | 209 | +| 4 | ~8 GB | 839 | +| 8 | ~16 GB | 1,678 | +| 16 | ~32 GB | 3,357 | +| 32 | ~64 GB | 4,000 | +| 64 | ~128 GB | 4,000 | +| 112 | ~224 GB | 4,000 | + +Rules: +- Autoscale range: 0.5–32 CU. +- `autoscaling_limit_max_cu - autoscaling_limit_min_cu <= 8`. +- Valid: 4–8, 8–16, 16–24. +- Invalid: 0.5–32. +- Large fixed-size computes: 36–112 CU; no autoscaling. +- Connection limit is based on max CU. +- Set min CU high enough for working-set cache and latency needs. + +## Scale-to-zero + +Defaults: +- `production`: disabled by default. +- Other branches: configurable. +- Default inactivity timeout: 5 minutes. +- Minimum inactivity timeout: 60 seconds. + +Wake-up: +- First connection wakes compute automatically. +- Apps should use retry/backoff for the brief reactivation period. +- Reactivated compute starts at minimum autoscaling size. + +Session reset after suspension: +- temp tables gone +- prepared statements gone +- in-memory stats/cache cleared +- session settings reset +- active transactions/connections terminated + +Disable scale-to-zero for latency-critical apps or apps relying on persistent session state. 
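The compute-sizing rules earlier in this section can be enforced client-side before `create_endpoint`/`update_endpoint`. A sketch — the validator is mine, not an SDK call:

```python
def validate_autoscale_range(min_cu: float, max_cu: float) -> None:
    """Raise ValueError if (min_cu, max_cu) violates the autoscale rules."""
    if not (0.5 <= min_cu <= max_cu <= 32):
        raise ValueError("autoscale CUs must satisfy 0.5 <= min <= max <= 32")
    if max_cu - min_cu > 8:
        raise ValueError("autoscale span (max - min) cannot exceed 8 CU")
```

This catches the classic mistake (0.5–32, a 31.5 CU span) locally instead of at the API.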
+ +## Project limits + +| Resource | Limit | +|---|---:| +| Projects per workspace | 1000 | +| Branches per project | 500 | +| Unarchived branches | 10 | +| Root branches | 3 | +| Protected branches | 1 | +| Concurrently active computes | 20 | +| Postgres roles per branch | 500 | +| Postgres databases per branch | 500 | +| Logical data size per branch | 8 TB | +| Snapshots | 10 | +| Maximum history retention | 35 days | +| Minimum scale-to-zero time | 60 sec | + +## CLI names + +CLI mirrors the SDK under `databricks postgres`, for example: +- `create-project`, `get-project`, `list-projects`, `update-project`, `delete-project` +- `create-branch`, `list-branches`, `reset-branch`, `delete-branch` +- `create-endpoint`, `get-endpoint`, `list-endpoints`, `update-endpoint`, `delete-endpoint` + +## MCP tools + +Use `type="autoscale"` for Lakebase Autoscaling. + +### `manage_lakebase_database` + +Actions: +- `create_or_update`: requires `name`; useful params include `display_name`, `pg_version` +- `get`: requires `name` +- `list`: optional type filter +- `delete`: requires `name` + +Example intent: + +```python +manage_lakebase_database( + action="create_or_update", + name="my-app", + type="autoscale", + display_name="My Application", + pg_version="17", +) +``` + +### `manage_lakebase_branch` + +Actions: +- `create_or_update`: requires `project_name`, `branch_id` +- `delete`: requires full branch `name` + +Useful params: +- `source_branch` +- `ttl_seconds` +- `autoscaling_limit_min_cu` +- `autoscaling_limit_max_cu` +- `scale_to_zero_seconds` + +### `generate_lakebase_credential` + +Generate a Lakebase-scoped database credential: + +```python +generate_lakebase_credential( + endpoint="projects/my-app/branches/production/endpoints/ep-primary" +) +``` + +Use returned token as the Postgres password with `sslmode=require`. 
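The token-as-password step above can be sketched as a small conninfo builder — the function name is mine, and libpq quoting of exotic password characters is ignored for brevity:

```python
def lakebase_conninfo(host: str, user: str, token: str,
                      dbname: str = "databricks_postgres",
                      port: int = 5432) -> str:
    """Build a libpq conninfo string using the OAuth token as the password."""
    return (
        f"host={host} port={port} dbname={dbname} "
        f"user={user} password={token} sslmode=require"
    )
```

Pass the result straight to `psycopg.connect(...)`; note the token is only checked at login, so mint it immediately before connecting.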
diff --git a/databricks-skills/databricks-lakebase-autoscale/projects.md b/databricks-skills/databricks-lakebase-autoscale/projects.md deleted file mode 100644 index 659207a4..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/projects.md +++ /dev/null @@ -1,204 +0,0 @@ -# Lakebase Autoscaling Projects - -## Overview - -A project is the top-level container for Lakebase Autoscaling resources, including branches, computes, databases, and roles. Each project is isolated and contains its own Postgres version, compute defaults, and restore window settings. - -## Project Structure - -``` -Project - └── Branches (production, development, staging, etc.) - ├── Computes (R/W compute, read replicas) - ├── Roles (Postgres roles) - └── Databases (Postgres databases) -``` - -When a project is created, it includes by default: -- A `production` branch (the default branch) -- A primary read-write compute (8-32 CU, autoscaling enabled, scale-to-zero disabled) -- A `databricks_postgres` database -- A Postgres role for the creating user's Databricks identity - -## Resource Naming - -Projects follow a hierarchical naming convention: -``` -projects/{project_id} -``` - -**Resource ID requirements:** -- 1-63 characters long -- Lowercase letters, digits, and hyphens only -- Cannot start or end with a hyphen -- Cannot be changed after creation - -## Creating a Project - -### Python SDK - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Project, ProjectSpec - -w = WorkspaceClient() - -# Create a project (long-running operation) -operation = w.postgres.create_project( - project=Project( - spec=ProjectSpec( - display_name="My Application", - pg_version="17" - ) - ), - project_id="my-app" -) - -# Wait for completion -result = operation.wait() -print(f"Created project: {result.name}") -print(f"Display name: {result.status.display_name}") -print(f"Postgres version: {result.status.pg_version}") -``` - -### CLI - -```bash -databricks 
postgres create-project \ - --project-id my-app \ - --json '{ - "spec": { - "display_name": "My Application", - "pg_version": "17" - } - }' -``` - -## Getting Project Details - -### Python SDK - -```python -project = w.postgres.get_project(name="projects/my-app") - -print(f"Project: {project.name}") -print(f"Display name: {project.status.display_name}") -print(f"Postgres version: {project.status.pg_version}") -``` - -### CLI - -```bash -databricks postgres get-project projects/my-app -``` - -**Note:** The `spec` field is not populated for GET operations. All properties are returned in the `status` field. - -## Listing Projects - -```python -projects = w.postgres.list_projects() - -for project in projects: - print(f"Project: {project.name}") - print(f" Display name: {project.status.display_name}") - print(f" Postgres version: {project.status.pg_version}") -``` - -## Updating a Project - -Updates require an `update_mask` specifying which fields to modify: - -```python -from databricks.sdk.service.postgres import Project, ProjectSpec, FieldMask - -# Update display name -operation = w.postgres.update_project( - name="projects/my-app", - project=Project( - name="projects/my-app", - spec=ProjectSpec( - display_name="My Updated Application" - ) - ), - update_mask=FieldMask(field_mask=["spec.display_name"]) -) -result = operation.wait() -``` - -### CLI - -```bash -databricks postgres update-project projects/my-app spec.display_name \ - --json '{ - "spec": { - "display_name": "My Updated Application" - } - }' -``` - -## Deleting a Project - -**WARNING:** Deleting a project is permanent and also deletes all branches, computes, databases, roles, and data. - -Delete all Unity Catalog catalogs and synced tables before deleting the project. 
- -```python -operation = w.postgres.delete_project(name="projects/my-app") -# This is a long-running operation -``` - -### CLI - -```bash -databricks postgres delete-project projects/my-app -``` - -## Project Settings - -### Compute Defaults - -Default settings for new primary computes: -- Compute size range (0.5-112 CU) -- Scale-to-zero timeout (default: 5 minutes) - -### Instant Restore - -Configure the restore window length (2-35 days). Longer windows increase storage costs. - -### Postgres Version - -Supports Postgres 16 and Postgres 17. - -## Project Limits - -| Resource | Limit | -|----------|-------| -| Concurrently active computes | 20 | -| Branches per project | 500 | -| Postgres roles per branch | 500 | -| Postgres databases per branch | 500 | -| Logical data size per branch | 8 TB | -| Projects per workspace | 1000 | -| Protected branches | 1 | -| Root branches | 3 | -| Unarchived branches | 10 | -| Snapshots | 10 | -| Maximum history retention | 35 days | -| Minimum scale-to-zero time | 60 seconds | - -## Long-Running Operations - -All create, update, and delete operations return a long-running operation (LRO). Use `.wait()` in the SDK to block until completion: - -```python -# Start operation -operation = w.postgres.create_project(...) - -# Wait for completion -result = operation.wait() - -# Or check status manually -op_status = w.postgres.get_operation(name=operation.name) -print(f"Done: {op_status.done}") -``` diff --git a/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md b/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md index f983eebb..949f91b6 100644 --- a/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md +++ b/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md @@ -1,56 +1,59 @@ -# Reverse ETL with Lakebase Autoscaling +# Reverse ETL / synced tables -## Overview +Reverse ETL syncs Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables for OLTP access. 
-Reverse ETL allows you to sync data from Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables. This enables OLTP access patterns on data processed in the Lakehouse. +Important namespace split: +- Lakebase Autoscaling infrastructure: `w.postgres` +- Synced tables: `w.database` -## How It Works +Reverse ETL is Delta-to-Postgres only; Postgres-to-Delta sync is not supported here. -Synced tables create a managed copy of Unity Catalog data in Lakebase: +## How synced tables work -1. A new Unity Catalog table (read-only, managed by the sync pipeline) -2. A Postgres table in Lakebase (queryable by applications) +A synced table creates/maintains: +1. A managed/read-only Unity Catalog table for pipeline state/output. +2. A PostgreSQL table in Lakebase queried by apps. -The sync pipeline uses managed Lakeflow Spark Declarative Pipelines to continuously update both tables. +The sync pipeline uses managed Lakeflow Spark Declarative Pipelines. -### Performance +Performance planning: +- Continuous writes: ~1,200 rows/sec per CU. +- Bulk writes: ~15,000 rows/sec per CU. +- Each synced table can use up to 16 Postgres connections. 
-- **Continuous writes:** ~1,200 rows/sec per CU -- **Bulk writes:** ~15,000 rows/sec per CU -- **Connections used:** Up to 16 per synced table +## Sync modes -## Sync Modes +| Mode | Behavior | Use when | CDF required | +|---|---|---|---| +| `SNAPSHOT` | one-time full copy | initial loads, historical copy, large replacement | no | +| `TRIGGERED` | scheduled/on-demand incremental updates | hourly/daily operational refresh | yes | +| `CONTINUOUS` | streaming updates, seconds latency | live applications | yes | -| Mode | Description | Best For | Notes | -|------|-------------|----------|-------| -| **Snapshot** | One-time full copy | Initial setup, historical analysis | 10x more efficient if modifying >10% of data | -| **Triggered** | Scheduled updates on demand | Dashboards updated hourly/daily | Requires CDF on source table | -| **Continuous** | Real-time streaming (seconds of latency) | Live applications | Highest cost, minimum 15s intervals, requires CDF | - -**Note:** Triggered and Continuous modes require Change Data Feed (CDF) enabled on the source table: +Triggered and Continuous require Delta Change Data Feed on the source table: ```sql -ALTER TABLE your_catalog.your_schema.your_table -SET TBLPROPERTIES (delta.enableChangeDataFeed = true) +ALTER TABLE catalog.schema.table +SET TBLPROPERTIES (delta.enableChangeDataFeed = true); ``` -## Creating Synced Tables +Snapshot can be more efficient when modifying >10% of the data. 
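For rough capacity planning against the throughput figures above, a sketch that rounds up to whole CUs. The helper is illustrative, and the per-CU rates are planning guidance, not guarantees:

```python
import math

def cus_for_sync(rows_per_sec, bulk=False):
    """Estimate CUs needed for synced-table write throughput, using the
    planning figures above (~1,200 rows/sec/CU continuous, ~15,000 bulk)."""
    per_cu = 15_000 if bulk else 1_200
    return math.ceil(rows_per_sec / per_cu)
```

For example, a 5,000 rows/sec continuous feed suggests 5 CU, while the same volume as a bulk load fits in 1 CU.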
+ +## Create a synced table -### Using Python SDK +Use `databricks.sdk.service.database` models: ```python from databricks.sdk import WorkspaceClient from databricks.sdk.service.database import ( - SyncedDatabaseTable, - SyncedTableSpec, NewPipelineSpec, + SyncedDatabaseTable, SyncedTableSchedulingPolicy, + SyncedTableSpec, ) w = WorkspaceClient() -# Create a synced table -synced_table = w.database.create_synced_database_table( +w.database.create_synced_database_table( SyncedDatabaseTable( name="lakebase_catalog.schema.synced_table", spec=SyncedTableSpec( @@ -59,55 +62,35 @@ synced_table = w.database.create_synced_database_table( scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED, new_pipeline_spec=NewPipelineSpec( storage_catalog="lakebase_catalog", - storage_schema="staging" - ) + storage_schema="staging", + ), ), ) ) -print(f"Created synced table: {synced_table.name}") ``` -### Using CLI - -```bash -databricks database create-synced-database-table \ - --json '{ - "name": "lakebase_catalog.schema.synced_table", - "spec": { - "source_table_full_name": "analytics.gold.user_profiles", - "primary_key_columns": ["user_id"], - "scheduling_policy": "TRIGGERED", - "new_pipeline_spec": { - "storage_catalog": "lakebase_catalog", - "storage_schema": "staging" - } - } - }' -``` - -## Checking Synced Table Status +Status: ```python -status = w.database.get_synced_database_table(name="lakebase_catalog.schema.synced_table") -print(f"State: {status.data_synchronization_status.detailed_state}") -print(f"Message: {status.data_synchronization_status.message}") +st = w.database.get_synced_database_table( + name="lakebase_catalog.schema.synced_table" +) +state = st.data_synchronization_status.detailed_state +message = st.data_synchronization_status.message ``` -## Deleting a Synced Table - -Delete from both Unity Catalog and Postgres: - -1. **Unity Catalog:** Delete from Catalog Explorer or SDK -2. **Postgres:** Drop the table to free storage +Deletion cleanup: +1. 
Delete the synced table / UC object. +2. Drop the Postgres target table if needed to free Lakebase storage. ```sql -DROP TABLE your_database.your_schema.your_table; +DROP TABLE schema.table; ``` -## Data Type Mapping +## Type mapping -| Unity Catalog Type | Postgres Type | -|-------------------|---------------| +| Unity Catalog | Postgres | +|---|---| | BIGINT | BIGINT | | BINARY | BYTEA | | BOOLEAN | BOOLEAN | @@ -126,52 +109,19 @@ DROP TABLE your_database.your_schema.your_table; | MAP | JSONB | | STRUCT | JSONB | -**Unsupported types:** GEOGRAPHY, GEOMETRY, VARIANT, OBJECT - -## Capacity Planning - -- **Connection usage:** Each synced table uses up to 16 connections -- **Size limits:** 2 TB total across all synced tables; recommend < 1 TB per table -- **Naming:** Database, schema, and table names only allow `[A-Za-z0-9_]+` -- **Schema evolution:** Only additive changes (e.g., adding columns) for Triggered/Continuous modes - -## Use Cases - -### Product Catalog for Web App - -```python -w.database.create_synced_database_table( - SyncedDatabaseTable( - name="ecommerce_catalog.public.products", - spec=SyncedTableSpec( - source_table_full_name="gold.products.catalog", - primary_key_columns=["product_id"], - scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED, - ), - ) -) -``` - -### Real-time Feature Serving - -```python -w.database.create_synced_database_table( - SyncedDatabaseTable( - name="ml_catalog.public.user_features", - spec=SyncedTableSpec( - source_table_full_name="ml.features.user_features", - primary_key_columns=["user_id"], - scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS, - ), - ) -) -``` - -## Best Practices - -1. **Enable CDF** on source tables before creating Triggered or Continuous synced tables -2. **Choose appropriate sync mode**: Snapshot for small tables, Triggered for hourly/daily, Continuous for real-time -3. **Monitor sync status**: Check for failures and latency via Catalog Explorer -4. 
**Index target tables**: Create appropriate indexes in Postgres for your query patterns -5. **Handle schema changes**: Only additive changes are supported for streaming modes -6. **Account for connection limits**: Each synced table uses up to 16 connections +Unsupported: +- `GEOGRAPHY` +- `GEOMETRY` +- `VARIANT` +- `OBJECT` + +## Limits and gotchas + +- Up to 16 Postgres connections per synced table; include this in endpoint connection-capacity planning. +- Size limit: 2 TB total across all synced tables. +- Recommended: <1 TB per synced table. +- Database/schema/table names: `[A-Za-z0-9_]+`. +- Triggered/Continuous schema evolution: additive changes only. +- Create indexes in Postgres for application query patterns after sync. +- Monitor detailed sync state in Catalog Explorer or with `get_synced_database_table`. +- Delete synced-table dependencies before deleting the Lakebase project. From 77d823a8e5061a5f3017f216b603d21d3a11eb81 Mon Sep 17 00:00:00 2001 From: David O'Keeffe Date: Sat, 9 May 2026 15:16:12 +1000 Subject: [PATCH 6/6] fix(lakebase-autoscale): correct CU range / spread / fixed-size bounds Two factual errors that pre-dated this PR (originally in computes.md, preserved through the densification pass): - Autoscale spread: was "max - min <= 8", correct is "max - min <= 16" - Fixed-size always-on compute floor: was 36 CU, correct is 40 CU - Updated "Valid / Invalid" examples to match the <= 16 spread rule (4-20, 8-16, 16-32; invalid 0.5-32 has spread 31.5) Source: Lakebase Autoscaling tutorial / Dustin's official Genie Code Lakebase skill draft, confirmed by user. 
Co-authored-by: Isaac --- databricks-skills/databricks-lakebase-autoscale/SKILL.md | 4 ++-- .../databricks-lakebase-autoscale/operations.md | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/databricks-skills/databricks-lakebase-autoscale/SKILL.md b/databricks-skills/databricks-lakebase-autoscale/SKILL.md index e8a4d61a..8d7dd6f5 100644 --- a/databricks-skills/databricks-lakebase-autoscale/SKILL.md +++ b/databricks-skills/databricks-lakebase-autoscale/SKILL.md @@ -94,8 +94,8 @@ Most create/update/delete calls return long-running operations; call `.wait()`. - Postgres versions: **16 and 17**. - AWS regions: `us-east-1`, `us-east-2`, `eu-central-1`, `eu-west-1`, `eu-west-2`, `ap-south-1`, `ap-southeast-1`, `ap-southeast-2`. - Azure beta regions: `eastus2`, `westeurope`, `westus`. -- Autoscaling computes: 0.5–32 CU with `max - min <= 8`. -- Large fixed computes: 36–112 CU. +- Autoscaling computes: 0.5–32 CU with `max - min <= 16`. +- Fixed-size always-on computes: 40–112 CU. - Autoscaling CU ≈ 2 GB RAM. - `sslmode=require` on all driver connections. - Endpoint host comes from `w.postgres.get_endpoint(...).status.hosts.host`. diff --git a/databricks-skills/databricks-lakebase-autoscale/operations.md b/databricks-skills/databricks-lakebase-autoscale/operations.md index dc02e885..982bfb58 100644 --- a/databricks-skills/databricks-lakebase-autoscale/operations.md +++ b/databricks-skills/databricks-lakebase-autoscale/operations.md @@ -194,10 +194,10 @@ Autoscaling uses ~2 GB RAM per CU. Rules: - Autoscale range: 0.5–32 CU. -- `autoscaling_limit_max_cu - autoscaling_limit_min_cu <= 8`. -- Valid: 4–8, 8–16, 16–24. -- Invalid: 0.5–32. -- Large fixed-size computes: 36–112 CU; no autoscaling. +- `autoscaling_limit_max_cu - autoscaling_limit_min_cu <= 16`. +- Valid: 4–20, 8–16, 16–32. +- Invalid: 0.5–32 (spread of 31.5 exceeds 16). +- Fixed-size always-on computes: 40–112 CU; no autoscaling. - Connection limit is based on max CU. 
- Set min CU high enough for working-set cache and latency needs.
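The corrected bounds can be captured in a quick pre-flight check before calling the SDK. `validate_autoscaling` is an illustrative sketch of the rules above, not an SDK API:

```python
def validate_autoscaling(min_cu, max_cu):
    """Check an autoscaling range against the corrected limits:
    0.5-32 CU, with max - min <= 16."""
    if not (0.5 <= min_cu <= max_cu <= 32):
        return False  # outside the autoscale range (fixed-size is 40-112)
    return (max_cu - min_cu) <= 16
```

This accepts the valid examples (4-20, 8-16, 16-32) and rejects 0.5-32, whose spread of 31.5 exceeds 16.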