diff --git a/databricks-skills/databricks-lakebase-autoscale/SKILL.md b/databricks-skills/databricks-lakebase-autoscale/SKILL.md index f471765c..8d7dd6f5 100644 --- a/databricks-skills/databricks-lakebase-autoscale/SKILL.md +++ b/databricks-skills/databricks-lakebase-autoscale/SKILL.md @@ -5,330 +5,129 @@ description: "Patterns and best practices for Lakebase Autoscaling (next-gen man # Lakebase Autoscaling -Patterns and best practices for using Lakebase Autoscaling, the next-generation managed PostgreSQL on Databricks with autoscaling compute, branching, scale-to-zero, and instant restore. +Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service for OLTP workloads: autoscaling compute, database branching, scale-to-zero, instant restore, and Delta-to-Postgres synced tables. -## When to Use +Use this skill when creating/managing Lakebase Autoscaling projects, branches, endpoints/computes, credentials, reverse ETL synced tables, or app connections. -Use this skill when: -- Building applications that need a PostgreSQL database with autoscaling compute -- Working with database branching for dev/test/staging workflows -- Adding persistent state to applications with scale-to-zero cost savings -- Implementing reverse ETL from Delta Lake to an operational database via synced tables -- Managing Lakebase Autoscaling projects, branches, computes, or credentials +## Core framing -## Overview +> **There is no separate Python “Lakebase SDK.”** Use `databricks-sdk` for management and for minting short-lived database credentials with `WorkspaceClient().postgres.generate_database_credential(...)`; use standard Postgres drivers (`psycopg`, SQLAlchemy, JDBC, `pgx`, etc.) for SQL. -Lakebase Autoscaling is Databricks' next-generation managed PostgreSQL service for OLTP workloads. It provides autoscaling compute, Git-like branching, scale-to-zero, and instant point-in-time restore. +| Language | Credential / management SDK | DB driver / wrapper | +|---|---|---| +| **Python** | `databricks-sdk` `WorkspaceClient().postgres` | `psycopg[binary,pool]` canonical; SQLAlchemy supported | +| **Node/TS** | `@databricks/lakebase` convenience wrapper, Autoscaling only | Wrapper manages `pg` pool | +| **Java/Go** | Databricks SDK for Java/Go | Standard JDBC / `pgx` | -| Feature | Description | -|---------|-------------| -| **Autoscaling Compute** | 0.5-112 CU with 2 GB RAM per CU; scales dynamically based on load | -| **Scale-to-Zero** | Compute suspends after configurable inactivity timeout | -| **Branching** | Create isolated database environments (like Git branches) for dev/test | -| **Instant Restore** | Point-in-time restore from any moment within the configured window (up to 35 days) | -| **OAuth Authentication** | Token-based auth via Databricks SDK (1-hour expiry) | -| **Reverse ETL** | Sync data from Delta tables to PostgreSQL via synced tables | +## Lead connection pattern -**Available Regions (AWS):** us-east-1, us-east-2, eu-central-1, eu-west-1, eu-west-2, ap-south-1, ap-southeast-1, ap-southeast-2 +For production Python apps, start with: -**Available Regions (Azure Beta):** eastus2, westeurope, westus +1. `psycopg_pool.ConnectionPool` +2. `connection_class=OAuthConnection`, where `OAuthConnection(psycopg.Connection).connect()` calls `w.postgres.generate_database_credential(endpoint=...)` +3. 
`max_lifetime=2700` -## Project Hierarchy +This is the canonical pattern from the official Databricks Apps + Lakebase Autoscaling tutorial lineage and `databricks-ai-bridge`: no background token thread; physical connections get fresh credentials when opened/recycled. -Understanding the hierarchy is essential for working with Lakebase Autoscaling: +Prefer `max_lifetime=2700` as a defensive 45-minute recycle before 1-hour token expiry. The official tutorial does not set `max_lifetime`; `databricks-ai-bridge` uses `2700`. -``` -Project (top-level container) - └── Branch(es) (isolated database environments) - ├── Compute (primary R/W endpoint) - ├── Read Replica(s) (optional, read-only) - ├── Role(s) (Postgres roles) - └── Database(s) (Postgres databases) - └── Schema(s) -``` +See `connections.md`. -| Object | Description | -|--------|-------------| -| **Project** | Top-level container. Created via `w.postgres.create_project()`. | -| **Branch** | Isolated database environment with copy-on-write storage. Default branch is `production`. | -| **Compute** | Postgres server powering a branch. Configurable CU sizing and autoscaling. | -| **Database** | Standard Postgres database within a branch. Default is `databricks_postgres`. | +## Critical auth warning -## Quick Start +Do **not** use `WorkspaceClient().config.token`, `w.config.oauth_token().access_token`, or any workspace-scoped OAuth token as the Postgres password. It will fail at Postgres login. -Create a project and connect: +Use: ```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Project, ProjectSpec - -w = WorkspaceClient() - -# Create a project (long-running operation) -operation = w.postgres.create_project( - project=Project( - spec=ProjectSpec( - display_name="My Application", - pg_version="17" - ) - ), - project_id="my-app" -) -result = operation.wait() -print(f"Created project: {result.name}") +cred = WorkspaceClient().postgres.generate_database_credential(endpoint=endpoint_name) +password = cred.token ``` -## Common Patterns - -### Generate OAuth Token - -```python -from databricks.sdk import WorkspaceClient - -w = WorkspaceClient() - -# Generate database credential for connecting (optionally scoped to an endpoint) -cred = w.postgres.generate_database_credential( - endpoint="projects/my-app/branches/production/endpoints/ep-primary" -) -token = cred.token # Use as password in connection string -# Token expires after 1 hour -``` - -### Connect from Notebook - -```python -import psycopg -from databricks.sdk import WorkspaceClient - -w = WorkspaceClient() - -# Get endpoint details -endpoint = w.postgres.get_endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary" -) -host = endpoint.status.hosts.host - -# Generate token (scoped to endpoint) -cred = w.postgres.generate_database_credential( - endpoint="projects/my-app/branches/production/endpoints/ep-primary" -) - -# Connect using psycopg3 -conn_string = ( - f"host={host} " - f"dbname=databricks_postgres " - f"user={w.current_user.me().user_name} " - f"password={cred.token} " - f"sslmode=require" -) -with psycopg.connect(conn_string) as conn: - with conn.cursor() as cur: - cur.execute("SELECT version()") - print(cur.fetchone()) -``` +That token is Lakebase-scoped and is used as the Postgres password with `sslmode=require`. 
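+
+A minimal end-to-end sketch of the credential flow, assuming an endpoint named `ep-primary` on the `production` branch of a project `my-app` (substitute your own IDs): fetch the host from endpoint status, mint a Lakebase-scoped credential, and connect with psycopg.
+
+```python
+import psycopg
+from databricks.sdk import WorkspaceClient
+
+w = WorkspaceClient()
+
+# Illustrative endpoint path; adjust project/branch/endpoint IDs.
+endpoint_name = "projects/my-app/branches/production/endpoints/ep-primary"
+
+# Host lives under status; the credential is Lakebase-scoped and expires in ~1 hour.
+host = w.postgres.get_endpoint(name=endpoint_name).status.hosts.host
+cred = w.postgres.generate_database_credential(endpoint=endpoint_name)
+
+with psycopg.connect(
+    host=host,
+    dbname="databricks_postgres",
+    user=w.current_user.me().user_name,
+    password=cred.token,
+    sslmode="require",
+) as conn:
+    with conn.cursor() as cur:
+        cur.execute("SELECT version()")
+        print(cur.fetchone())
+```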
-### Create a Branch for Development +## Resource model -```python -from databricks.sdk.service.postgres import Branch, BranchSpec, Duration - -# Create a dev branch with 7-day expiration -branch = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - ttl=Duration(seconds=604800) # 7 days - ) - ), - branch_id="development" -).wait() -print(f"Branch created: {branch.name}") +```text +Project + └── Branches + ├── Endpoint/Compute: primary read-write endpoint + ├── Read replicas: optional read-only endpoints + ├── Roles + └── Databases + └── Schemas/Tables ``` -### Resize Compute (Autoscaling) +Canonical names: -```python -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, FieldMask - -# Update compute to autoscale between 2-8 CU -w.postgres.update_endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary", - endpoint=Endpoint( - name="projects/my-app/branches/production/endpoints/ep-primary", - spec=EndpointSpec( - autoscaling_limit_min_cu=2.0, - autoscaling_limit_max_cu=8.0 - ) - ), - update_mask=FieldMask(field_mask=[ - "spec.autoscaling_limit_min_cu", - "spec.autoscaling_limit_max_cu" - ]) -).wait() +```text +projects/{project_id} +projects/{project_id}/branches/{branch_id} +projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id} ``` -## MCP Tools +Defaults on project creation: +- default branch: `production` +- default database: `databricks_postgres` +- primary read-write endpoint/compute +- Postgres role for the creator’s Databricks identity -The following MCP tools are available for managing Lakebase infrastructure. Use `type="autoscale"` for Lakebase Autoscaling. +Key SDK namespace: `WorkspaceClient().postgres`. -### manage_lakebase_database - Project Management +Most create/update/delete calls return long-running operations; call `.wait()`. -| Action | Description | Required Params | -|--------|-------------|-----------------| -| `create_or_update` | Create or update a project | name | -| `get` | Get project details (includes branches/endpoints) | name | -| `list` | List all projects | (none, optional type filter) | -| `delete` | Delete project and all branches/computes/data | name | - -**Example usage:** -```python -# Create an autoscale project -manage_lakebase_database( - action="create_or_update", - name="my-app", - type="autoscale", - display_name="My Application", - pg_version="17" -) - -# Get project with branches -manage_lakebase_database(action="get", name="my-app", type="autoscale") - -# Delete project -manage_lakebase_database(action="delete", name="my-app", type="autoscale") -``` - -### manage_lakebase_branch - Branch Management - -| Action | Description | Required Params | -|--------|-------------|-----------------| -| `create_or_update` | Create/update branch with compute endpoint | project_name, branch_id | -| `delete` | Delete branch and endpoints | name (full branch name) | - -**Example usage:** -```python -# Create a dev branch with 7-day TTL -manage_lakebase_branch( - action="create_or_update", - project_name="my-app", - branch_id="development", - source_branch="production", - ttl_seconds=604800, # 7 days - autoscaling_limit_min_cu=0.5, - autoscaling_limit_max_cu=4.0, - scale_to_zero_seconds=300 -) - -# Delete branch -manage_lakebase_branch(action="delete", name="projects/my-app/branches/development") -``` - -### generate_lakebase_credential - OAuth Tokens - -Generate OAuth token (~1hr) for PostgreSQL connections. 
Use as password with `sslmode=require`. - -```python -# For autoscale endpoints -generate_lakebase_credential(endpoint="projects/my-app/branches/production/endpoints/ep-primary") -``` - -## Reference Files - -- [projects.md](projects.md) - Project management patterns and settings -- [branches.md](branches.md) - Branching workflows, protection, and expiration -- [computes.md](computes.md) - Compute sizing, autoscaling, and scale-to-zero -- [connection-patterns.md](connection-patterns.md) - Connection patterns for different use cases -- [reverse-etl.md](reverse-etl.md) - Synced tables from Delta Lake to Lakebase - -## CLI Quick Reference - -```bash -# Create a project -databricks postgres create-project \ - --project-id my-app \ - --json '{"spec": {"display_name": "My App", "pg_version": "17"}}' - -# List projects -databricks postgres list-projects - -# Get project details -databricks postgres get-project projects/my-app - -# Create a branch -databricks postgres create-branch projects/my-app development \ - --json '{"spec": {"source_branch": "projects/my-app/branches/production", "no_expiry": true}}' - -# List branches -databricks postgres list-branches projects/my-app - -# Get endpoint details -databricks postgres get-endpoint projects/my-app/branches/production/endpoints/ep-primary - -# Delete a project -databricks postgres delete-project projects/my-app -``` - -## Key Differences from Lakebase Provisioned +## Lakebase Autoscaling vs Provisioned | Aspect | Provisioned | Autoscaling | -|--------|-------------|-------------| +|---|---|---| | SDK module | `w.database` | `w.postgres` | | Top-level resource | Instance | Project | -| Capacity | CU_1, CU_2, CU_4, CU_8 (16 GB/CU) | 0.5-112 CU (2 GB/CU) | -| Branching | Not supported | Full branching support | -| Scale-to-zero | Not supported | Configurable timeout | -| Operations | Synchronous | Long-running operations (LRO) | -| Read replicas | Readable secondaries | Dedicated read-only endpoints | - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **Token expired during long query** | Implement token refresh loop; tokens expire after 1 hour | -| **Connection refused after scale-to-zero** | Compute wakes automatically on connection; reactivation takes a few hundred ms; implement retry logic | -| **DNS resolution fails on macOS** | Use `dig` command to resolve hostname, pass `hostaddr` to psycopg | -| **Branch deletion blocked** | Delete child branches first; cannot delete branches with children | -| **Autoscaling range too wide** | Max - min cannot exceed 8 CU (e.g., 8-16 CU is valid, 0.5-32 CU is not) | -| **SSL required error** | Always use `sslmode=require` in connection string | -| **Update mask required** | All update operations require an `update_mask` specifying fields to modify | -| **Connection closed after 24h idle** | All connections have a 24-hour idle timeout and 3-day max lifetime; implement retry logic | - -## Current Limitations - -These features are NOT yet supported in Lakebase Autoscaling: -- High availability with readable secondaries (use read replicas instead) -- Databricks Apps UI integration (Apps can connect manually via credentials) -- Feature Store integration -- Stateful AI agents (LangChain memory) -- Postgres-to-Delta sync (only Delta-to-Postgres reverse ETL) -- Custom billing tags and serverless budget policies -- Direct migration from Lakebase Provisioned (use pg_dump/pg_restore or reverse ETL) - -## SDK Version Requirements - -- **Databricks SDK for Python**: >= 0.81.0 (for `w.postgres` module) 
-- **psycopg**: 3.x (supports `hostaddr` parameter for DNS workaround) -- **SQLAlchemy**: 2.x with `postgresql+psycopg` driver +| Capacity | fixed CU tiers, ~16 GB/CU | 0.5–112 CU, ~2 GB/CU | +| Branching | no | yes | +| Scale-to-zero | no | yes | +| Operations | mostly synchronous | LROs; use `.wait()` | +| Reverse ETL | synced tables | synced tables | +| Read replicas | readable secondaries | dedicated read-only endpoints | + +## Non-obvious facts to preserve + +- Postgres versions: **16 and 17**. +- AWS regions: `us-east-1`, `us-east-2`, `eu-central-1`, `eu-west-1`, `eu-west-2`, `ap-south-1`, `ap-southeast-1`, `ap-southeast-2`. +- Azure beta regions: `eastus2`, `westeurope`, `westus`. +- Autoscaling computes: 0.5–32 CU with `max - min <= 16`. +- Fixed-size always-on computes: 40–112 CU. +- Autoscaling CU ≈ 2 GB RAM. +- `sslmode=require` on all driver connections. +- Endpoint host comes from `w.postgres.get_endpoint(...).status.hosts.host`. +- GET responses often return effective properties under `status`; create/update payloads use `spec`. +- All update calls need a `FieldMask`. +- Scale-to-zero wake-up is automatic but apps should retry. +- Connections can be closed by platform timeouts: 24-hour idle timeout and 3-day max connection lifetime. +- macOS DNS can fail on long Lakebase hostnames; if so, resolve to IP and pass both `host` and `hostaddr` to psycopg. +- Triggered/Continuous synced tables require Delta Change Data Feed. +- Reverse ETL is Delta-to-Postgres only; not Postgres-to-Delta. + +## Task files + +- `connections.md` — app/notebook connection patterns and credential rotation. +- `operations.md` — project, branch, endpoint/compute, scale-to-zero, limits, MCP mapping. +- `reverse-etl.md` — synced tables from Delta Lake to Lakebase. + +## SDK / package versions -```python -%pip install -U "databricks-sdk>=0.81.0" "psycopg[binary]>=3.0" sqlalchemy +```bash +pip install -U "databricks-sdk>=0.81.0" "psycopg[binary,pool]>=3.1" "sqlalchemy>=2" ``` -## Notes - -- **Compute Units** in Autoscaling provide ~2 GB RAM each (vs 16 GB in Provisioned). -- **Resource naming** follows hierarchical paths: `projects/{id}/branches/{id}/endpoints/{id}`. -- All create/update/delete operations are **long-running** -- use `.wait()` in the SDK. -- Tokens are short-lived (1 hour) -- production apps MUST implement token refresh. -- **Postgres versions** 16 and 17 are supported. +Use SQLAlchemy URL prefix `postgresql+psycopg://...` for psycopg3. -## Related Skills +## Current limitations -- **[databricks-lakebase-provisioned](../databricks-lakebase-provisioned/SKILL.md)** - fixed-capacity managed PostgreSQL (predecessor) -- **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps that can use Lakebase for persistence -- **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps with Lakebase backend -- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - SDK used for project management and token generation -- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps with Lakebase resources -- **[databricks-jobs](../databricks-jobs/SKILL.md)** - scheduling reverse ETL sync jobs +Not yet supported or not equivalent to Provisioned: +- High availability with readable secondaries; use read replicas instead. +- Databricks Apps UI integration may lag; Apps can connect manually via credentials/resource env vars. +- Feature Store integration. +- Stateful AI-agent memory integrations. +- Postgres-to-Delta sync. 
+- Custom billing tags / serverless budget policies. +- Direct migration from Lakebase Provisioned; use `pg_dump`/`pg_restore` or reverse ETL patterns where appropriate. diff --git a/databricks-skills/databricks-lakebase-autoscale/branches.md b/databricks-skills/databricks-lakebase-autoscale/branches.md deleted file mode 100644 index f44f7234..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/branches.md +++ /dev/null @@ -1,212 +0,0 @@ -# Lakebase Autoscaling Branches - -## Overview - -Branches in Lakebase Autoscaling are isolated database environments that share storage with their parent through copy-on-write. They enable Git-like workflows for databases: create isolated dev/test environments, test schema changes safely, and recover from mistakes. - -## Branch Types - -| Option | Description | Use Case | -|--------|-------------|----------| -| **Current data** | Branch from latest state of parent | Development, testing with current data | -| **Past data** | Branch from a specific point in time | Point-in-time recovery, historical analysis | - -## Creating a Branch - -### With Expiration (TTL) - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Branch, BranchSpec, Duration - -w = WorkspaceClient() - -# Create branch with 7-day expiration -result = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - ttl=Duration(seconds=604800) # 7 days - ) - ), - branch_id="development" -).wait() - -print(f"Branch created: {result.name}") -print(f"Expires: {result.status.expire_time}") -``` - -### Permanent Branch (No Expiration) - -```python -result = w.postgres.create_branch( - parent="projects/my-app", - branch=Branch( - spec=BranchSpec( - source_branch="projects/my-app/branches/production", - no_expiry=True - ) - ), - branch_id="staging" -).wait() -``` - -### CLI - -```bash -# With TTL -databricks postgres create-branch projects/my-app development \ - --json '{ - "spec": { - "source_branch": "projects/my-app/branches/production", - "ttl": "604800s" - } - }' - -# Permanent -databricks postgres create-branch projects/my-app staging \ - --json '{ - "spec": { - "source_branch": "projects/my-app/branches/production", - "no_expiry": true - } - }' -``` - -## Getting Branch Details - -```python -branch = w.postgres.get_branch( - name="projects/my-app/branches/development" -) - -print(f"Branch: {branch.name}") -print(f"Protected: {branch.status.is_protected}") -print(f"Default: {branch.status.default}") -print(f"State: {branch.status.current_state}") -print(f"Size: {branch.status.logical_size_bytes} bytes") -``` - -## Listing Branches - -```python -branches = list(w.postgres.list_branches( - parent="projects/my-app" -)) - -for branch in branches: - print(f"Branch: {branch.name}") - print(f" Default: {branch.status.default}") - print(f" Protected: {branch.status.is_protected}") -``` - -## Protecting a Branch - -Protected branches cannot be deleted, reset, or archived. 
- -```python -from databricks.sdk.service.postgres import Branch, BranchSpec, FieldMask - -w.postgres.update_branch( - name="projects/my-app/branches/production", - branch=Branch( - name="projects/my-app/branches/production", - spec=BranchSpec(is_protected=True) - ), - update_mask=FieldMask(field_mask=["spec.is_protected"]) -).wait() -``` - -To remove protection: - -```python -w.postgres.update_branch( - name="projects/my-app/branches/production", - branch=Branch( - name="projects/my-app/branches/production", - spec=BranchSpec(is_protected=False) - ), - update_mask=FieldMask(field_mask=["spec.is_protected"]) -).wait() -``` - -## Updating Branch Expiration - -```python -# Extend to 14 days -w.postgres.update_branch( - name="projects/my-app/branches/development", - branch=Branch( - name="projects/my-app/branches/development", - spec=BranchSpec( - is_protected=False, - ttl=Duration(seconds=1209600) # 14 days - ) - ), - update_mask=FieldMask(field_mask=["spec.is_protected", "spec.expiration"]) -).wait() - -# Remove expiration -w.postgres.update_branch( - name="projects/my-app/branches/development", - branch=Branch( - name="projects/my-app/branches/development", - spec=BranchSpec(no_expiry=True) - ), - update_mask=FieldMask(field_mask=["spec.expiration"]) -).wait() -``` - -## Resetting a Branch from Parent - -Reset completely replaces a branch's data and schema with the latest from its parent. Local changes are lost. - -```python -w.postgres.reset_branch( - name="projects/my-app/branches/development" -).wait() -``` - -**Constraints:** -- Root branches (like `production`) cannot be reset (no parent) -- Branches with children cannot be reset (delete children first) -- Connections are temporarily interrupted during reset - -## Deleting a Branch - -```python -w.postgres.delete_branch( - name="projects/my-app/branches/development" -).wait() -``` - -**Constraints:** -- Cannot delete branches with child branches (delete children first) -- Cannot delete protected branches (remove protection first) -- Cannot delete the default branch - -## Branch Expiration - -Branch expiration sets an automatic deletion timestamp. Useful for: -- **CI/CD environments**: 2-4 hours -- **Demos**: 24-48 hours -- **Feature development**: 1-7 days -- **Long-term testing**: up to 30 days - -**Maximum expiration period:** 30 days from current time. - -### Expiration Restrictions - -- Cannot expire protected branches -- Cannot expire default branches -- Cannot expire branches that have children -- When a branch expires, all compute resources are also deleted - -## Best Practices - -1. **Use TTL for ephemeral branches**: Set expiration for dev/test branches to avoid accumulation -2. **Protect production branches**: Prevent accidental deletion or reset -3. **Reset instead of recreate**: Use reset from parent when you need fresh data without new branch overhead -4. **Schema diff before merge**: Compare schemas between branches before applying changes to production -5. **Monitor unarchived limit**: Only 10 unarchived branches are allowed per project diff --git a/databricks-skills/databricks-lakebase-autoscale/computes.md b/databricks-skills/databricks-lakebase-autoscale/computes.md deleted file mode 100644 index 0f53d50c..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/computes.md +++ /dev/null @@ -1,208 +0,0 @@ -# Lakebase Autoscaling Computes - -## Overview - -A compute is a virtualized service that runs Postgres for a branch. Each branch has one primary read-write compute and can have optional read replicas. 
Computes support autoscaling, scale-to-zero, and granular sizing from 0.5 to 112 CU. - -## Compute Sizing - -Each Compute Unit (CU) allocates approximately 2 GB of RAM. - -### Available Sizes - -| Category | Range | Notes | -|----------|-------|-------| -| **Autoscale computes** | 0.5-32 CU | Dynamic scaling within range (max-min <= 8 CU) | -| **Large fixed-size** | 36-112 CU | Fixed size, no autoscaling | - -### Representative Sizes - -| Compute Units | RAM | Max Connections | -|--------------|-----|-----------------| -| 0.5 CU | ~1 GB | 104 | -| 1 CU | ~2 GB | 209 | -| 4 CU | ~8 GB | 839 | -| 8 CU | ~16 GB | 1,678 | -| 16 CU | ~32 GB | 3,357 | -| 32 CU | ~64 GB | 4,000 | -| 64 CU | ~128 GB | 4,000 | -| 112 CU | ~224 GB | 4,000 | - -**Note:** Lakebase Provisioned used ~16 GB per CU. Autoscaling uses ~2 GB per CU for more granular scaling. - -## Creating a Compute - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, EndpointType - -w = WorkspaceClient() - -# Create a read-write compute endpoint -result = w.postgres.create_endpoint( - parent="projects/my-app/branches/production", - endpoint=Endpoint( - spec=EndpointSpec( - endpoint_type=EndpointType.ENDPOINT_TYPE_READ_WRITE, - autoscaling_limit_min_cu=0.5, - autoscaling_limit_max_cu=4.0 - ) - ), - endpoint_id="my-compute" -).wait() - -print(f"Endpoint created: {result.name}") -print(f"Host: {result.status.hosts.host}") -``` - -### CLI - -```bash -databricks postgres create-endpoint \ - projects/my-app/branches/production my-compute \ - --json '{ - "spec": { - "endpoint_type": "ENDPOINT_TYPE_READ_WRITE", - "autoscaling_limit_min_cu": 0.5, - "autoscaling_limit_max_cu": 4.0 - } - }' -``` - -**Important:** Each branch can have only one read-write compute. 
- -## Getting Compute Details - -```python -endpoint = w.postgres.get_endpoint( - name="projects/my-app/branches/production/endpoints/my-compute" -) - -print(f"Endpoint: {endpoint.name}") -print(f"Type: {endpoint.status.endpoint_type}") -print(f"State: {endpoint.status.current_state}") -print(f"Host: {endpoint.status.hosts.host}") -print(f"Min CU: {endpoint.status.autoscaling_limit_min_cu}") -print(f"Max CU: {endpoint.status.autoscaling_limit_max_cu}") -``` - -## Listing Computes - -```python -endpoints = list(w.postgres.list_endpoints( - parent="projects/my-app/branches/production" -)) - -for ep in endpoints: - print(f"Endpoint: {ep.name}") - print(f" Type: {ep.status.endpoint_type}") - print(f" CU Range: {ep.status.autoscaling_limit_min_cu}-{ep.status.autoscaling_limit_max_cu}") -``` - -## Resizing a Compute - -Use `update_mask` to specify which fields to update: - -```python -from databricks.sdk.service.postgres import Endpoint, EndpointSpec, FieldMask - -# Update min and max CU -w.postgres.update_endpoint( - name="projects/my-app/branches/production/endpoints/my-compute", - endpoint=Endpoint( - name="projects/my-app/branches/production/endpoints/my-compute", - spec=EndpointSpec( - autoscaling_limit_min_cu=2.0, - autoscaling_limit_max_cu=8.0 - ) - ), - update_mask=FieldMask(field_mask=[ - "spec.autoscaling_limit_min_cu", - "spec.autoscaling_limit_max_cu" - ]) -).wait() -``` - -### CLI - -```bash -# Update single field -databricks postgres update-endpoint \ - projects/my-app/branches/production/endpoints/my-compute \ - spec.autoscaling_limit_max_cu \ - --json '{"spec": {"autoscaling_limit_max_cu": 8.0}}' - -# Update multiple fields -databricks postgres update-endpoint \ - projects/my-app/branches/production/endpoints/my-compute \ - "spec.autoscaling_limit_min_cu,spec.autoscaling_limit_max_cu" \ - --json '{"spec": {"autoscaling_limit_min_cu": 2.0, "autoscaling_limit_max_cu": 8.0}}' -``` - -## Deleting a Compute - -```python -w.postgres.delete_endpoint( - name="projects/my-app/branches/production/endpoints/my-compute" -).wait() -``` - -## Autoscaling - -Autoscaling dynamically adjusts compute resources based on workload demand. - -### Configuration - -- **Range:** 0.5-32 CU -- **Constraint:** Max - Min cannot exceed 8 CU -- **Valid examples:** 4-8 CU, 8-16 CU, 16-24 CU -- **Invalid example:** 0.5-32 CU (range of 31.5 CU) - -### Best Practices - -- Set minimum CU large enough to cache your working set in memory -- Performance may be degraded until compute scales up and caches data -- Connection limits are based on the maximum CU in the range - -## Scale-to-Zero - -Automatically suspends compute after a period of inactivity. - -| Setting | Description | -|---------|-------------| -| **Enabled** | Compute suspends after inactivity timeout (saves cost) | -| **Disabled** | Always-active compute (eliminates wake-up latency) | - -**Default behavior:** -- `production` branch: Scale-to-zero **disabled** (always active) -- Other branches: Scale-to-zero can be configured - -**Default inactivity timeout:** 5 minutes -**Minimum inactivity timeout:** 60 seconds - -### Wake-up Behavior - -When a connection arrives on a suspended compute: -1. Compute starts automatically (reactivation takes a few hundred milliseconds) -2. The connection request is handled transparently once active -3. Compute restarts at minimum autoscaling size (if autoscaling enabled) -4. 
Applications should implement connection retry logic for the brief reactivation period - -### Session Context After Reactivation - -When a compute suspends and reactivates, session context is **reset**: -- In-memory statistics and cache contents are cleared -- Temporary tables and prepared statements are lost -- Session-specific configuration settings reset -- Connection pools and active transactions are terminated - -If your application requires persistent session data, consider disabling scale-to-zero. - -## Sizing Guidance - -| Factor | Recommendation | -|--------|---------------| -| Query complexity | Complex analytical queries benefit from larger computes | -| Concurrent connections | More connections need more CPU and memory | -| Data volume | Larger datasets may need more memory for performance | -| Response time | Critical apps may require larger computes | diff --git a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md b/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md deleted file mode 100644 index 398862b3..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/connection-patterns.md +++ /dev/null @@ -1,304 +0,0 @@ -# Lakebase Autoscaling Connection Patterns - -## Overview - -This document covers different connection patterns for Lakebase Autoscaling, from simple scripts to production applications with token refresh. - -## Authentication Methods - -Lakebase Autoscaling supports two authentication methods: - -| Method | Token Lifetime | Best For | -|--------|---------------|----------| -| **OAuth tokens** | 1 hour (must refresh) | Interactive sessions, workspace-integrated apps | -| **Native Postgres passwords** | No expiry | Long-running processes, tools without token rotation | - -**Connection timeouts (both methods):** -- **24-hour idle timeout**: Connections with no activity for 24 hours are automatically closed -- **3-day maximum connection life**: Connections alive for more than 3 days may be closed - -Design your applications to handle connection timeouts with retry logic. - -## Connection Methods - -### 1. Direct psycopg Connection (Simple Scripts) - -For one-off scripts or notebooks: - -```python -import psycopg -from databricks.sdk import WorkspaceClient - -def get_connection(project_id: str, branch_id: str = "production", - endpoint_id: str = None, database_name: str = "databricks_postgres"): - """Get a database connection with fresh OAuth token.""" - w = WorkspaceClient() - - # Get endpoint details to find the host - if endpoint_id: - ep_name = f"projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id}" - else: - # List endpoints and pick the primary R/W one - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{project_id}/branches/{branch_id}" - )) - ep_name = endpoints[0].name - - endpoint = w.postgres.get_endpoint(name=ep_name) - host = endpoint.status.hosts.host - - # Generate OAuth token (valid for 1 hour) - cred = w.postgres.generate_database_credential(endpoint=ep_name) - - # Build connection string - conn_string = ( - f"host={host} " - f"dbname={database_name} " - f"user={w.current_user.me().user_name} " - f"password={cred.token} " - f"sslmode=require" - ) - - return psycopg.connect(conn_string) - -# Usage -with get_connection("my-app") as conn: - with conn.cursor() as cur: - cur.execute("SELECT NOW()") - print(cur.fetchone()) -``` - -### 2. 
Connection Pool with Token Refresh (Production) - -For long-running applications that need connection pooling: - -```python -import asyncio -import uuid -from contextlib import asynccontextmanager -from typing import AsyncGenerator, Optional - -from sqlalchemy import event -from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker -from databricks.sdk import WorkspaceClient - - -class LakebaseAutoscaleConnectionManager: - """Manages Lakebase Autoscaling connections with automatic token refresh.""" - - def __init__( - self, - project_id: str, - branch_id: str = "production", - database_name: str = "databricks_postgres", - pool_size: int = 5, - max_overflow: int = 10, - token_refresh_seconds: int = 3000 # 50 minutes - ): - self.project_id = project_id - self.branch_id = branch_id - self.database_name = database_name - self.pool_size = pool_size - self.max_overflow = max_overflow - self.token_refresh_seconds = token_refresh_seconds - - self._current_token: Optional[str] = None - self._refresh_task: Optional[asyncio.Task] = None - self._engine = None - self._session_maker = None - - def _generate_token(self) -> str: - """Generate fresh OAuth token.""" - w = WorkspaceClient() - # Get primary endpoint name for token scoping - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{self.project_id}/branches/{self.branch_id}" - )) - endpoint_name = endpoints[0].name if endpoints else None - cred = w.postgres.generate_database_credential(endpoint=endpoint_name) - return cred.token - - def _get_host(self) -> str: - """Get the connection host from the primary endpoint.""" - w = WorkspaceClient() - endpoints = list(w.postgres.list_endpoints( - parent=f"projects/{self.project_id}/branches/{self.branch_id}" - )) - if not endpoints: - raise RuntimeError( - f"No endpoints found for projects/{self.project_id}/branches/{self.branch_id}" - ) - endpoint = w.postgres.get_endpoint(name=endpoints[0].name) - return endpoint.status.hosts.host - - async def _refresh_loop(self): - """Background task to refresh token periodically.""" - while True: - await asyncio.sleep(self.token_refresh_seconds) - try: - self._current_token = await asyncio.to_thread(self._generate_token) - except Exception as e: - print(f"Token refresh failed: {e}") - - def initialize(self): - """Initialize database engine and start token refresh.""" - w = WorkspaceClient() - - # Get host info - host = self._get_host() - username = w.current_user.me().user_name - - # Generate initial token - self._current_token = self._generate_token() - - # Create engine (password injected via event) - url = ( - f"postgresql+psycopg://{username}@" - f"{host}:5432/{self.database_name}" - ) - - self._engine = create_async_engine( - url, - pool_size=self.pool_size, - max_overflow=self.max_overflow, - pool_recycle=3600, - connect_args={"sslmode": "require"} - ) - - # Inject token on connect - @event.listens_for(self._engine.sync_engine, "do_connect") - def inject_token(dialect, conn_rec, cargs, cparams): - cparams["password"] = self._current_token - - self._session_maker = async_sessionmaker( - self._engine, - class_=AsyncSession, - expire_on_commit=False - ) - - def start_refresh(self): - """Start background token refresh task.""" - if not self._refresh_task: - self._refresh_task = asyncio.create_task(self._refresh_loop()) - - async def stop_refresh(self): - """Stop token refresh task.""" - if self._refresh_task: - self._refresh_task.cancel() - try: - await self._refresh_task - except asyncio.CancelledError: - pass - 
self._refresh_task = None - - @asynccontextmanager - async def session(self) -> AsyncGenerator[AsyncSession, None]: - """Get a database session.""" - async with self._session_maker() as session: - yield session - - async def close(self): - """Close all connections.""" - await self.stop_refresh() - if self._engine: - await self._engine.dispose() - - -# Usage in FastAPI -from fastapi import FastAPI - -app = FastAPI() -db_manager = LakebaseAutoscaleConnectionManager("my-app", "production", "my_database") - -@app.on_event("startup") -async def startup(): - db_manager.initialize() - db_manager.start_refresh() - -@app.on_event("shutdown") -async def shutdown(): - await db_manager.close() - -@app.get("/data") -async def get_data(): - async with db_manager.session() as session: - result = await session.execute("SELECT * FROM my_table") - return result.fetchall() -``` - -### 3. Static URL Mode (Local Development) - -For local development, use a static connection URL: - -```python -import os -from sqlalchemy.ext.asyncio import create_async_engine - -# Set environment variable with full connection URL -# LAKEBASE_PG_URL=postgresql://user:password@host:5432/database - -def get_database_url() -> str: - """Get database URL from environment.""" - url = os.environ.get("LAKEBASE_PG_URL") - if url and url.startswith("postgresql://"): - # Convert to psycopg3 async driver - url = url.replace("postgresql://", "postgresql+psycopg://", 1) - return url - -engine = create_async_engine( - get_database_url(), - pool_size=5, - connect_args={"sslmode": "require"} -) -``` - -### 4. DNS Resolution Workaround (macOS) - -Python's `socket.getaddrinfo()` fails with long hostnames on macOS. Use `dig` as fallback: - -```python -import subprocess -import socket - -def resolve_hostname(hostname: str) -> str: - """Resolve hostname using dig command (macOS workaround).""" - try: - return socket.gethostbyname(hostname) - except socket.gaierror: - pass - - try: - result = subprocess.run( - ["dig", "+short", hostname], - capture_output=True, text=True, timeout=5 - ) - ips = result.stdout.strip().split('\n') - for ip in ips: - if ip and not ip.startswith(';'): - return ip - except Exception: - pass - - raise RuntimeError(f"Could not resolve hostname: {hostname}") - -# Use with psycopg -conn_params = { - "host": hostname, # For TLS SNI - "hostaddr": resolve_hostname(hostname), # Actual IP - "dbname": database_name, - "user": username, - "password": token, - "sslmode": "require" -} -conn = psycopg.connect(**conn_params) -``` - -## Best Practices - -1. **Always use SSL**: Set `sslmode=require` in all connections -2. **Implement token refresh**: Tokens expire after 1 hour; refresh at 50 minutes -3. **Use connection pooling**: Avoid creating new connections per request -4. **Handle DNS issues on macOS**: Use the `hostaddr` workaround if needed -5. **Close connections properly**: Use context managers or explicit cleanup -6. **Handle scale-to-zero wake-up**: First connection after idle may take 2-5 seconds -7. **Log token refresh events**: Helps debug authentication issues diff --git a/databricks-skills/databricks-lakebase-autoscale/connections.md b/databricks-skills/databricks-lakebase-autoscale/connections.md new file mode 100644 index 00000000..0831a788 --- /dev/null +++ b/databricks-skills/databricks-lakebase-autoscale/connections.md @@ -0,0 +1,212 @@ +# Lakebase Autoscaling connection patterns + +Order of preference: + +1. **Canonical:** `psycopg_pool.ConnectionPool` + `OAuthConnection` subclass + `max_lifetime=2700`. +2. 
**SQLAlchemy:** official `do_connect` auth hook; optionally rely on `pool_recycle`/`dispose()` rather than a background token loop. +3. **Direct `psycopg.connect`:** notebooks/one-shot scripts under 1 hour. +4. **Static Postgres URL/native password:** local/dev tools only, or tools unable to rotate OAuth credentials. + +## Authentication facts + +Lakebase OAuth database credentials: +- Mint with `WorkspaceClient().postgres.generate_database_credential(endpoint=...)`. +- Use `cred.token` as the Postgres password. +- Expire after about 1 hour. +- Expiry is enforced at login; already-open connections continue until closed by pool/platform timeouts. + +Critical warning: + +```python +# ✅ Lakebase-scoped credential: works for Postgres login +cred = w.postgres.generate_database_credential(endpoint=endpoint_name) +password = cred.token + +# ❌ Workspace-scoped token: fails at Postgres login +password = w.config.oauth_token().access_token +# also do not use WorkspaceClient().config.token +``` + +Always connect with `sslmode=require`. + +## 1. Canonical: psycopg pool + OAuthConnection + +Use for production Databricks Apps and most Python services. + +Key mechanics: +- The pool calls `OAuthConnection.connect()` whenever it opens a physical connection: initial fill, growth under load, recycle, replacement after failure. +- `connect()` mints a fresh Lakebase token just-in-time and injects it as `password`. +- `max_lifetime=2700` recycles physical connections after 45 minutes, before 1-hour token expiry. +- No background refresh thread/task is needed. + +Minimal skeleton: + +```python +import os +import psycopg +from psycopg_pool import ConnectionPool +from databricks.sdk import WorkspaceClient + +w = WorkspaceClient() + +class OAuthConnection(psycopg.Connection): + @classmethod + def connect(cls, conninfo="", **kwargs): + cred = w.postgres.generate_database_credential( + endpoint=os.environ["ENDPOINT_NAME"] + ) + kwargs["password"] = cred.token + return super().connect(conninfo, **kwargs) + +pool = ConnectionPool( + conninfo=( + f"dbname={os.environ['PGDATABASE']} " + f"user={os.environ['PGUSER']} " + f"host={os.environ['PGHOST']} " + f"port={os.environ.get('PGPORT', '5432')} " + f"sslmode={os.environ.get('PGSSLMODE', 'require')}" + ), + connection_class=OAuthConnection, + min_size=1, + max_size=10, + max_lifetime=2700, + open=True, +) +``` + +Prefer `2700`; it is a defensive convention. The official Databricks tutorial leaves `max_lifetime` unset; `databricks-ai-bridge` uses `2700`. + +For FastAPI or explicit startup: +- instantiate with `open=False` +- call `pool.open(wait=True, timeout=30.0)` in lifespan/startup +- call `pool.close()` on shutdown + +This also avoids relying on implicit open behavior. + +## Databricks Apps environment variables + +When adding a Lakebase/Postgres resource to a Databricks App, these are auto-injected for the **first** DB resource: + +```text +PGAPPNAME +PGHOST +PGPORT +PGDATABASE +PGUSER +PGSSLMODE +``` + +Gotchas: +- `PGUSER` is typically the app service principal client ID. +- Only the first database resource is auto-injected; additional resources need explicit `valueFrom`. +- `ENDPOINT_NAME` is **not** auto-injected. Add it manually because `generate_database_credential(endpoint=...)` requires the full endpoint path: + +```yaml +env: + - name: ENDPOINT_NAME + value: "projects//branches//endpoints/" +``` + +## 2. SQLAlchemy: official `do_connect` hook + +Use when the app is already built around SQLAlchemy. 
+ +Important distinction: +- `do_connect` is the official Databricks-recommended SQLAlchemy credential injection hook and is used by `databricks-ai-bridge`. +- The community/extra-complexity variant is adding a background `asyncio.Task` token-refresh loop. Demote that loop, not `do_connect`. + +Recommended hook shape: + +```python +from sqlalchemy import event +from sqlalchemy.ext.asyncio import create_async_engine +from databricks.sdk import WorkspaceClient + +w = WorkspaceClient() +endpoint_name = "projects/my-app/branches/production/endpoints/ep-primary" +host = w.postgres.get_endpoint(name=endpoint_name).status.hosts.host +user = w.current_user.me().user_name + +engine = create_async_engine( + f"postgresql+psycopg://{user}@{host}:5432/databricks_postgres", + connect_args={"sslmode": "require"}, + pool_recycle=2700, +) + +@event.listens_for(engine.sync_engine, "do_connect") +def inject_lakebase_token(dialect, conn_rec, cargs, cparams): + cred = w.postgres.generate_database_credential(endpoint=endpoint_name) + cparams["password"] = cred.token +``` + +Notes: +- `do_connect` fires when SQLAlchemy opens a new DBAPI connection. +- `pool_recycle=2700` approximates the psycopg-pool pattern. +- If you need deterministic refresh, prefer scheduled `engine.dispose()` and let the next checkout re-open with `do_connect`. +- A background token cache/refresh task is optional complexity and can create stale-token races if implemented poorly. + +## 3. Direct psycopg for notebooks/scripts + +Only for short-lived sessions where connections are opened and used immediately. + +Recipe: +1. Build endpoint path. +2. `get_endpoint(...).status.hosts.host`. +3. `generate_database_credential(endpoint=endpoint_name)`. +4. `psycopg.connect(host=host, dbname="databricks_postgres", user=, password=cred.token, sslmode="require")`. + +Use `w.current_user.me().user_name` for user in notebooks/manual scripts. In Databricks Apps, prefer `PGUSER`. + +## 4. Static URL / native password + +Use only for local development, legacy tools, or clients that cannot rotate OAuth database credentials. For SQLAlchemy + psycopg3, normalize: + +```text +postgresql://... -> postgresql+psycopg://... +``` + +Still set `sslmode=require`. + +## Endpoint discovery + +Avoid hardcoding host if you can hardcode the endpoint name instead: + +```python +ep = w.postgres.get_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary" +) +host = ep.status.hosts.host +``` + +If no endpoint ID is known, list under branch and choose deliberately: + +```python +endpoints = list(w.postgres.list_endpoints( + parent="projects/my-app/branches/production" +)) +``` + +Do not assume the first endpoint is the primary if read replicas exist; check endpoint type/status. + +## DNS workaround for macOS + +Some macOS/Python resolver combinations fail on long Lakebase hostnames. + +Workaround: +- Resolve the hostname externally, commonly with `dig +short `. +- Pass both: + - `host=` for TLS/SNI/certificate validation. + - `hostaddr=` for the actual TCP connection. + +psycopg3 supports `hostaddr`. + +## Timeouts, scale-to-zero, and retries + +Plan for: +- 1-hour Lakebase OAuth token lifetime at login. +- 24-hour idle connection timeout. +- 3-day maximum connection lifetime. +- Scale-to-zero wake-up latency; first connection/query after suspension may need retry/backoff. +- After suspension/reactivation: session context is reset, temp tables/prepared statements are gone, active transactions/connections are terminated. 
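+
+Given the timeouts above, a defensive sketch (the helper name is illustrative, and `pool` is the §1 connection pool): retry a checkout with simple backoff so scale-to-zero wake-ups and recycled connections are absorbed transparently.
+
+```python
+import time
+import psycopg
+
+def query_with_retry(pool, sql, params=None, attempts=3, backoff=1.0):
+    """Run one query, retrying transient connection errors such as wake-up after suspension."""
+    for attempt in range(attempts):
+        try:
+            with pool.connection() as conn:  # checkout from psycopg_pool, returned on exit
+                with conn.cursor() as cur:
+                    cur.execute(sql, params)
+                    return cur.fetchall()
+        except psycopg.OperationalError:
+            if attempt == attempts - 1:
+                raise
+            time.sleep(backoff * (attempt + 1))  # linear backoff between attempts
+
+rows = query_with_retry(pool, "SELECT now()")
+```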
+ +Use context managers so pooled connections return promptly. diff --git a/databricks-skills/databricks-lakebase-autoscale/operations.md b/databricks-skills/databricks-lakebase-autoscale/operations.md new file mode 100644 index 00000000..982bfb58 --- /dev/null +++ b/databricks-skills/databricks-lakebase-autoscale/operations.md @@ -0,0 +1,297 @@ +# Lakebase Autoscaling operations + +Use `WorkspaceClient().postgres` for Autoscaling projects, branches, endpoints, roles, and credentials. Most create/update/delete methods return long-running operations; call `.wait()`. + +```python +from databricks.sdk import WorkspaceClient +w = WorkspaceClient() +``` + +## Resource names + +```text +Project: projects/{project_id} +Branch: projects/{project_id}/branches/{branch_id} +Endpoint: projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id} +``` + +Project ID rules: +- 1–63 chars +- lowercase letters, digits, hyphens +- cannot start/end with hyphen +- immutable after creation + +Default database: `databricks_postgres`. + +## Projects + +Create: + +```python +from databricks.sdk.service.postgres import Project, ProjectSpec + +project = w.postgres.create_project( + project=Project(spec=ProjectSpec(display_name="My App", pg_version="17")), + project_id="my-app", +).wait() +``` + +Project defaults: +- `production` branch +- primary read-write endpoint +- `databricks_postgres` database +- role for creator’s Databricks identity +- production scale-to-zero disabled by default + +GET gotcha: effective properties are typically in `project.status`, not `project.spec`. + +Update requires `FieldMask`: + +```python +from databricks.sdk.service.postgres import FieldMask + +w.postgres.update_project( + name="projects/my-app", + project=Project( + name="projects/my-app", + spec=ProjectSpec(display_name="New Name"), + ), + update_mask=FieldMask(field_mask=["spec.display_name"]), +).wait() +``` + +Delete is destructive and permanent; delete dependent Unity Catalog catalogs/synced tables first where applicable: + +```python +w.postgres.delete_project(name="projects/my-app").wait() +``` + +## Branches + +Branches are copy-on-write isolated database environments. Use them for dev/test/staging, schema-change validation, point-in-time recovery workflows, and ephemeral CI. + +Create branch from current parent: + +```python +from databricks.sdk.service.postgres import Branch, BranchSpec, Duration + +branch = w.postgres.create_branch( + parent="projects/my-app", + branch=Branch(spec=BranchSpec( + source_branch="projects/my-app/branches/production", + ttl=Duration(seconds=604800), # or no_expiry=True + )), + branch_id="development", +).wait() +``` + +Keep: +- `ttl=Duration(seconds=...)` for ephemeral branches. +- `no_expiry=True` for permanent branches. +- Max expiration period: 30 days from current time. +- Only 10 unarchived branches per project. +- Protected branches cannot be deleted, reset, archived, or expired. +- Default branch cannot be deleted or expired. +- Branches with children cannot be deleted, reset, or expired; delete children first. +- Reset replaces branch data/schema with latest parent and interrupts connections. 
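+
+Expiration can also be changed after creation. A sketch for extending or removing a TTL on an existing branch; the `spec.expiration` mask path is an assumption and may differ by SDK version.
+
+```python
+from databricks.sdk.service.postgres import Branch, BranchSpec, Duration, FieldMask
+
+# Extend the dev branch to a 14-day TTL.
+w.postgres.update_branch(
+    name="projects/my-app/branches/development",
+    branch=Branch(
+        name="projects/my-app/branches/development",
+        spec=BranchSpec(ttl=Duration(seconds=1209600)),  # 14 days
+    ),
+    update_mask=FieldMask(field_mask=["spec.expiration"]),
+).wait()
+
+# Or remove the expiration entirely.
+w.postgres.update_branch(
+    name="projects/my-app/branches/development",
+    branch=Branch(
+        name="projects/my-app/branches/development",
+        spec=BranchSpec(no_expiry=True),
+    ),
+    update_mask=FieldMask(field_mask=["spec.expiration"]),
+).wait()
+```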
+ +Protect production: + +```python +w.postgres.update_branch( + name="projects/my-app/branches/production", + branch=Branch( + name="projects/my-app/branches/production", + spec=BranchSpec(is_protected=True), + ), + update_mask=FieldMask(field_mask=["spec.is_protected"]), +).wait() +``` + +Reset/delete: + +```python +w.postgres.reset_branch(name="projects/my-app/branches/development").wait() +w.postgres.delete_branch(name="projects/my-app/branches/development").wait() +``` + +Branch status fields worth inspecting: +- `status.default` +- `status.is_protected` +- `status.current_state` +- `status.logical_size_bytes` +- `status.expire_time` + +## Endpoints / computes + +A compute endpoint runs Postgres for a branch. Each branch has at most one primary read-write endpoint and may have read-only replica endpoints. + +Create endpoint: + +```python +from databricks.sdk.service.postgres import Endpoint, EndpointSpec, EndpointType + +ep = w.postgres.create_endpoint( + parent="projects/my-app/branches/production", + endpoint=Endpoint(spec=EndpointSpec( + endpoint_type=EndpointType.ENDPOINT_TYPE_READ_WRITE, + autoscaling_limit_min_cu=0.5, + autoscaling_limit_max_cu=4.0, + )), + endpoint_id="ep-primary", +).wait() +``` + +Get host: + +```python +host = w.postgres.get_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary" +).status.hosts.host +``` + +Resize with update mask: + +```python +w.postgres.update_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary", + endpoint=Endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary", + spec=EndpointSpec( + autoscaling_limit_min_cu=2.0, + autoscaling_limit_max_cu=8.0, + ), + ), + update_mask=FieldMask(field_mask=[ + "spec.autoscaling_limit_min_cu", + "spec.autoscaling_limit_max_cu", + ]), +).wait() +``` + +Delete: + +```python +w.postgres.delete_endpoint( + name="projects/my-app/branches/production/endpoints/ep-primary" +).wait() +``` + +## Compute sizing + +Autoscaling uses ~2 GB RAM per CU. + +| CU | Approx RAM | Max connections | +|---:|---:|---:| +| 0.5 | ~1 GB | 104 | +| 1 | ~2 GB | 209 | +| 4 | ~8 GB | 839 | +| 8 | ~16 GB | 1,678 | +| 16 | ~32 GB | 3,357 | +| 32 | ~64 GB | 4,000 | +| 64 | ~128 GB | 4,000 | +| 112 | ~224 GB | 4,000 | + +Rules: +- Autoscale range: 0.5–32 CU. +- `autoscaling_limit_max_cu - autoscaling_limit_min_cu <= 16`. +- Valid: 4–20, 8–16, 16–32. +- Invalid: 0.5–32 (spread of 31.5 exceeds 16). +- Fixed-size always-on computes: 40–112 CU; no autoscaling. +- Connection limit is based on max CU. +- Set min CU high enough for working-set cache and latency needs. + +## Scale-to-zero + +Defaults: +- `production`: disabled by default. +- Other branches: configurable. +- Default inactivity timeout: 5 minutes. +- Minimum inactivity timeout: 60 seconds. + +Wake-up: +- First connection wakes compute automatically. +- Apps should use retry/backoff for the brief reactivation period. +- Reactivated compute starts at minimum autoscaling size. + +Session reset after suspension: +- temp tables gone +- prepared statements gone +- in-memory stats/cache cleared +- session settings reset +- active transactions/connections terminated + +Disable scale-to-zero for latency-critical apps or apps relying on persistent session state. 
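+
+A small audit sketch that walks one project and prints each endpoint's type, state, and CU range, useful for spotting computes that never suspend. Field names follow the examples above; the project ID is illustrative.
+
+```python
+# Walk branches and endpoints of a project and report compute configuration.
+for branch in w.postgres.list_branches(parent="projects/my-app"):
+    print(f"{branch.name} (default={branch.status.default}, "
+          f"protected={branch.status.is_protected})")
+    for ep in w.postgres.list_endpoints(parent=branch.name):
+        print(f"  {ep.name}: type={ep.status.endpoint_type}, "
+              f"state={ep.status.current_state}, "
+              f"cu={ep.status.autoscaling_limit_min_cu}-{ep.status.autoscaling_limit_max_cu}")
+```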
+ +## Project limits + +| Resource | Limit | +|---|---:| +| Projects per workspace | 1000 | +| Branches per project | 500 | +| Unarchived branches | 10 | +| Root branches | 3 | +| Protected branches | 1 | +| Concurrently active computes | 20 | +| Postgres roles per branch | 500 | +| Postgres databases per branch | 500 | +| Logical data size per branch | 8 TB | +| Snapshots | 10 | +| Maximum history retention | 35 days | +| Minimum scale-to-zero time | 60 sec | + +## CLI names + +CLI mirrors the SDK under `databricks postgres`, for example: +- `create-project`, `get-project`, `list-projects`, `update-project`, `delete-project` +- `create-branch`, `list-branches`, `reset-branch`, `delete-branch` +- `create-endpoint`, `get-endpoint`, `list-endpoints`, `update-endpoint`, `delete-endpoint` + +## MCP tools + +Use `type="autoscale"` for Lakebase Autoscaling. + +### `manage_lakebase_database` + +Actions: +- `create_or_update`: requires `name`; useful params include `display_name`, `pg_version` +- `get`: requires `name` +- `list`: optional type filter +- `delete`: requires `name` + +Example intent: + +```python +manage_lakebase_database( + action="create_or_update", + name="my-app", + type="autoscale", + display_name="My Application", + pg_version="17", +) +``` + +### `manage_lakebase_branch` + +Actions: +- `create_or_update`: requires `project_name`, `branch_id` +- `delete`: requires full branch `name` + +Useful params: +- `source_branch` +- `ttl_seconds` +- `autoscaling_limit_min_cu` +- `autoscaling_limit_max_cu` +- `scale_to_zero_seconds` + +### `generate_lakebase_credential` + +Generate a Lakebase-scoped database credential: + +```python +generate_lakebase_credential( + endpoint="projects/my-app/branches/production/endpoints/ep-primary" +) +``` + +Use returned token as the Postgres password with `sslmode=require`. diff --git a/databricks-skills/databricks-lakebase-autoscale/projects.md b/databricks-skills/databricks-lakebase-autoscale/projects.md deleted file mode 100644 index 659207a4..00000000 --- a/databricks-skills/databricks-lakebase-autoscale/projects.md +++ /dev/null @@ -1,204 +0,0 @@ -# Lakebase Autoscaling Projects - -## Overview - -A project is the top-level container for Lakebase Autoscaling resources, including branches, computes, databases, and roles. Each project is isolated and contains its own Postgres version, compute defaults, and restore window settings. - -## Project Structure - -``` -Project - └── Branches (production, development, staging, etc.) 
- ├── Computes (R/W compute, read replicas) - ├── Roles (Postgres roles) - └── Databases (Postgres databases) -``` - -When a project is created, it includes by default: -- A `production` branch (the default branch) -- A primary read-write compute (8-32 CU, autoscaling enabled, scale-to-zero disabled) -- A `databricks_postgres` database -- A Postgres role for the creating user's Databricks identity - -## Resource Naming - -Projects follow a hierarchical naming convention: -``` -projects/{project_id} -``` - -**Resource ID requirements:** -- 1-63 characters long -- Lowercase letters, digits, and hyphens only -- Cannot start or end with a hyphen -- Cannot be changed after creation - -## Creating a Project - -### Python SDK - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.postgres import Project, ProjectSpec - -w = WorkspaceClient() - -# Create a project (long-running operation) -operation = w.postgres.create_project( - project=Project( - spec=ProjectSpec( - display_name="My Application", - pg_version="17" - ) - ), - project_id="my-app" -) - -# Wait for completion -result = operation.wait() -print(f"Created project: {result.name}") -print(f"Display name: {result.status.display_name}") -print(f"Postgres version: {result.status.pg_version}") -``` - -### CLI - -```bash -databricks postgres create-project \ - --project-id my-app \ - --json '{ - "spec": { - "display_name": "My Application", - "pg_version": "17" - } - }' -``` - -## Getting Project Details - -### Python SDK - -```python -project = w.postgres.get_project(name="projects/my-app") - -print(f"Project: {project.name}") -print(f"Display name: {project.status.display_name}") -print(f"Postgres version: {project.status.pg_version}") -``` - -### CLI - -```bash -databricks postgres get-project projects/my-app -``` - -**Note:** The `spec` field is not populated for GET operations. All properties are returned in the `status` field. - -## Listing Projects - -```python -projects = w.postgres.list_projects() - -for project in projects: - print(f"Project: {project.name}") - print(f" Display name: {project.status.display_name}") - print(f" Postgres version: {project.status.pg_version}") -``` - -## Updating a Project - -Updates require an `update_mask` specifying which fields to modify: - -```python -from databricks.sdk.service.postgres import Project, ProjectSpec, FieldMask - -# Update display name -operation = w.postgres.update_project( - name="projects/my-app", - project=Project( - name="projects/my-app", - spec=ProjectSpec( - display_name="My Updated Application" - ) - ), - update_mask=FieldMask(field_mask=["spec.display_name"]) -) -result = operation.wait() -``` - -### CLI - -```bash -databricks postgres update-project projects/my-app spec.display_name \ - --json '{ - "spec": { - "display_name": "My Updated Application" - } - }' -``` - -## Deleting a Project - -**WARNING:** Deleting a project is permanent and also deletes all branches, computes, databases, roles, and data. - -Delete all Unity Catalog catalogs and synced tables before deleting the project. - -```python -operation = w.postgres.delete_project(name="projects/my-app") -# This is a long-running operation -``` - -### CLI - -```bash -databricks postgres delete-project projects/my-app -``` - -## Project Settings - -### Compute Defaults - -Default settings for new primary computes: -- Compute size range (0.5-112 CU) -- Scale-to-zero timeout (default: 5 minutes) - -### Instant Restore - -Configure the restore window length (2-35 days). 
-Longer windows increase storage costs.
-
-### Postgres Version
-
-Supports Postgres 16 and Postgres 17.
-
-## Project Limits
-
-| Resource | Limit |
-|----------|-------|
-| Concurrently active computes | 20 |
-| Branches per project | 500 |
-| Postgres roles per branch | 500 |
-| Postgres databases per branch | 500 |
-| Logical data size per branch | 8 TB |
-| Projects per workspace | 1000 |
-| Protected branches | 1 |
-| Root branches | 3 |
-| Unarchived branches | 10 |
-| Snapshots | 10 |
-| Maximum history retention | 35 days |
-| Minimum scale-to-zero time | 60 seconds |
-
-## Long-Running Operations
-
-All create, update, and delete operations return a long-running operation (LRO). Use `.wait()` in the SDK to block until completion:
-
-```python
-# Start operation
-operation = w.postgres.create_project(...)
-
-# Wait for completion
-result = operation.wait()
-
-# Or check status manually
-op_status = w.postgres.get_operation(name=operation.name)
-print(f"Done: {op_status.done}")
-```
diff --git a/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md b/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md
index f983eebb..949f91b6 100644
--- a/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md
+++ b/databricks-skills/databricks-lakebase-autoscale/reverse-etl.md
@@ -1,56 +1,59 @@
-# Reverse ETL with Lakebase Autoscaling
+# Reverse ETL / synced tables

-## Overview
+Reverse ETL syncs Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables for OLTP access.

-Reverse ETL allows you to sync data from Unity Catalog Delta tables into Lakebase Autoscaling as PostgreSQL tables. This enables OLTP access patterns on data processed in the Lakehouse.
+Note the namespace split:
+- Lakebase Autoscaling infrastructure: `w.postgres`
+- Synced tables: `w.database`

-## How It Works
+Reverse ETL is Delta-to-Postgres only; Postgres-to-Delta sync is not supported here.

-Synced tables create a managed copy of Unity Catalog data in Lakebase:
+## How synced tables work

-1. A new Unity Catalog table (read-only, managed by the sync pipeline)
-2. A Postgres table in Lakebase (queryable by applications)
+A synced table creates and maintains two objects:
+1. A read-only Unity Catalog table managed by the sync pipeline.
+2. A PostgreSQL table in Lakebase that applications query.

-The sync pipeline uses managed Lakeflow Spark Declarative Pipelines to continuously update both tables.
+The sync pipeline uses managed Lakeflow Spark Declarative Pipelines.

-### Performance
+Performance planning:
+- Continuous writes: ~1,200 rows/sec per CU.
+- Bulk writes: ~15,000 rows/sec per CU.
+- Each synced table can use up to 16 Postgres connections.
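+
+As a rough illustration of how these planning numbers combine, here is a back-of-the-envelope sizing sketch; the per-CU throughput constants are the approximate figures quoted above, while `expected_rows_per_sec`, `synced_table_count`, `provisioned_cu`, and `initial_load_rows` are made-up workload inputs, not values returned by any API:
+
+```python
+# Approximate documented throughput and connection figures (see bullets above).
+CONTINUOUS_ROWS_PER_SEC_PER_CU = 1_200
+BULK_ROWS_PER_SEC_PER_CU = 15_000
+MAX_CONNECTIONS_PER_SYNCED_TABLE = 16
+
+expected_rows_per_sec = 3_000   # hypothetical steady change rate of the source table
+synced_table_count = 2          # hypothetical number of synced tables on the endpoint
+provisioned_cu = 4              # hypothetical compute size during the initial load
+initial_load_rows = 50_000_000  # hypothetical source table row count
+
+# Write headroom needed for continuous sync, and connections the pipelines may hold.
+cu_for_continuous_sync = expected_rows_per_sec / CONTINUOUS_ROWS_PER_SEC_PER_CU
+reserved_connections = synced_table_count * MAX_CONNECTIONS_PER_SYNCED_TABLE
+
+# Rough duration of the initial bulk load, assuming throughput scales ~linearly with CU.
+initial_load_minutes = initial_load_rows / (BULK_ROWS_PER_SEC_PER_CU * provisioned_cu) / 60
+
+print(f"~{cu_for_continuous_sync:.1f} CU of write headroom for continuous sync")
+print(f"plan for up to {reserved_connections} Postgres connections held by synced tables")
+print(f"initial load: roughly {initial_load_minutes:.0f} minutes at {provisioned_cu} CU")
+```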
-- **Continuous writes:** ~1,200 rows/sec per CU -- **Bulk writes:** ~15,000 rows/sec per CU -- **Connections used:** Up to 16 per synced table +## Sync modes -## Sync Modes +| Mode | Behavior | Use when | CDF required | +|---|---|---|---| +| `SNAPSHOT` | one-time full copy | initial loads, historical copy, large replacement | no | +| `TRIGGERED` | scheduled/on-demand incremental updates | hourly/daily operational refresh | yes | +| `CONTINUOUS` | streaming updates, seconds latency | live applications | yes | -| Mode | Description | Best For | Notes | -|------|-------------|----------|-------| -| **Snapshot** | One-time full copy | Initial setup, historical analysis | 10x more efficient if modifying >10% of data | -| **Triggered** | Scheduled updates on demand | Dashboards updated hourly/daily | Requires CDF on source table | -| **Continuous** | Real-time streaming (seconds of latency) | Live applications | Highest cost, minimum 15s intervals, requires CDF | - -**Note:** Triggered and Continuous modes require Change Data Feed (CDF) enabled on the source table: +Triggered and Continuous require Delta Change Data Feed on the source table: ```sql -ALTER TABLE your_catalog.your_schema.your_table -SET TBLPROPERTIES (delta.enableChangeDataFeed = true) +ALTER TABLE catalog.schema.table +SET TBLPROPERTIES (delta.enableChangeDataFeed = true); ``` -## Creating Synced Tables +Snapshot can be more efficient when modifying >10% of the data. + +## Create a synced table -### Using Python SDK +Use `databricks.sdk.service.database` models: ```python from databricks.sdk import WorkspaceClient from databricks.sdk.service.database import ( - SyncedDatabaseTable, - SyncedTableSpec, NewPipelineSpec, + SyncedDatabaseTable, SyncedTableSchedulingPolicy, + SyncedTableSpec, ) w = WorkspaceClient() -# Create a synced table -synced_table = w.database.create_synced_database_table( +w.database.create_synced_database_table( SyncedDatabaseTable( name="lakebase_catalog.schema.synced_table", spec=SyncedTableSpec( @@ -59,55 +62,35 @@ synced_table = w.database.create_synced_database_table( scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED, new_pipeline_spec=NewPipelineSpec( storage_catalog="lakebase_catalog", - storage_schema="staging" - ) + storage_schema="staging", + ), ), ) ) -print(f"Created synced table: {synced_table.name}") ``` -### Using CLI - -```bash -databricks database create-synced-database-table \ - --json '{ - "name": "lakebase_catalog.schema.synced_table", - "spec": { - "source_table_full_name": "analytics.gold.user_profiles", - "primary_key_columns": ["user_id"], - "scheduling_policy": "TRIGGERED", - "new_pipeline_spec": { - "storage_catalog": "lakebase_catalog", - "storage_schema": "staging" - } - } - }' -``` - -## Checking Synced Table Status +Status: ```python -status = w.database.get_synced_database_table(name="lakebase_catalog.schema.synced_table") -print(f"State: {status.data_synchronization_status.detailed_state}") -print(f"Message: {status.data_synchronization_status.message}") +st = w.database.get_synced_database_table( + name="lakebase_catalog.schema.synced_table" +) +state = st.data_synchronization_status.detailed_state +message = st.data_synchronization_status.message ``` -## Deleting a Synced Table - -Delete from both Unity Catalog and Postgres: - -1. **Unity Catalog:** Delete from Catalog Explorer or SDK -2. **Postgres:** Drop the table to free storage +Deletion cleanup: +1. Delete the synced table / UC object. +2. Drop the Postgres target table if needed to free Lakebase storage. 
```sql -DROP TABLE your_database.your_schema.your_table; +DROP TABLE schema.table; ``` -## Data Type Mapping +## Type mapping -| Unity Catalog Type | Postgres Type | -|-------------------|---------------| +| Unity Catalog | Postgres | +|---|---| | BIGINT | BIGINT | | BINARY | BYTEA | | BOOLEAN | BOOLEAN | @@ -126,52 +109,19 @@ DROP TABLE your_database.your_schema.your_table; | MAP | JSONB | | STRUCT | JSONB | -**Unsupported types:** GEOGRAPHY, GEOMETRY, VARIANT, OBJECT - -## Capacity Planning - -- **Connection usage:** Each synced table uses up to 16 connections -- **Size limits:** 2 TB total across all synced tables; recommend < 1 TB per table -- **Naming:** Database, schema, and table names only allow `[A-Za-z0-9_]+` -- **Schema evolution:** Only additive changes (e.g., adding columns) for Triggered/Continuous modes - -## Use Cases - -### Product Catalog for Web App - -```python -w.database.create_synced_database_table( - SyncedDatabaseTable( - name="ecommerce_catalog.public.products", - spec=SyncedTableSpec( - source_table_full_name="gold.products.catalog", - primary_key_columns=["product_id"], - scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED, - ), - ) -) -``` - -### Real-time Feature Serving - -```python -w.database.create_synced_database_table( - SyncedDatabaseTable( - name="ml_catalog.public.user_features", - spec=SyncedTableSpec( - source_table_full_name="ml.features.user_features", - primary_key_columns=["user_id"], - scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS, - ), - ) -) -``` - -## Best Practices - -1. **Enable CDF** on source tables before creating Triggered or Continuous synced tables -2. **Choose appropriate sync mode**: Snapshot for small tables, Triggered for hourly/daily, Continuous for real-time -3. **Monitor sync status**: Check for failures and latency via Catalog Explorer -4. **Index target tables**: Create appropriate indexes in Postgres for your query patterns -5. **Handle schema changes**: Only additive changes are supported for streaming modes -6. **Account for connection limits**: Each synced table uses up to 16 connections +Unsupported: +- `GEOGRAPHY` +- `GEOMETRY` +- `VARIANT` +- `OBJECT` + +## Limits and gotchas + +- Up to 16 Postgres connections per synced table; include this in endpoint connection-capacity planning. +- Size limit: 2 TB total across all synced tables. +- Recommended: <1 TB per synced table. +- Database/schema/table names: `[A-Za-z0-9_]+`. +- Triggered/Continuous schema evolution: additive changes only. +- Create indexes in Postgres for application query patterns after sync. +- Monitor detailed sync state in Catalog Explorer or with `get_synced_database_table`. +- Delete synced-table dependencies before deleting the Lakebase project.
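+
+For the monitoring bullet above, a minimal polling sketch; it reuses the `lakebase_catalog.schema.synced_table` name from the earlier example, and `wait_for_sync`, its `done_states` argument, the 30-second interval, and the `"ONLINE"` marker are illustrative choices rather than fixed API values (the concrete `detailed_state` strings depend on the sync mode):
+
+```python
+import time
+
+from databricks.sdk import WorkspaceClient
+
+
+def wait_for_sync(name: str, done_states: set[str], timeout_s: int = 1800) -> str:
+    """Poll a synced table until its detailed sync state matches one of done_states."""
+    w = WorkspaceClient()
+    deadline = time.time() + timeout_s
+    while time.time() < deadline:
+        status = w.database.get_synced_database_table(name=name).data_synchronization_status
+        state = str(status.detailed_state)
+        print(f"{name}: {state} - {status.message}")
+        if any(marker in state for marker in done_states):
+            return state
+        time.sleep(30)  # polling interval; tune to your latency needs
+    raise TimeoutError(f"{name} did not reach {done_states} within {timeout_s}s")
+
+
+# Example: block until the detailed state mentions a caller-chosen marker string.
+wait_for_sync("lakebase_catalog.schema.synced_table", {"ONLINE"})
+```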