Skip to content

Improve Lakebase skill: synced tables docs, apps decision gate, and connectivity#56

Merged
pkosiec merged 15 commits intomainfrom
pkosiec/lakebase-synced-tables
Apr 30, 2026
Merged

Improve Lakebase skill: synced tables docs, apps decision gate, and connectivity#56
pkosiec merged 15 commits intomainfrom
pkosiec/lakebase-synced-tables

Conversation

@pkosiec
Copy link
Copy Markdown
Member

@pkosiec pkosiec commented Apr 23, 2026

Summary

  • Synced tables docs: Replace reverse-etl.md with comprehensive synced-tables.md covering CLI syntax, sync modes, prerequisites, constraints, and worked examples
  • Apps decision gate: Add Step 0 to Development Workflow — presents Synced Tables vs Analytics comparison to the user before scaffolding any app that reads UC data
  • Lakebase SKILL.md: Add psql workflow, GRANT commands for app SPs, expanded troubleshooting
  • Connectivity: Add Python app connection pattern, DNS workaround
  • Serverless migration: New skill (Add databricks-serverless-migration skill #55)

Screenshots

Case 1:

Screenshot 2026-04-27 at 13 29 10

Case 2:

Screenshot 2026-04-27 at 13 29 11

Sample prompts

I want to build a Databricks App for our inventory team to quickly search parts by part key and see supplier
  info, retail price, and availability. Data is in Unity Catalog (samples.tpch.part and samples.tpch.partsupp).

  Prefix all created resources with pktest3.

  Can you build this end to end?
I want to build a Databricks App that shows sales trends by region. I have    
order data in Unity Catalog                                                     
  (samples.tpch.orders and samples.tpch.customer).                              
                                                                                
  Show charts of revenue by nation and order priority breakdown.                
                                                                                
  Prefix all created resources with pktest4.

Test plan

  • Agent presents comparison table for UC-reading apps and waits for user choice
  • Agent skips gate for non-UC apps (CRUD, Genie, Model Serving)
  • python3 scripts/skills.py validate passes

@pkosiec pkosiec force-pushed the pkosiec/lakebase-synced-tables branch 3 times, most recently from cfe90bc to 63dee0b Compare April 24, 2026 15:13
@pkosiec pkosiec changed the title Improve Lakebase skill: synced tables, connectivity, and best-practice alignment Improve Lakebase skill: synced tables docs, apps decision gate, and connectivity Apr 27, 2026
@pkosiec pkosiec marked this pull request as ready for review April 27, 2026 12:18
@pkosiec pkosiec requested review from a team, lennartkats-db and simonfaltum as code owners April 27, 2026 12:18
@jamesbroadhead
Copy link
Copy Markdown

Stacked a follow-up at #59 — adds the SP-role-creation block for AppKit Lakebase apps (the case where the SP fails with password authentication failed on first deploy because no Postgres role exists yet). Branched off your branch so the diff stays focused; happy to fold it in if you'd rather have it as part of this PR.

Copy link
Copy Markdown
Contributor

@keugenek keugenek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix 'synced tables' term to be more specific before merging

Comment thread scripts/skills.py
},
"databricks-apps": {
"description": "Databricks Apps development and deployment",
"description": "Databricks Apps development and deployment (evaluates analytics vs synced tables data access)",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkosiec is it clear enough for AI when you say 'synced tables' - maybe there is more desriptive term which is more specific?

Copy link
Copy Markdown
Member Author

@pkosiec pkosiec Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately I don't think so, it's the official name: https://docs.databricks.com/aws/en/oltp/instances/sync-data/sync-table - I will use "Lakebase synced tables" then.

From my testing an agent wasn't confused with the term but I agree, let's make it more clear 👍

Comment thread skills/databricks-lakebase/references/connectivity.md Outdated
Comment thread skills/databricks-lakebase/references/synced-tables.md
Comment thread manifest.json Outdated
@@ -1,12 +1,12 @@
{
"version": "2",
"updated_at": "2026-04-22T15:52:03Z",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these timings instead better be derived from git not from the contents

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Looked into deriving timestamps from git — the issue is that git log only knows about committed state, so you'd need a two-step workflow: commit skill changes first, then run the script to regenerate manifest.json, then commit again (or amend). That's friction for every skill edit.

Current st_mtime approach isn't perfect (varies across machines after clone/rebase), but validate already strips updated_at from comparison, so they're effectively cosmetic. The trade-off doesn't seem worth the workflow change right now 🤔

@keugenek
Copy link
Copy Markdown
Contributor

Nitpick Review — databricks/databricks-agent-skills PR #56

PR: #56 — Improve Lakebase skill: synced tables docs, apps decision gate, and connectivity
Author: pkosiec (Pawel Kosiec)
Base → Head: mainpkosiec/lakebase-synced-tables
Diff: 10 files, +463 / −321
Verdict:Changes requested — three blocking regressions in build/manifest tooling, plus design and content issues that should be addressed before merge.

Files reviewed (10)

File Type
manifest.json config
scripts/skills.py python
skills/databricks-apps/SKILL.md docs
skills/databricks-apps/references/appkit/jobs.md deleted (-141)
skills/databricks-apps/references/appkit/lakebase.md docs
skills/databricks-apps/references/appkit/overview.md docs
skills/databricks-lakebase/SKILL.md docs (major rewrite)
skills/databricks-lakebase/references/connectivity.md docs
skills/databricks-lakebase/references/reverse-etl.md deleted (-103)
skills/databricks-lakebase/references/synced-tables.md new (+214)

Reviewed by three parallel agents: Security, Correctness, Design.


🔴 Must Fix (CRITICAL / HIGH)

1. scripts/skills.pyiter_skill_files filter was removed without replacement

  • Severity: HIGH (Correctness CRITICAL · Design MEDIUM · Security LOW — all three reviewers flagged this)
  • Where: scripts/skills.py:98 (get_skill_updated_at) and :183 (generate_manifest)
  • Issue: The deleted helper filtered out dot-prefixed paths (.DS_Store, .git), __pycache__/, and *.pyc. The replacement uses bare rglob("*") at both call sites. On any macOS dev machine, Finder will drop .DS_Store into skill directories and they will (a) get added to the manifest's files list, (b) bump updated_at non-deterministically, and (c) potentially expose accidentally-committed dotfiles (e.g. .env) in the published manifest.
  • Fix: Restore the filter inline or via a helper:
    if any(part.startswith('.') for part in f.relative_to(skill_dir).parts):
        continue
    if f.suffix == '.pyc' or '__pycache__' in f.parts:
        continue

2. manifest.jsonupdated_at timestamps rolled backward across multiple skills

  • Severity: HIGH (Correctness HIGH · Design HIGH)
  • Where: Top-level and at least 6 skills (databricks-apps 2026-04-28 → 2026-04-27; databricks-core, databricks-dabs, databricks-jobs, databricks-model-serving, databricks-pipelines all 2026-04-28 → 2026-04-23)
  • Issue: The manifest appears to have been regenerated against a stale checkout, so skills not changed in this PR now claim to be older than they were on main. CI's validate mode strips updated_at before comparison, so it won't catch this. Any consumer that uses these timestamps for cache-busting will incorrectly believe the skills are stale.
  • Fix: Regenerate manifest.json against the actual branch HEAD (after fixing finding Refactor: move skills to skills/ directory #1). Confirm no unrelated updated_at values move backward versus main.

3. scripts/skills.py:47–50databricks-proto-first entry added but skill directory does not exist

  • Severity: HIGH (Correctness HIGH · Design MEDIUM)
  • Where: SKILL_METADATA dict
  • Issue: New entry "databricks-proto-first" has no matching skills/databricks-proto-first/ directory. The generator only validates the on-disk → metadata direction (raises if a directory has no metadata) but not metadata → on-disk, so the dead entry is silently dropped today. This will cause confusion when the skill is added (silent conflict or surprising activation) and is misleading to future contributors.
  • Fix: Either commit the skills/databricks-proto-first/ directory in this PR or remove the dict entry. Add an assertion that every SKILL_METADATA key resolves to an existing directory.

4. skills/databricks-apps/SKILL.md Decision Gate placement

  • Severity: HIGH (Design)
  • Where: SKILL.md:60–76
  • Issue: The new Decision Gate sits inside the "Development Workflow" section, after the "Required Reading by Phase" table and "Generic Guidelines". An LLM that reads the skill top-down will hit scaffolding instructions (~line 20) before encountering the gate (line 60), defeating the gate's purpose.
  • Fix: Promote the Decision Gate to a top-level section directly after the skill description, or at minimum add an inline **⚠️ Decision Gate required — see below** to the "Scaffolding" row of the Required Reading table.

5. skills/databricks-apps/references/appkit/lakebase.mdsynced-tables.md content duplicated

  • Severity: HIGH (Design)
  • Where: appkit/lakebase.md:180–193 vs databricks-lakebase/references/synced-tables.md (whole file)
  • Issue: The "Use synced tables when / Do NOT use" guidance is duplicated in two skills and will drift. Single source of truth should be the lakebase skill.
  • Fix: In appkit/lakebase.md, replace the duplicated lists with a one-line pointer ("see databricks-lakebase/references/synced-tables.md for full guidance") and keep only the AppKit-integration-specific notes.

🟡 Should Fix (MEDIUM)

  • [MEDIUM/security] connectivity.md:155–159resolve_host() passes raw dig +short first line to hostaddr without IP validation; CNAME, warning text, or empty output silently flows through. Fix: validate with socket.inet_pton(...); raise ValueError otherwise.
  • [MEDIUM/security] connectivity.md:84 — Pattern 3 promotes export LAKEBASE_PG_URL="postgresql://user:password@..." (creds in shell history/process listing) without a clear "local-dev only / use .env excluded from VCS / don't put real prod creds here" callout. Fix: add an explicit warning + .env/pg_service.conf alternative.
  • [MEDIUM/correctness] databricks-lakebase/SKILL.md:233 — Bash snippet parses CLI JSON via ['status']['hosts']['host'] while Python examples (connectivity.md:22, 33) use the SDK attribute path. The CLI JSON shape isn't verified in the docs; if hosts is an array or named differently, the snippet fails silently. Fix: verify against databricks postgres get-endpoint -o json and add a comment showing the expected shape.
  • [MEDIUM/correctness] databricks-apps/SKILL.md "When to Use What" routing table — has rows for Analytics, Lakebase CRUD, Genie, Model Serving but no row for Synced Tables / low-latency UC reads, which the new Decision Gate routes to. Fix: add a bullet pointing low-latency UC reads at --features lakebase + databricks-lakebase skill.
  • [MEDIUM/design] appkit/lakebase.md:228–230 — two consecutive callout blocks repeat the same instruction ("use databricks postgres create-synced-table, see lakebase skill, grant SP"). Fix: delete one.
  • [MEDIUM/design] appkit/overview.md:11–16 — two --features lakebase rows (Synced Tables vs Lakebase OLTP) with identical Init command but different --set requirements. An LLM cannot disambiguate them from the table alone. Fix: include the full init command (with required --set flags) in each row, or annotate that OLTP requires extra flags.

🟢 Consider (LOW / NIT)

  • [LOW/security] appkit/lakebase.md:193 — UC FGAC bypass for synced tables is buried in a "Do NOT use" bullet list. Promote to a dedicated > **Security note** callout in the "How It Works" section.
  • [LOW/security] databricks-lakebase/SKILL.md:241–243 — The example GRANT SELECT ON ALL TABLES IN SCHEMA public … ALTER DEFAULT PRIVILEGES grants the SP read on every current and future object in public, which is broader than most users likely intend. Recommend a dedicated schema for synced tables.
  • [LOW/correctness] appkit/lakebase.md:463, :484 — references the lakebase SKILL.md "Grant app SP access to synced tables" section, but that text isn't an ##/### heading — it's a bold label inside "Other Workflows". Anchor links won't resolve. Fix: promote to a real heading.
  • [LOW/design] databricks-lakebase/SKILL.md:3 — frontmatter description is ~253 chars, ~4× the length of sibling skills (databricks-jobs ~85, databricks-pipelines ~90). Trim to a single sentence; routing models don't benefit from prose.
  • [LOW/design] databricks-apps/SKILL.md:72 — introduces the novel pattern *Agent notes (do not show to user):* with no precedent or convention defined elsewhere. Establish the convention (scope, who honors it) or drop it.
  • [NIT/design] Deletion of references/appkit/jobs.md (141 lines) leaves no in-skill answer to "trigger/monitor a Lakeflow job from my app". Add a 10-line stub or a one-line pointer in overview.md to npx @databricks/appkit docs Jobs plugin + the databricks-jobs skill.
  • [NIT/design] databricks-lakebase/SKILL.md:28 says "Synced tables (reverse ETL)" while synced-tables.md:5 correctly says "Previously known as Reverse ETL." Use the deprecated-name framing in both places, or drop the parenthetical from SKILL.md.

What was checked and found clean

  • No hardcoded credentials, no eval/exec/shell=True, no yaml.load without SafeLoader, no pickle, no sslmode=disable patterns.
  • subprocess.run calls in scripts/skills.py use list form and don't incorporate user input.
  • All files listed in manifest.json exist on disk; no phantom entries for deleted jobs.md / reverse-etl.md.
  • No remaining markdown links to the deleted reference files.
  • Cross-skill relative path ../../../databricks-lakebase/references/synced-tables.md (appkit/lakebase.md:228) resolves on disk.
  • SQL in synced-tables.md examples is valid Postgres syntax; CLI commands use the new databricks postgres create-synced-table (consistent with the Autoscaling context).
  • Autoscaling Max−Min ≤ 16 CU constraint, removed from SKILL.md, is preserved in references/computes-and-scaling.md.
  • Decision Gate's A/B conditions are mutually exclusive; the "skip for pure CRUD" escape hatch is logically sound.

Reviewed in parallel by Security · Correctness · Design agents (Sonnet 4.6) on 2026-04-30.
Diff: 1103 lines, 55KB, against origin/main.

pkosiec and others added 14 commits April 30, 2026 12:43
- Add Pattern 4 (Python Databricks App with platform-injected env vars) to connectivity.md
- Clarify connection patterns are Python-specific; JS/AppKit apps get full auto-injection
- Remove duplicate Data API / Reverse ETL pointers already covered in reference docs block
- Fix missing blank lines before section headers in SKILL.md

Co-authored-by: Isaac
- Rename reverse-etl.md to synced-tables.md (official Databricks terminology)
- Replace all `databricks database` commands with `databricks postgres`
  (old Provisioned API → new Autoscaling API)
- Add `new_pipeline_spec` as required field with warning: storage_catalog
  must be a regular UC catalog, not the Lakebase catalog (DLT can't create
  event logs in Postgres-backed schemas)
- Add Prerequisites section for `create-catalog`
- Add complete field reference table
- Add NYC taxi example (end-to-end lifecycle)
- Add synced tables section to apps lakebase.md (read-only pattern,
  permission grants, differences from CRUD tables)
- All commands verified against live Lakebase project

Co-authored-by: Isaac
- Fix broken relative path in appkit/lakebase.md (../../ → ../../../)
- Trim duplicate autoscaling section in SKILL.md (defer to computes-and-scaling.md)
- Restore Data API and Synced Tables links in Other Workflows section
- Add "Previously known as Reverse ETL" to synced-tables.md for discoverability

Co-authored-by: Isaac
…sync troubleshooting

- Expand SKILL.md description with trigger words (synced tables, reverse ETL,
  Data API, scale-to-zero, OAuth, connection pooling) for better skill activation
- Switch to 3rd person voice per create-skill best practices
- Replace feature status table with bulleted Capabilities section
- Add 3 synced-table troubleshooting entries (storage_catalog, CDF, permissions)
- Update manifest description to match

Co-authored-by: Isaac
- Add DNS resolution (macOS) workaround to connectivity.md (was referenced from SKILL.md but missing)
- Rewrite synced table read example as tRPC route in appkit/lakebase.md (was Express-style, contradicting tRPC guidance)
- Deduplicate create-catalog command in synced-tables.md NYC taxi example
- Remove redundant storage_catalog warning callout (already in field table)
- Clarify that delete-synced-table leaves the Postgres table behind

Co-authored-by: Isaac
Add use-case guidance to help AI assistants discover synced tables for
the right patterns: operational consoles, user-facing apps on analytical
data, online feature serving, hybrid read/write, and Postgres-specific
capabilities. Include anti-patterns (OLAP, writes, huge tables, FGAC).

- Broaden Lakebase row in decision table to mention synced data
- Add callout below table pointing to synced tables section
- Add architecture pattern, use cases, and anti-patterns

Co-authored-by: Isaac
Add a bullet for "Read lakehouse data with low latency" that points
agents to the Lakebase guide when apps need fast point lookups. Add
guidance note for agents to ask about latency requirements rather than
defaulting to DBSQL for all read-from-UC use cases.

Co-authored-by: Isaac
Agent testing showed the "When to Use What" section routes to analytics
by default, even when the user explicitly asks for "fast" or "instant"
reads. The synced tables bullet was buried after three analytics bullets
that matched first.

- Add decision gate at top of section: if user mentions speed/latency,
  present both analytics and synced tables options before proceeding
- Rewrite guidance note to reference the gate instead of defaulting
- Remove redundant standalone synced tables bullet (covered by gate)

Co-authored-by: Isaac
…idance

From agent testing:
- Add psql connection workflow (generate-credential + psql) for running
  GRANT, CREATE INDEX, etc. against Lakebase
- Add post-deploy GRANT workflow for app SP access to synced tables
- Add source table guidance: don't run ad-hoc CREATE TABLE AS SELECT
- Add DAB caveat: synced_database_tables maps to Provisioned API, not
  Autoscaling. DAB support for postgres_synced_tables not yet available.
- Add creation + GRANT note to apps lakebase.md

Co-authored-by: Isaac
- Fix psql sslmode syntax (connection string instead of broken positional arg)
- Add scriptable psql workflow for agent automation
- Add endpoint path note for generate-database-credential
- Align GRANT SQL between SKILL.md and apps/lakebase.md (add ALTER DEFAULT PRIVILEGES)
- Surface DABs caveat in SKILL.md troubleshooting table
- Broaden latency decision gate triggers in apps SKILL.md
- Add "sync gold tables, not raw" as top best practice
- Clarify 8 TB storage limit scope (per branch)

Co-authored-by: Isaac
Restructure the data access decision in databricks-apps SKILL.md:
- Embed as Step 0 inside Development Workflow (not a separate section)
- Always present Analytics vs Synced Tables comparison table for UC reads
- Add hybrid apps note (--features analytics,lakebase)
- Remove keyword-scanning logic that agents rationalized past
- Add constraints section and app access cross-ref to synced-tables.md
- Remove duplicate content across reference files

Co-authored-by: Isaac

Co-authored-by: Orca <help@stably.ai>
- Swap column order: Synced Tables is now option (A) for primacy
- Strengthen instruction before the table: "you MUST show the table"
  and "Do not recommend one option over the other"
- Reframe "Setup complexity" → "Setup", "Warehouse required" →
  "Always available" (positive framing for synced tables)
- Add "operational apps" to synced tables Best For row

Co-authored-by: Isaac

Co-authored-by: Orca <help@stably.ai>
- Simplify comparison table to 3 user-facing rows (Speed, Best for, How it works)
- Move scaffold commands and links to agent-only notes section
- Add contextual recommendation: lean toward Synced Tables (A) unless
  clearly aggregation/dashboard use case
- Remove negative framing ("trade-off", "setup complexity")

Co-authored-by: Isaac

Co-authored-by: Orca <help@stably.ai>
- Remove duplicate GRANT SQL from apps/lakebase.md, keep canonical
  location in lakebase/SKILL.md with cross-reference
- Rename "Step 0" to "Data Access Decision Gate" for clarity since
  it's conditional (only applies when app reads UC data)
- Cast PGPORT env var to int in connectivity.md Pattern 4

Co-authored-by: Isaac

Co-authored-by: Orca <help@stably.ai>
@pkosiec pkosiec force-pushed the pkosiec/lakebase-synced-tables branch from ae8e36f to ea7dc03 Compare April 30, 2026 10:44
- Strengthen Decision Gate callout in Required Reading table
- Deduplicate synced tables guidance in appkit/lakebase.md (pointer to lakebase skill)
- Merge duplicate callout blocks into single block
- Add IP validation to resolve_host() in connectivity.md
- Add local-dev-only warning for Pattern 3 credentials
- Add synced tables row to "When to Use What" routing table
- Add expected JSON shape comment to scriptable CLI snippet
- Disambiguate overview.md lakebase rows with "(no --set flags needed)"
- Trim lakebase frontmatter description to one sentence
- Normalize "reverse ETL" phrasing across files
- Drop *Agent notes* pattern, use plain label
- Add dedicated-schema recommendation for synced tables grants

Co-authored-by: Isaac
@pkosiec pkosiec force-pushed the pkosiec/lakebase-synced-tables branch from 2373085 to bd9612f Compare April 30, 2026 11:02
@pkosiec pkosiec merged commit 5e839c4 into main Apr 30, 2026
1 check passed
jamesbroadhead added a commit that referenced this pull request Apr 30, 2026
Adds the SQL block (with `databricks_create_role` + DML grants) needed
once after the first deploy of an AppKit Lakebase app — without it the
SP gets `password authentication failed`. Promotes `databricks psql` as
the runnable form of #56's manual `generate-database-credential` recipe.
Also flags that the `databricks postgres create-role` CLI rejects every
SP-role payload, so agents stop trying to use it.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants