
feat: add dbt project indexer for fast model lookup#209

Open
ealexisaraujo wants to merge 5 commits into getnao:main from ealexisaraujo:feat/sync-dbt-repositories

Conversation

@ealexisaraujo (Contributor) commented Feb 16, 2026

Summary

  • New dbt indexer parses all SQL/YAML files in repos/ and generates searchable markdown index files (manifest.md + sources.md) per dbt project
  • Hooks into nao sync as a post-sync step, plus FastAPI startup, /api/refresh, and optional cron scheduler
  • Updates system prompt to instruct the agent to search dbt-index/ first before grepping raw SQL files
  • 32 unit tests covering SQL parsing, YAML parsing, project discovery, and error handling

Motivation

When users asked the agent lineage or model questions (e.g., "What is the lineage of stg_orders?" or "What dbt model creates the fact_revenue table?"), the agent had to grep through hundreds of raw SQL files in repos/. This caused several problems:

  1. Slow and unreliable — searching hundreds of files often timed out or returned incomplete results
  2. Missed downstream dependencies — the agent could find a model's ref() calls (upstream), but had no efficient way to find which other models reference it (downstream). It would have to grep every single SQL file for the model name
  3. No source-to-database mapping — users asking "What Snowflake table does source('core', 'dim_offers') point to?" had to manually correlate YAML source definitions with database schemas
  4. Repeated work — every question re-scanned the same files. There was no cached representation of the project structure

Use cases this enables

| Use case | Before | After |
| --- | --- | --- |
| "What is the lineage of model X?" | Grep hundreds of SQL files, often miss downstream refs | Grep 1 file (manifest.md), find upstream refs + downstream in one search |
| "What creates table Y?" | Grep all files hoping the model name matches | Search manifest.md for the model name, immediately get path + materialization |
| "What source feeds into model Z?" | Read the SQL file, then find the YAML source def | sources field shows source_name.table_name inline in manifest |
| "Map this source to a database table" | Manually find YAML, parse database/schema | Grep sources.md for the source name, get database + schema + tables |
| "Show me all incremental models" | Impossible without scanning every file | Grep manifest.md for materialized: incremental |

Why index at the Python (FastAPI/CLI) layer

  • FastAPI already owns the context lifecycle (refresh/startup)
  • Python has mature YAML and regex parsing (pyyaml already a dependency)
  • Output is file-based, consistent with the existing databases/ pattern
  • Both nao sync (CLI) and FastAPI (server) can import from nao_core.dbt_indexer

Architecture

Data flow

nao sync -p repositories
  │
  ├─ git clone/pull → repos/<name>/
  │
  └─ _index_dbt_projects()
       │
       ├─ find_dbt_projects()        ← scans repos/ for dbt_project.yml
       ├─ read_project_config()      ← reads project name, default materializations
       ├─ index_dbt_project()
       │    ├─ parse_yaml_sources()      ← YAML: source definitions
       │    ├─ parse_yaml_descriptions() ← YAML: model descriptions
       │    ├─ parse_sql_dependencies()  ← SQL: ref() and source() calls
       │    └─ parse_sql_config()        ← SQL: materialization config
       ├─ generate_manifest_md()     ← sorted alphabetically for grep
       └─ generate_sources_md()      ← source-to-database mapping
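The SQL-parsing step in the flow above can be sketched roughly as follows. The real `parse_sql_dependencies()` lives in `cli/nao_core/dbt_indexer.py`; the regexes here are an assumption about its approach, not the actual code:

```python
import re

# Hypothetical sketch of parse_sql_dependencies(): extract {{ ref('...') }}
# model names and {{ source('...', '...') }} pairs from raw model SQL.
REF_RE = re.compile(r"{{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*}}")
SOURCE_RE = re.compile(
    r"{{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*}}"
)

def parse_sql_dependencies(sql: str) -> tuple[list[str], list[str]]:
    """Return (upstream refs, upstream sources) found in one model's SQL."""
    refs = REF_RE.findall(sql)
    sources = [f"{name}.{table}" for name, table in SOURCE_RE.findall(sql)]
    return refs, sources
```

The per-model results are what `generate_manifest_md()` turns into the `refs:` and `sources:` fields shown later in the example output.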

Output (generated in user's project folder)

<project>/
├── dbt-index/
│   └── <repo-name>/
│       ├── manifest.md    (grep-friendly, one entry per model)
│       └── sources.md     (source-to-database mapping)
├── repos/                 (raw git repos — still needed for full SQL)
└── databases/             (database schema docs)

Trigger points (4 ways to run)

| Trigger | When | Where |
| --- | --- | --- |
| nao sync | After repos are cloned/pulled | cli/nao_core/commands/sync/__init__.py |
| FastAPI startup | When nao chat starts | main.py lifespan |
| POST /api/refresh | Manual or webhook trigger | main.py refresh endpoint |
| Cron scheduler | NAO_REFRESH_SCHEDULE env var | main.py APScheduler |

Example output

manifest.md (grep-friendly)

### stg_core__dim_products
- **path:** models/staging/core/stg_core__dim_products.sql
- **materialized:** view
- **refs:**
- **sources:** core.dim_products

### int_orders__enriched
- **path:** models/transform/orders/int_orders__enriched.sql
- **materialized:** incremental
- **refs:** stg_core__dim_products, stg_core__fact_orders
- **sources:**

The agent greps manifest.md for a model name and instantly finds:

  • Upstream: sources and refs fields
  • Downstream: grep for the model name in refs: lines of other models

Files changed

| File | Change |
| --- | --- |
| cli/nao_core/dbt_indexer.py | New — Core indexer implementation (single source of truth): SQL/YAML parsing, manifest/sources generation |
| apps/backend/fastapi/dbt_indexer.py | New — Re-export shim so FastAPI imports work |
| apps/backend/fastapi/main.py | Hook indexer into startup, refresh endpoint, and scheduler |
| apps/backend/src/agents/user-rules.ts | Add indexed field to Repository type, detect dbt-index/ |
| apps/backend/src/components/system-prompt.tsx | Tell agent to search dbt-index/ first; show indexed status |
| cli/nao_core/commands/sync/__init__.py | Run indexer after nao sync -p repositories |
| apps/backend/fastapi/test_dbt_indexer.py | New — 32 unit tests |

Performance

  • Tested on a real dbt project with 694 models + 129 YAML files — indexed in 0.83 seconds
  • Handles edge cases: directories named .sql, binary files, Jinja in YAML, malformed YAML

Test plan

  • python -m pytest apps/backend/fastapi/test_dbt_indexer.py -v — 32 tests pass
  • make lint (cli/) — ty, ruff check, ruff format all pass
  • npm run lint — TypeScript lint passes (0 errors)
  • Run nao sync -p repositories on real project — dbt-index/ generated correctly
  • Run nao chat — agent searches dbt-index/**/manifest.md first (verified)
  • Test lineage question — agent finds upstream + downstream from manifest
  • POST /api/refresh — returns 200, triggers re-indexing
  • Verify sources.md contains correct source-to-database mappings

Fixes #210

🤖 Generated with Claude Code

@MatLBS (Contributor) commented Feb 16, 2026

Hi ealexisaraujo 👋, thank you so much for your PR,

Overall the idea is really good but I am not sure about its implementation.

First, revert the package-lock.json...

I am convinced that your dbt_indexer is useful when running nao sync, but I am not sure about the implementation in backend/fastapi. I don't understand the purpose of the dbt_indexer.py in the fastapi folder.

Then, about the system prompt, I don't think it is useful to list all the repositories.
The first section you added already explains to search for manifest.md in dbt-index/

@Standlc what do you think ?

@Bl3f (Contributor) commented Feb 16, 2026

I will add a deeper review later today. I think we need this, and it needs to be thought through carefully to decide where it sits.

aaraujodata and others added 4 commits February 16, 2026 09:33
Add repository discovery to detect synced git repos and whether they contain dbt projects, and expose that list to the system prompt so the assistant can search repos for dbt models and lineage. Also add ignore rules for common dbt generated folders to reduce noise during sync.
Parse all dbt SQL/YAML files in repos/ and generate searchable markdown
index files (manifest.md + sources.md) per repository. The agent reads
a single manifest.md instead of grepping through 694+ raw SQL files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .context directory is fork-specific (local .git/info/exclude), not
part of the upstream project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert package-lock.json to upstream (unintentional metadata changes)
- Delete unused FastAPI dbt_indexer.py re-export shim
- Remove "Synced Repositories" section from system prompt
- Remove indexed field detection logic from user-rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ealexisaraujo force-pushed the feat/sync-dbt-repositories branch from 47b3a59 to 244742b on February 16, 2026 at 16:36

@ealexisaraujo (Contributor, Author) commented:

Thanks for the review @MatLBS!

Pushed changes addressing your feedback:

  1. package-lock.json — Reverted. Those were unintentional metadata changes (dev/devOptional).

  2. FastAPI dbt_indexer.py — Deleted. It was an unused re-export shim — main.py already imports directly from nao_core.dbt_indexer. The hooks in main.py (startup, scheduler, refresh) are the actual integration point — they keep the dbt-index fresh when repos change during active chat sessions via webhooks/scheduler.

  3. Repository listing in system prompt — Removed. Agreed, the dbt-index instructions in "How nao Works" are sufficient for the agent to know what to do.

Also rebased on latest main to pick up skills (#201) and sync-observability (#207).

Happy to discuss the FastAPI hooks architecture further once @Bl3f has a chance to do a deeper review.

@Bl3f (Contributor) left a review:
Hey, thank you so much for the contribution; to be honest, this is great. There are a few things to change, I think, and a few parts to refactor a bit or remove. The main question is around manifest.md, which might be a large file: we need to think about how to mitigate this so it does not overflow users' context windows.

print(f"[dbt-indexer] Warning: failed to parse sources from {yaml_path}: {e}")

# Second pass: parse SQL models
for sql_path in models_dir.rglob("*.sql"):
Contributor:

I think the three for-loops here could be factorized, which would make this code much easier to read.

Contributor Author:
The YAML loops are now 1 loop (via chain). The SQL loop stays separate since it depends on descriptions collected from the YAML pass — these are sequential by design (YAML first collects descriptions, then SQL uses them).

print(f"[dbt-indexer] Indexed {repo_name}: {len(models)} models, {len(sources)} sources")


# ---------------------------------------------------------------------------
Contributor:

All these comments can be removed; the function names are enough.

Contributor Author:
Done — removed all separator blocks and section headers.

};

export function getRepositories(): Repository[] | null {
const projectFolder = env.NAO_DEFAULT_PROJECT_PATH;
Contributor:
This is legacy and should not be used (I know getConnections does it, but it shouldn't). We should be given the project information from the agent when building the system prompt; this way we are more functional and create the needed project isolation.

Contributor Author:
Agreed — both getRepositories() and getConnections() read the filesystem directly from env vars instead of receiving project context. Kept getRepositories() minimally for the hasDbtProjects conditional in the system prompt, but the broader refactor toward injected project isolation makes total sense. Happy to contribute to that in a follow-up — would be great to align on the target architecture for how project context flows into the system prompt.

from rich.console import Console

from nao_core.config import NaoConfig
from nao_core.dbt_indexer import index_all_projects
Contributor:
Maybe we should add an option to the sync command to deactivate this indexing behaviour (or it might be an option on the repo configuration in nao_config.yaml), like:

is_dbt_indexed: True | False

Contributor Author:
Good idea. I think this fits well as a follow-up once the core indexer is stable — an index_dbt: bool flag in repo config or a CLI flag like nao sync --skip-indexing. It would also pair nicely with PR #213's compile_dbt_docs flag, giving users fine-grained control over what happens during sync.

Open to implementing this in a follow-up PR if you'd like.

Contributor:
I think the best would be an option in the yaml like compile_dbt_docs, though the naming should be uniform between the two flags. In any case, I don't think it should be a CLI sync option at the main level.

dbtProjectPath = `repos/${entry.name}`;
} else if (existsSync(subDbtProject)) {
hasDbtProject = true;
dbtProjectPath = `repos/${entry.name}/dbt`;
Contributor:
Not generic enough here either.

Contributor Author:
Agreed — this ties into the broader getRepositories() refactor toward injected project context. Deferring this to a follow-up since it affects other consumers beyond our PR scope. Happy to contribute to that refactor when the direction is decided.

"---",
]

for model in models:
Contributor:
I'm a bit concerned about the size of the generated manifest file. You said you ran it on a ~700-model dbt project; the issue is that each model entry is roughly ~60 tokens, which leads to a file with ~37k characters. If the LLM decides to read it and add it to the context, it will overflow the context window very fast.

Maybe we can think of a better structure (and not a single file, to avoid this case). On the other hand, we also have to work on a better system prompt for this, given that files can exceed the context window and be read using a range (PR #193 does it).

Contributor Author:
Great point — this is the key design question. For our test project (694 models), the manifest is ~37k chars (~42k tokens). A few options I see:

  1. Single file + grep + range reads — if PR #193 (Read tool: add start line offset and limit) lands, the agent can grep for a model name then read just that section. The current H3 headers (### model_name) are already grep-friendly. This keeps things simple and composable.
  2. Split by directory — one manifest file per models/ subdirectory (e.g., staging/manifest.md, marts/manifest.md). Smaller files, but more to grep across.
  3. Table-of-contents header — a lightweight index at the top mapping model names to line numbers, so the agent reads only the TOC first and targets specific ranges.

I lean toward option 1 since it's the simplest and leverages existing infrastructure, but open to whatever works best for the overall system. What's your preferred direction?
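Option 3 could look roughly like this. The TOC layout below is a hypothetical proposal, not anything implemented in the PR; it assumes the `### model_name` entry format shown in the example output:

```python
# Hypothetical sketch of a TOC header for manifest.md: map each model name to
# the 1-based line where its entry starts in the final file, so the agent can
# read only the TOC and then do a targeted range read.
def add_toc(manifest_lines: list[str]) -> list[str]:
    entries = [(line[4:], i) for i, line in enumerate(manifest_lines)
               if line.startswith("### ")]
    # Final layout: TOC title + one line per entry + blank separator, then the
    # original manifest, so every entry shifts down by header_len lines.
    header_len = 1 + len(entries) + 1
    toc = ["## TOC (model -> line)"]
    toc += [f"- {name}: line {i + header_len + 1}" for name, i in entries]
    return toc + [""] + manifest_lines
```

The agent would read the first N lines (the TOC), then issue a single range read for the model it cares about.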

if not projects:
return

index_root = project_folder / "dbt-index"
Contributor:
I don't know if dbt-index is a great name. What about dbt-projects, dbt-compiled, or even a name without a hyphen, which is usually not so good for folder names?

Contributor Author:
Open to renaming. dbt_projects (underscore, no hyphen) would be more consistent with Python conventions. dbt_index or dbt_compiled are also options. What's your preference? Happy to adjust to whatever naming convention fits the project best.

@Bl3f (Contributor) commented Feb 16, 2026

Also, there is a small side effect with PR #213, as it generates the manifest.json, which includes a lot of the data needed to build the manifest.md. We should decide what's the best solution. IMHO #213 is great and a bit simpler, but it requires the context builder to have the full dbt setup to run the dbt command, which can be a pain.

- Factorize duplicated .yml/.yaml loops using itertools.chain
- Remove section separator comments (function names suffice)
- Replace print() with UI class methods (UI.success, UI.warn, UI.error)
- Flexible dbt project search (depth-based glob, not hardcoded "dbt/")
- Move dbt indexing from generic sync to RepositorySyncProvider
- Remove all dbt indexing hooks from FastAPI main.py
- Make dbt system prompt instructions conditional on hasDbtProjects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ealexisaraujo (Contributor, Author) commented:

Thanks for the thorough review @Bl3f — really appreciate the depth here. Pushed a refactoring commit addressing the actionable items. Here's the full breakdown:

Changes made

Code quality (dbt_indexer.py):

  • Factorized the duplicated .yml/.yaml loops into a single pass using itertools.chain
  • Removed all section separator comments (# -----) — function names are self-documenting
  • Replaced all print() calls with UI.success(), UI.warn(), UI.error() from nao_core.ui

Architecture:

  • Moved indexing into RepositorySyncProvider — no longer called from the generic sync/__init__.py. Indexing is now a post-sync step inside the repos provider, which is where it belongs since it's directly tied to repository content changes.
  • Removed all dbt indexing from FastAPI main.py — no startup hook, no scheduler hook, no refresh hook. Indexing happens exclusively via nao sync now.

Robustness:

  • Flexible dbt project discovery — replaced the hardcoded entry / "dbt" check with a depth-based glob search (_find_dbt_project_yml) that scans up to 3 levels deep. Handles repos where the dbt project lives in any subdirectory.
  • Conditional dbt prompt instructions — dbt-index guidance in system-prompt.tsx is now wrapped in {hasDbtProjects && (...)}. Users without dbt projects won't see misleading dbt-specific instructions.

Open for discussion

1. Manifest file size (the key question):

You're right that ~37k chars for 700 models is a concern. A few options I see:

  • Split by directory — one manifest per models/ subdirectory (staging/manifest.md, marts/manifest.md). Smaller files, but more to grep across.
  • Keep single file + grep + range reads — if PR #193 (Read tool: add start line offset and limit) lands, the agent can grep for a model name, then read just that section. The current H3 headers (### model_name) make this grep-friendly.
  • Add a TOC header — lightweight index at the top mapping model names to line numbers, so the agent reads only the TOC first.

I'm open to whichever direction fits best with the system's overall design. What's your preference?

2. Regarding PR #213 and our regex approach:

I think these approaches are complementary rather than competing:

One path: use our indexer as the default (works out of the box), and when compile_dbt_docs: true is set (#213), prefer the compiled manifest.json. But that adds complexity — open to other ideas on how these should coexist, or if one approach should be the winner.

3. Legacy getRepositories() pattern:

Agreed that getRepositories() and getConnections() reading the filesystem directly is legacy. Kept it minimally for the hasDbtProjects conditional check. The broader refactor toward injected project context is a separate effort — happy to help with that in a follow-up.

4. Config option & folder naming:

  • index_dbt: bool in repo config — makes sense as a follow-up once the core is stable.
  • Folder naming: open to renaming dbt-index to dbt_projects, dbt_index, or anything else. What sounds right to you?

Happy to iterate further on any of these!

@cubic-dev-ai (bot) left a review:

2 issues found across 6 files

<file name="apps/backend/fastapi/test_dbt_indexer.py">

<violation number="1" location="apps/backend/fastapi/test_dbt_indexer.py:119">
P2: Temp file leak: `NamedTemporaryFile(delete=False)` is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's `tmp_path` fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the `TemporaryDirectory` usage elsewhere in this file).</violation>
</file>

<file name="cli/nao_core/dbt_indexer.py">

<violation number="1" location="cli/nao_core/dbt_indexer.py:228">
P1: Bug: root-level default materialization from `dbt_project.yml` is silently discarded. The `pass` on this line means the project-wide `+materialized` value is never stored, so `_resolve_default_materialization`'s fallback to `_root_` always returns `None`. Models without explicit config or a directory-level default will show no materialization in the manifest.</violation>
</file>


def _walk_materialization_config(node: dict, result: dict[str, str]) -> None:
mat = node.get("+materialized") or node.get("materialized")
if isinstance(mat, str):
pass
@cubic-dev-ai (bot) commented Feb 16, 2026:
P1: Bug: root-level default materialization from dbt_project.yml is silently discarded. The pass on this line means the project-wide +materialized value is never stored, so _resolve_default_materialization's fallback to _root_ always returns None. Models without explicit config or a directory-level default will show no materialization in the manifest.

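One possible fix for the P1 above, assuming the walker recurses into nested config dicts and that `_resolve_default_materialization` falls back to a `_root_` key as the comment describes. This is a sketch of the correction, not the PR's actual code:

```python
# Corrected walker sketch: record the materialization at the current config
# path instead of discarding it. The project-wide default lands under the
# hypothetical "_root_" key; nested dicts become path-like keys.
def walk_materialization_config(
    node: dict, result: dict[str, str], path: str = "_root_"
) -> None:
    mat = node.get("+materialized") or node.get("materialized")
    if isinstance(mat, str):
        result[path] = mat  # was `pass`: the value was silently dropped
    for key, value in node.items():
        if isinstance(value, dict) and not key.startswith("+"):
            child = key if path == "_root_" else f"{path}/{key}"
            walk_materialization_config(value, result, child)
```

With this change, models without an explicit config or directory-level default would inherit the project-wide `+materialized` value from dbt_project.yml.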

import tempfile
@cubic-dev-ai (bot) commented Feb 16, 2026:
P2: Temp file leak: NamedTemporaryFile(delete=False) is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's tmp_path fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the TemporaryDirectory usage elsewhere in this file).

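The suggested rewrite might look like this, with `tmp_path` replacing `NamedTemporaryFile(delete=False)`. The parser here is a minimal stand-in for illustration, since the real tests exercise the indexer's own YAML parsing:

```python
from pathlib import Path

# Stand-in for parse_yaml_sources(): pull `name:` values out of a schema file.
# Only here so the test sketch below is self-contained.
def _source_names(path: Path) -> list[str]:
    return [
        line.split("name:", 1)[1].strip()
        for line in path.read_text().splitlines()
        if line.strip().startswith("- name:") or line.strip().startswith("name:")
    ]

# pytest injects tmp_path as a per-test temporary directory and cleans it up
# automatically, avoiding the orphaned-file leak flagged above.
def test_basic_sources(tmp_path: Path) -> None:
    schema = tmp_path / "schema.yml"
    schema.write_text(
        "sources:\n  - name: core\n    tables:\n      - name: dim_offers\n"
    )
    assert _source_names(schema) == ["core", "dim_offers"]
```

This also stays consistent with the `TemporaryDirectory` usage elsewhere in the test file.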

@Bl3f (Contributor) commented Feb 18, 2026

Last 2 changes (if I'm not mistaken) to me are:

  • add the config in the nao_config.yaml, I want users to be able to deactivate this behaviour
  • name the created folder dbt/
