
feat: add dbt project indexer for fast model lookup#209

Open
ealexisaraujo wants to merge 5 commits into getnao:main from ealexisaraujo:feat/sync-dbt-repositories

Conversation

@ealexisaraujo (Contributor) commented Feb 16, 2026

Summary

  • New dbt indexer parses all SQL/YAML files in repos/ and generates searchable markdown index files (manifest.md + sources.md) per dbt project
  • Hooks into nao sync as a post-sync step, plus FastAPI startup, /api/refresh, and optional cron scheduler
  • Updates system prompt to instruct the agent to search dbt-index/ first before grepping raw SQL files
  • 32 unit tests covering SQL parsing, YAML parsing, project discovery, and error handling

Motivation

When users asked the agent lineage or model questions (e.g., "What is the lineage of stg_orders?" or "What dbt model creates the fact_revenue table?"), the agent had to grep through hundreds of raw SQL files in repos/. This caused several problems:

  1. Slow and unreliable — searching hundreds of files often timed out or returned incomplete results
  2. Missed downstream dependencies — the agent could find a model's ref() calls (upstream), but had no efficient way to find which other models reference it (downstream). It would have to grep every single SQL file for the model name
  3. No source-to-database mapping — users asking "What Snowflake table does source('core', 'dim_offers') point to?" had to manually correlate YAML source definitions with database schemas
  4. Repeated work — every question re-scanned the same files. There was no cached representation of the project structure

Use cases this enables

| Use case | Before | After |
| --- | --- | --- |
| "What is the lineage of model X?" | Grep hundreds of SQL files, often miss downstream refs | Grep 1 file (manifest.md), find upstream refs + downstream in one search |
| "What creates table Y?" | Grep all files hoping the model name matches | Search manifest.md for the model name, immediately get path + materialization |
| "What source feeds into model Z?" | Read the SQL file, then find the YAML source def | sources field shows source_name.table_name inline in manifest |
| "Map this source to a database table" | Manually find YAML, parse database/schema | Grep sources.md for the source name, get database + schema + tables |
| "Show me all incremental models" | Impossible without scanning every file | Grep manifest.md for materialized: incremental |

Why index at the Python (FastAPI/CLI) layer

  • FastAPI already owns the context lifecycle (refresh/startup)
  • Python has mature YAML and regex parsing (pyyaml already a dependency)
  • Output is file-based, consistent with the existing databases/ pattern
  • Both nao sync (CLI) and FastAPI (server) can import from nao_core.dbt_indexer

Architecture

Data flow

nao sync -p repositories
  │
  ├─ git clone/pull → repos/<name>/
  │
  └─ _index_dbt_projects()
       │
       ├─ find_dbt_projects()        ← scans repos/ for dbt_project.yml
       ├─ read_project_config()      ← reads project name, default materializations
       ├─ index_dbt_project()
       │    ├─ parse_yaml_sources()      ← YAML: source definitions
       │    ├─ parse_yaml_descriptions() ← YAML: model descriptions
       │    ├─ parse_sql_dependencies()  ← SQL: ref() and source() calls
       │    └─ parse_sql_config()        ← SQL: materialization config
       ├─ generate_manifest_md()     ← sorted alphabetically for grep
       └─ generate_sources_md()      ← source-to-database mapping
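The SQL-parsing step in the flow above can be sketched roughly as follows. The real `parse_sql_dependencies()` lives in `cli/nao_core/dbt_indexer.py`; the regexes here are an assumption about its approach, not the actual code:

```python
import re

# Hypothetical sketch of parse_sql_dependencies(): extract {{ ref('...') }}
# model names and {{ source('...', '...') }} pairs from raw model SQL.
REF_RE = re.compile(r"{{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*}}")
SOURCE_RE = re.compile(
    r"{{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*}}"
)

def parse_sql_dependencies(sql: str) -> tuple[list[str], list[str]]:
    """Return (upstream refs, upstream sources) found in one model's SQL."""
    refs = REF_RE.findall(sql)
    sources = [f"{name}.{table}" for name, table in SOURCE_RE.findall(sql)]
    return refs, sources
```

The per-model results are what `generate_manifest_md()` turns into the `refs:` and `sources:` fields shown later in the example output.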

Output (generated in user's project folder)

<project>/
├── dbt-index/
│   └── <repo-name>/
│       ├── manifest.md    (grep-friendly, one entry per model)
│       └── sources.md     (source-to-database mapping)
├── repos/                 (raw git repos — still needed for full SQL)
└── databases/             (database schema docs)

Trigger points (4 ways to run)

| Trigger | When | Where |
| --- | --- | --- |
| nao sync | After repos are cloned/pulled | cli/nao_core/commands/sync/__init__.py |
| FastAPI startup | When nao chat starts | main.py lifespan |
| POST /api/refresh | Manual or webhook trigger | main.py refresh endpoint |
| Cron scheduler | NAO_REFRESH_SCHEDULE env var | main.py APScheduler |

Example output

manifest.md (grep-friendly)

### stg_core__dim_products
- **path:** models/staging/core/stg_core__dim_products.sql
- **materialized:** view
- **refs:**
- **sources:** core.dim_products

### int_orders__enriched
- **path:** models/transform/orders/int_orders__enriched.sql
- **materialized:** incremental
- **refs:** stg_core__dim_products, stg_core__fact_orders
- **sources:**

The agent greps manifest.md for a model name and instantly finds:

  • Upstream: sources and refs fields
  • Downstream: grep for the model name in refs: lines of other models

Files changed

| File | Change |
| --- | --- |
| cli/nao_core/dbt_indexer.py | New — Core indexer implementation (single source of truth): SQL/YAML parsing, manifest/sources generation |
| apps/backend/fastapi/dbt_indexer.py | New — Re-export shim so FastAPI imports work |
| apps/backend/fastapi/main.py | Hook indexer into startup, refresh endpoint, and scheduler |
| apps/backend/src/agents/user-rules.ts | Add indexed field to Repository type, detect dbt-index/ |
| apps/backend/src/components/system-prompt.tsx | Tell agent to search dbt-index/ first; show indexed status |
| cli/nao_core/commands/sync/__init__.py | Run indexer after nao sync -p repositories |
| apps/backend/fastapi/test_dbt_indexer.py | New — 32 unit tests |

Performance

  • Tested on a real dbt project with 694 models + 129 YAML files — indexed in 0.83 seconds
  • Handles edge cases: directories named .sql, binary files, Jinja in YAML, malformed YAML

Test plan

  • python -m pytest apps/backend/fastapi/test_dbt_indexer.py -v — 32 tests pass
  • make lint (cli/) — ty, ruff check, ruff format all pass
  • npm run lint — TypeScript lint passes (0 errors)
  • Run nao sync -p repositories on real project — dbt-index/ generated correctly
  • Run nao chat — agent searches dbt-index/**/manifest.md first (verified)
  • Test lineage question — agent finds upstream + downstream from manifest
  • POST /api/refresh — returns 200, triggers re-indexing
  • Verify sources.md contains correct source-to-database mappings

Fixes #210

🤖 Generated with Claude Code

@MatLBS (Contributor) commented Feb 16, 2026

Hi ealexisaraujo 👋, thank you so much for your PR,

Overall the idea is really good but I am not sure about its implementation.

First, revert the package-lock.json...

I am convinced that your dbt_indexer is useful when running nao sync, but I am not sure about the implementation in backend/fastapi. I don't understand the purpose of the dbt_indexer.py in the fastapi folder.

Then, about the system prompt, I don't think it is useful to list all the repositories.
The first section you added already explains to search for manifest.md in dbt-index/

@Standlc what do you think ?

@Bl3f (Contributor) commented Feb 16, 2026

I will add a deeper review later today. I think we need this, and it needs to be thought through carefully to decide where it sits.

aaraujodata and others added 4 commits February 16, 2026 09:33
Add repository discovery to detect synced git repos and whether they contain dbt projects, and expose that list to the system prompt so the assistant can search repos for dbt models and lineage. Also add ignore rules for common dbt generated folders to reduce noise during sync.
Parse all dbt SQL/YAML files in repos/ and generate searchable markdown
index files (manifest.md + sources.md) per repository. The agent reads
a single manifest.md instead of grepping through 694+ raw SQL files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .context directory is fork-specific (local .git/info/exclude), not
part of the upstream project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert package-lock.json to upstream (unintentional metadata changes)
- Delete unused FastAPI dbt_indexer.py re-export shim
- Remove "Synced Repositories" section from system prompt
- Remove indexed field detection logic from user-rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ealexisaraujo force-pushed the feat/sync-dbt-repositories branch from 47b3a59 to 244742b on February 16, 2026 at 16:36

@ealexisaraujo (Contributor, Author) commented:

Thanks for the review @MatLBS!

Pushed changes addressing your feedback:

  1. package-lock.json — Reverted. Those were unintentional metadata changes (dev/devOptional).

  2. FastAPI dbt_indexer.py — Deleted. It was an unused re-export shim — main.py already imports directly from nao_core.dbt_indexer. The hooks in main.py (startup, scheduler, refresh) are the actual integration point — they keep the dbt-index fresh when repos change during active chat sessions via webhooks/scheduler.

  3. Repository listing in system prompt — Removed. Agreed, the dbt-index instructions in "How nao Works" are sufficient for the agent to know what to do.

Also rebased on latest main to pick up skills (#201) and sync-observability (#207).

Happy to discuss the FastAPI hooks architecture further once @Bl3f has a chance to do a deeper review.

@Bl3f (Contributor) left a review:
Hey, thank you so much for the contribution; to be honest, this is great. There are a few things to change, I think, and a few parts to refactor a bit or remove. The main question is around manifest.md, which might be a large file: we need to think about how to mitigate this so it does not overflow users' context windows.

print(f"[dbt-indexer] Warning: failed to parse sources from {yaml_path}: {e}")

# Second pass: parse SQL models
for sql_path in models_dir.rglob("*.sql"):
Contributor:

I think the three for-loops here could be factorized, which would make this code much easier to read.

Contributor Author:
The YAML loops are now 1 loop (via chain). The SQL loop stays separate since it depends on descriptions collected from the YAML pass — these are sequential by design (YAML first collects descriptions, then SQL uses them).

print(f"[dbt-indexer] Indexed {repo_name}: {len(models)} models, {len(sources)} sources")


# ---------------------------------------------------------------------------
Contributor:

All these comments can be removed; the function names are enough.

Contributor Author:
Done — removed all separator blocks and section headers.

};

export function getRepositories(): Repository[] | null {
const projectFolder = env.NAO_DEFAULT_PROJECT_PATH;
Contributor:
This is legacy and should not be used (I know getConnections does it, but it shouldn't). We should be given the project information from the agent when building the system prompt; this way we are more functional and create the needed project isolation.

Contributor Author:
Agreed — both getRepositories() and getConnections() read the filesystem directly from env vars instead of receiving project context. Kept getRepositories() minimally for the hasDbtProjects conditional in the system prompt, but the broader refactor toward injected project isolation makes total sense. Happy to contribute to that in a follow-up — would be great to align on the target architecture for how project context flows into the system prompt.

from rich.console import Console

from nao_core.config import NaoConfig
from nao_core.dbt_indexer import index_all_projects
Contributor:
Maybe we should add an option to the sync command to deactivate this indexing behaviour (or it might be an option on the repo configuration in nao_config.yaml), like:

is_dbt_indexed: True | False

Contributor Author:
Good idea. I think this fits well as a follow-up once the core indexer is stable — an index_dbt: bool flag in repo config or a CLI flag like nao sync --skip-indexing. It would also pair nicely with PR #213's compile_dbt_docs flag, giving users fine-grained control over what happens during sync.

Open to implementing this in a follow-up PR if you'd like.

Contributor:
I think the best would be an option in the yaml like compile_dbt_docs, though the naming should be uniform between the two flags. In any case, I don't think it should be a CLI sync option at the main level.

dbtProjectPath = `repos/${entry.name}`;
} else if (existsSync(subDbtProject)) {
hasDbtProject = true;
dbtProjectPath = `repos/${entry.name}/dbt`;
Contributor:
Not generic enough here either.

Contributor Author:
Agreed — this ties into the broader getRepositories() refactor toward injected project context. Deferring this to a follow-up since it affects other consumers beyond our PR scope. Happy to contribute to that refactor when the direction is decided.

"---",
]

for model in models:
Contributor:
I'm a bit concerned about the size of the generated manifest file. You said you ran it on a ~700-model dbt project; the issue is that each model entry is roughly ~60 tokens, which leads to a file with ~37k characters. If the LLM decides to read it and add it to the context, it will overflow the context window very fast.

Maybe we can think of a better structure (and not a single file, to avoid this case). On the other hand, we also have to work on a better system prompt for this, given that files can exceed the context window and be read using a range (PR #193 does it).

Contributor Author:
Great point — this is the key design question. For our test project (694 models), the manifest is ~37k chars (~42k tokens). A few options I see:

  1. Single file + grep + range reads — if PR #193 (Read tool: add start line offset and limit) lands, the agent can grep for a model name then read just that section. The current H3 headers (### model_name) are already grep-friendly. This keeps things simple and composable.
  2. Split by directory — one manifest file per models/ subdirectory (e.g., staging/manifest.md, marts/manifest.md). Smaller files, but more to grep across.
  3. Table-of-contents header — a lightweight index at the top mapping model names to line numbers, so the agent reads only the TOC first and targets specific ranges.

I lean toward option 1 since it's the simplest and leverages existing infrastructure, but open to whatever works best for the overall system. What's your preferred direction?
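Option 3 could look roughly like this. The TOC layout below is a hypothetical proposal, not anything implemented in the PR; it assumes the `### model_name` entry format shown in the example output:

```python
# Hypothetical sketch of a TOC header for manifest.md: map each model name to
# the 1-based line where its entry starts in the final file, so the agent can
# read only the TOC and then do a targeted range read.
def add_toc(manifest_lines: list[str]) -> list[str]:
    entries = [(line[4:], i) for i, line in enumerate(manifest_lines)
               if line.startswith("### ")]
    # Final layout: TOC title + one line per entry + blank separator, then the
    # original manifest, so every entry shifts down by header_len lines.
    header_len = 1 + len(entries) + 1
    toc = ["## TOC (model -> line)"]
    toc += [f"- {name}: line {i + header_len + 1}" for name, i in entries]
    return toc + [""] + manifest_lines
```

The agent would read the first N lines (the TOC), then issue a single range read for the model it cares about.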

if not projects:
return

index_root = project_folder / "dbt-index"
Contributor:
I don't know if dbt-index is a great name. What about dbt-projects, dbt-compiled, or even a name without a hyphen, which is usually not so good for folder names?

Contributor Author:
Open to renaming. dbt_projects (underscore, no hyphen) would be more consistent with Python conventions. dbt_index or dbt_compiled are also options. What's your preference? Happy to adjust to whatever naming convention fits the project best.

@Bl3f (Contributor) commented Feb 16, 2026

Also, there is a small side effect with PR #213, as it generates the manifest.json, which includes a lot of the data needed to build the manifest.md. We should decide what's the best solution. IMHO #213 is great and a bit simpler, but it requires the context builder to have the full dbt setup to run the dbt command, which can be a pain.

- Factorize duplicated .yml/.yaml loops using itertools.chain
- Remove section separator comments (function names suffice)
- Replace print() with UI class methods (UI.success, UI.warn, UI.error)
- Flexible dbt project search (depth-based glob, not hardcoded "dbt/")
- Move dbt indexing from generic sync to RepositorySyncProvider
- Remove all dbt indexing hooks from FastAPI main.py
- Make dbt system prompt instructions conditional on hasDbtProjects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ealexisaraujo (Contributor, Author) commented:

Thanks for the thorough review @Bl3f — really appreciate the depth here. Pushed a refactoring commit addressing the actionable items. Here's the full breakdown:

Changes made

Code quality (dbt_indexer.py):

  • Factorized the duplicated .yml/.yaml loops into a single pass using itertools.chain
  • Removed all section separator comments (# -----) — function names are self-documenting
  • Replaced all print() calls with UI.success(), UI.warn(), UI.error() from nao_core.ui

Architecture:

  • Moved indexing into RepositorySyncProvider — no longer called from the generic sync/__init__.py. Indexing is now a post-sync step inside the repos provider, which is where it belongs since it's directly tied to repository content changes.
  • Removed all dbt indexing from FastAPI main.py — no startup hook, no scheduler hook, no refresh hook. Indexing happens exclusively via nao sync now.

Robustness:

  • Flexible dbt project discovery — replaced the hardcoded entry / "dbt" check with a depth-based glob search (_find_dbt_project_yml) that scans up to 3 levels deep. Handles repos where the dbt project lives in any subdirectory.
  • Conditional dbt prompt instructions — dbt-index guidance in system-prompt.tsx is now wrapped in {hasDbtProjects && (...)}. Users without dbt projects won't see misleading dbt-specific instructions.

Open for discussion

1. Manifest file size (the key question):

You're right that ~37k chars for 700 models is a concern. A few options I see:

  • Split by directory — one manifest per models/ subdirectory (staging/manifest.md, marts/manifest.md). Smaller files, but more to grep across.
  • Keep single file + grep + range reads — if PR #193 (Read tool: add start line offset and limit) lands, the agent can grep for a model name, then read just that section. The current H3 headers (### model_name) make this grep-friendly.
  • Add a TOC header — lightweight index at the top mapping model names to line numbers, so the agent reads only the TOC first.

I'm open to whichever direction fits best with the system's overall design. What's your preference?

2. Regarding PR #213 and our regex approach:

I think these approaches are complementary rather than competing:

One path: use our indexer as the default (works out of the box), and when compile_dbt_docs: true is set (#213), prefer the compiled manifest.json. But that adds complexity — open to other ideas on how these should coexist, or if one approach should be the winner.

3. Legacy getRepositories() pattern:

Agreed that getRepositories() and getConnections() reading the filesystem directly is legacy. Kept it minimally for the hasDbtProjects conditional check. The broader refactor toward injected project context is a separate effort — happy to help with that in a follow-up.

4. Config option & folder naming:

  • index_dbt: bool in repo config — makes sense as a follow-up once the core is stable.
  • Folder naming: open to renaming dbt-index to dbt_projects, dbt_index, or anything else. What sounds right to you?

Happy to iterate further on any of these!

@cubic-dev-ai (bot) left a review:

2 issues found across 6 files

<file name="apps/backend/fastapi/test_dbt_indexer.py">

<violation number="1" location="apps/backend/fastapi/test_dbt_indexer.py:119">
P2: Temp file leak: `NamedTemporaryFile(delete=False)` is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's `tmp_path` fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the `TemporaryDirectory` usage elsewhere in this file).</violation>
</file>

<file name="cli/nao_core/dbt_indexer.py">

<violation number="1" location="cli/nao_core/dbt_indexer.py:228">
P1: Bug: root-level default materialization from `dbt_project.yml` is silently discarded. The `pass` on this line means the project-wide `+materialized` value is never stored, so `_resolve_default_materialization`'s fallback to `_root_` always returns `None`. Models without explicit config or a directory-level default will show no materialization in the manifest.</violation>
</file>


def _walk_materialization_config(node: dict, result: dict[str, str]) -> None:
mat = node.get("+materialized") or node.get("materialized")
if isinstance(mat, str):
pass
@cubic-dev-ai (bot) commented Feb 16, 2026:
P1: Bug: root-level default materialization from dbt_project.yml is silently discarded. The pass on this line means the project-wide +materialized value is never stored, so _resolve_default_materialization's fallback to _root_ always returns None. Models without explicit config or a directory-level default will show no materialization in the manifest.

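One possible fix for the P1 above, assuming the walker recurses into nested config dicts and that `_resolve_default_materialization` falls back to a `_root_` key as the comment describes. This is a sketch of the correction, not the PR's actual code:

```python
# Corrected walker sketch: record the materialization at the current config
# path instead of discarding it. The project-wide default lands under the
# hypothetical "_root_" key; nested dicts become path-like keys.
def walk_materialization_config(
    node: dict, result: dict[str, str], path: str = "_root_"
) -> None:
    mat = node.get("+materialized") or node.get("materialized")
    if isinstance(mat, str):
        result[path] = mat  # was `pass`: the value was silently dropped
    for key, value in node.items():
        if isinstance(value, dict) and not key.startswith("+"):
            child = key if path == "_root_" else f"{path}/{key}"
            walk_materialization_config(value, result, child)
```

With this change, models without an explicit config or directory-level default would inherit the project-wide `+materialized` value from dbt_project.yml.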

import tempfile
@cubic-dev-ai (bot) commented Feb 16, 2026:
P2: Temp file leak: NamedTemporaryFile(delete=False) is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's tmp_path fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the TemporaryDirectory usage elsewhere in this file).

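The suggested rewrite might look like this, with `tmp_path` replacing `NamedTemporaryFile(delete=False)`. The parser here is a minimal stand-in for illustration, since the real tests exercise the indexer's own YAML parsing:

```python
from pathlib import Path

# Stand-in for parse_yaml_sources(): pull `name:` values out of a schema file.
# Only here so the test sketch below is self-contained.
def _source_names(path: Path) -> list[str]:
    return [
        line.split("name:", 1)[1].strip()
        for line in path.read_text().splitlines()
        if line.strip().startswith("- name:") or line.strip().startswith("name:")
    ]

# pytest injects tmp_path as a per-test temporary directory and cleans it up
# automatically, avoiding the orphaned-file leak flagged above.
def test_basic_sources(tmp_path: Path) -> None:
    schema = tmp_path / "schema.yml"
    schema.write_text(
        "sources:\n  - name: core\n    tables:\n      - name: dim_offers\n"
    )
    assert _source_names(schema) == ["core", "dim_offers"]
```

This also stays consistent with the `TemporaryDirectory` usage elsewhere in the test file.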

@Bl3f (Contributor) commented Feb 18, 2026

Last 2 changes (if I'm not mistaken) to me are:

  • add the config in the nao_config.yaml, I want users to be able to deactivate this behaviour
  • name the created folder dbt/
