feat: add dbt project indexer for fast model lookup #209
ealexisaraujo wants to merge 5 commits into getnao:main
Conversation
Hi ealexisaraujo 👋, thank you so much for your PR. Overall the idea is really good, but I am not sure about its implementation. First, revert the package-lock.json... I am convinced that your dbt_indexer is useful when running nao_sync, but I am not sure about the implementation in backend/fastapi: I don't understand the dbt_indexer.py in the fastapi folder. Then, about the system prompt, I don't think it is useful to list all the repositories. @Standlc, what do you think?
I will add a deeper review later today. I think we need this, and it needs to be thought through carefully to find where it sits.
Add repository discovery to detect synced git repos and whether they contain dbt projects, and expose that list to the system prompt so the assistant can search repos for dbt models and lineage. Also add ignore rules for common dbt generated folders to reduce noise during sync.
Parse all dbt SQL/YAML files in repos/ and generate searchable markdown index files (manifest.md + sources.md) per repository. The agent reads a single manifest.md instead of grepping through 694+ raw SQL files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .context directory is fork-specific (local .git/info/exclude), not part of the upstream project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert package-lock.json to upstream (unintentional metadata changes)
- Delete unused FastAPI dbt_indexer.py re-export shim
- Remove "Synced Repositories" section from system prompt
- Remove indexed field detection logic from user-rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 47b3a59 to 244742b
Thanks for the review @MatLBS! Pushed changes addressing your feedback:
Also rebased on latest. Happy to discuss the FastAPI hooks architecture further once @Bl3f has a chance to do a deeper review.
Bl3f left a comment
Hey, thank you so much for the contribution; to be honest, this is great. There are a few things to change, I think: a few parts to refactor a bit or remove. The main question is around manifest.md, which might be a large file, and how we should mitigate that so we don't overflow users' context windows.
print(f"[dbt-indexer] Warning: failed to parse sources from {yaml_path}: {e}")

# Second pass: parse SQL models
for sql_path in models_dir.rglob("*.sql"):
I think the 3 for loops here could be factorized to make this code way easier to read.
The YAML loops are now 1 loop (via chain). The SQL loop stays separate since it depends on descriptions collected from the YAML pass — these are sequential by design (YAML first collects descriptions, then SQL uses them).
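For illustration, the factorized YAML pass could look like this (helper name and structure are assumed, not the PR's exact code):

```python
from itertools import chain
from pathlib import Path


def iter_schema_files(models_dir: Path):
    """Yield every .yml/.yaml schema file once, replacing two near-identical loops."""
    yield from chain(models_dir.rglob("*.yml"), models_dir.rglob("*.yaml"))
```

The SQL pass then stays a separate loop, since it consumes the descriptions this pass collects.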
cli/nao_core/dbt_indexer.py (outdated)
print(f"[dbt-indexer] Indexed {repo_name}: {len(models)} models, {len(sources)} sources")

# ---------------------------------------------------------------------------
All these comments can be removed, the function names are enough.
Done — removed all separator blocks and section headers.
};

export function getRepositories(): Repository[] | null {
  const projectFolder = env.NAO_DEFAULT_PROJECT_PATH;
This is legacy and should not be used (I know getConnections does it, but it shouldn't). We should be given the project information from the agent when building the system prompt; this way we are more functional and create the needed project isolation.
Agreed — both getRepositories() and getConnections() read the filesystem directly from env vars instead of receiving project context. Kept getRepositories() minimally for the hasDbtProjects conditional in the system prompt, but the broader refactor toward injected project isolation makes total sense. Happy to contribute to that in a follow-up — would be great to align on the target architecture for how project context flows into the system prompt.
from rich.console import Console

from nao_core.config import NaoConfig
from nao_core.dbt_indexer import index_all_projects
Maybe we should add to the sync command an option to deactivate this indexing behaviour (or it might be an option on the repo configuration in nao_config.yaml), like:
is_dbt_indexed: True | False
Good idea. I think this fits well as a follow-up once the core indexer is stable — an index_dbt: bool flag in repo config or a CLI flag like nao sync --skip-indexing. It would also pair nicely with PR #213's compile_dbt_docs flag, giving users fine-grained control over what happens during sync.
Open to implementing this in a follow-up PR if you'd like.
I think the best would be an option in the yaml like compile_dbt_docs, though naming should be uniform between the 2 flags. I don't think it should be a main-level CLI sync option, nonetheless.
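For illustration, the per-repo flag could look like this in nao_config.yaml (key names and entry shape are assumptions mirroring the compile_dbt_docs style, not a settled schema):

```yaml
repositories:
  - name: analytics                        # hypothetical repo entry
    url: git@github.com:acme/analytics.git
    compile_dbt_docs: true                 # existing flag from PR #213
    index_dbt: true                        # proposed: set false to skip indexing on sync
```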
  dbtProjectPath = `repos/${entry.name}`;
} else if (existsSync(subDbtProject)) {
  hasDbtProject = true;
  dbtProjectPath = `repos/${entry.name}/dbt`;
Not generic enough here either.
Agreed — this ties into the broader getRepositories() refactor toward injected project context. Deferring this to a follow-up since it affects other consumers beyond our PR scope. Happy to contribute to that refactor when the direction is decided.
| "---", | ||
| ] | ||
|
|
||
| for model in models: |
I'm a bit concerned about the size of the generated manifest file in the end. You said you ran it on a ~700-model dbt project; the issue is that each model entry is roughly ~60 tokens, which in the end leads to a file of ~37k characters. If the LLM decides to read it and add it to the context, it will overflow the context window very fast.
Maybe we can think of a better structure (and not a single file, to avoid this case); on the other hand, we also have to work on a better system prompt for this, given that files can get out of the context window and be read using a range (PR #193 does it).
Great point — this is the key design question. For our test project (694 models), the manifest is ~37k chars (~42k tokens). A few options I see:
- Single file + grep + range reads: if PR #193 (Read tool: add start line offset and limit) lands, the agent can grep for a model name then read just that section. The current H3 headers (`### model_name`) are already grep-friendly. This keeps things simple and composable.
- Split by directory: one manifest file per `models/` subdirectory (e.g., `staging/manifest.md`, `marts/manifest.md`). Smaller files, but more to grep across.
- Table-of-contents header: a lightweight index at the top mapping model names to line numbers, so the agent reads only the TOC first and targets specific ranges.
I lean toward option 1 since it's the simplest and leverages existing infrastructure, but open to whatever works best for the overall system. What's your preferred direction?
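As a sketch of how option 1 composes with ranged reads (assuming only the `### model_name` header convention described above, nothing about the generator's actual output):

```python
def find_model_section(manifest_text: str, model_name: str) -> str:
    """Return only the `### model_name` section of a manifest, not the whole file."""
    lines = manifest_text.splitlines()
    # A `grep -n "^### model_name"` would give this start line directly.
    start = next(i for i, line in enumerate(lines) if line == f"### {model_name}")
    # The section ends at the next H3 header, or at end of file.
    end = next(
        (j for j in range(start + 1, len(lines)) if lines[j].startswith("### ")),
        len(lines),
    )
    return "\n".join(lines[start:end])
```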
if not projects:
    return

index_root = project_folder / "dbt-index"
I don't know if dbt-index is a great name. What about dbt-projects, dbt-compiled, or even a name without a hyphen, which is usually not so good for folder names?
Open to renaming. dbt_projects (underscore, no hyphen) would be more consistent with Python conventions. dbt_index or dbt_compiled are also options. What's your preference? Happy to adjust to whatever naming convention fits the project best.
Also there is a small side effect with PR #213 as it generates the
- Factorize duplicated .yml/.yaml loops using itertools.chain
- Remove section separator comments (function names suffice)
- Replace print() with UI class methods (UI.success, UI.warn, UI.error)
- Flexible dbt project search (depth-based glob, not hardcoded "dbt/")
- Move dbt indexing from generic sync to RepositorySyncProvider
- Remove all dbt indexing hooks from FastAPI main.py
- Make dbt system prompt instructions conditional on hasDbtProjects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the thorough review @Bl3f — really appreciate the depth here. Pushed a refactoring commit addressing the actionable items. Here's the full breakdown:

Changes made

- Code quality (
- Architecture:
- Robustness:

Open for discussion

1. Manifest file size (the key question): You're right that ~37k chars for 700 models is a concern. A few options I see:

I'm open to whichever direction fits best with the system's overall design. What's your preference?

2. Regarding PR #213 and our regex approach: I think these approaches are complementary rather than competing. One path: use our indexer as the default (works out of the box), and when

3. Legacy: Agreed that

4. Config option & folder naming:

Happy to iterate further on any of these!
2 issues found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/backend/fastapi/test_dbt_indexer.py">
<violation number="1" location="apps/backend/fastapi/test_dbt_indexer.py:119">
P2: Temp file leak: `NamedTemporaryFile(delete=False)` is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's `tmp_path` fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the `TemporaryDirectory` usage elsewhere in this file).</violation>
</file>
<file name="cli/nao_core/dbt_indexer.py">
<violation number="1" location="cli/nao_core/dbt_indexer.py:228">
P1: Bug: root-level default materialization from `dbt_project.yml` is silently discarded. The `pass` on this line means the project-wide `+materialized` value is never stored, so `_resolve_default_materialization`'s fallback to `_root_` always returns `None`. Models without explicit config or a directory-level default will show no materialization in the manifest.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
def _walk_materialization_config(node: dict, result: dict[str, str]) -> None:
    mat = node.get("+materialized") or node.get("materialized")
    if isinstance(mat, str):
        pass
P1: Bug: root-level default materialization from dbt_project.yml is silently discarded. The pass on this line means the project-wide +materialized value is never stored, so _resolve_default_materialization's fallback to _root_ always returns None. Models without explicit config or a directory-level default will show no materialization in the manifest.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At cli/nao_core/dbt_indexer.py, line 228:
<comment>Bug: root-level default materialization from `dbt_project.yml` is silently discarded. The `pass` on this line means the project-wide `+materialized` value is never stored, so `_resolve_default_materialization`'s fallback to `_root_` always returns `None`. Models without explicit config or a directory-level default will show no materialization in the manifest.</comment>
<file context>
@@ -0,0 +1,421 @@
+def _walk_materialization_config(node: dict, result: dict[str, str]) -> None:
+ mat = node.get("+materialized") or node.get("materialized")
+ if isinstance(mat, str):
+ pass
+
+ for key, value in node.items():
</file context>
@@ -0,0 +1,503 @@
import tempfile
P2: Temp file leak: NamedTemporaryFile(delete=False) is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's tmp_path fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the TemporaryDirectory usage elsewhere in this file).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/fastapi/test_dbt_indexer.py, line 119:
<comment>Temp file leak: `NamedTemporaryFile(delete=False)` is used 8 times in this file but the files are never cleaned up. Each test run accumulates orphaned temp files. Consider using pytest's `tmp_path` fixture, which auto-cleans and is already the idiomatic pytest pattern (and consistent with the `TemporaryDirectory` usage elsewhere in this file).</comment>
<file context>
@@ -0,0 +1,503 @@
+
+class TestParseYamlSources:
+ def test_basic_sources(self):
+ with tempfile.NamedTemporaryFile(
+ suffix=".yml", mode="w", delete=False
+ ) as f:
</file context>
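A sketch of the suggested rewrite with `tmp_path` (test body condensed; the parser under test is omitted since only the fixture pattern matters here):

```python
class TestParseYamlSources:
    def test_basic_sources(self, tmp_path):
        # tmp_path is a pytest-provided pathlib.Path, cleaned up automatically,
        # so there is no NamedTemporaryFile(delete=False) and no orphaned files.
        schema = tmp_path / "schema.yml"
        schema.write_text("sources:\n  - name: core\n    tables: [dim_offers]\n")
        assert schema.exists()
        # ...call the YAML source parser on `schema` here and assert on its result...
```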
Last 2 changes (if I'm not mistaken) to me are:
Summary
- Parses all dbt SQL/YAML files in repos/ and generates searchable markdown index files (manifest.md + sources.md) per dbt project
- Runs automatically on nao sync as a post-sync step, plus FastAPI startup, /api/refresh, and an optional cron scheduler
- The agent is instructed to search dbt-index/ first before grepping raw SQL files

Motivation
When users asked the agent lineage or model questions (e.g., "What is the lineage of stg_orders?" or "What dbt model creates the fact_revenue table?"), the agent had to grep through hundreds of raw SQL files in repos/. This caused several problems:

- The agent could parse ref() calls (upstream) but had no efficient way to find which other models reference a given model (downstream); it would have to grep every single SQL file for the model name
- Answering "Where does source('core', 'dim_offers') point to?" meant manually correlating YAML source definitions with database schemas

Use cases this enables
- Grep one file (manifest.md) to find upstream refs + downstream usages in one search
- Grep manifest.md for the model name and immediately get path + materialization
- The sources field shows source_name.table_name inline in the manifest
- Grep sources.md for the source name to get database + schema + tables
- Grep manifest.md for materialized: incremental

Why index at the Python (FastAPI/CLI) layer
- pyyaml is already a dependency
- Follows the existing databases/ pattern
- Both nao sync (CLI) and FastAPI (server) can import from nao_core.dbt_indexer

Architecture
Data flow
Output (generated in user's project folder)
Trigger points (4 ways to run)
nao synccli/nao_core/commands/sync/__init__.pynao chatstartsmain.pylifespanPOST /api/refreshmain.pyrefresh endpointNAO_REFRESH_SCHEDULEenv varmain.pyAPSchedulerExample output
manifest.md (grep-friendly)
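For illustration, a single entry in manifest.md might look like this (field set assumed from the descriptions in this PR, not the generator's exact output):

```markdown
### stg_orders
path: models/staging/stg_orders.sql
materialized: view
sources: core.raw_orders
refs: (none)
```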
The agent greps manifest.md for a model name and instantly finds:

- the sources and refs fields
- the refs: lines of other models

Files changed
- cli/nao_core/dbt_indexer.py
- apps/backend/fastapi/dbt_indexer.py
- apps/backend/fastapi/main.py
- apps/backend/src/agents/user-rules.ts: add indexed field to Repository type, detect dbt-index/
- apps/backend/src/components/system-prompt.tsx: search dbt-index/ first; show indexed status
- cli/nao_core/commands/sync/__init__.py: nao sync -p repositories
- apps/backend/fastapi/test_dbt_indexer.py

Performance

- Edge cases handled: .sql, binary files, Jinja in YAML, malformed YAML

Test plan
- python -m pytest apps/backend/fastapi/test_dbt_indexer.py -v: 32 tests pass
- make lint (cli/): ty, ruff check, ruff format all pass
- npm run lint: TypeScript lint passes (0 errors)
- nao sync -p repositories on a real project: dbt-index/ generated correctly
- nao chat: agent searches dbt-index/**/manifest.md first (verified)
- POST /api/refresh: returns 200, triggers re-indexing
- sources.md contains correct source-to-database mappings

Fixes #210
🤖 Generated with Claude Code