Skip to content

Add dbt guidelines as separate file#13

Open
soltanianalytics wants to merge 2 commits intomasterfrom
best_practice_revamp_202507
Open

Add dbt guidelines as separate file#13
soltanianalytics wants to merge 2 commits intomasterfrom
best_practice_revamp_202507

Conversation

@soltanianalytics
Copy link
Copy Markdown
Member

No description provided.

Copy link
Copy Markdown
Contributor

@lpillmann lpillmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the additions a lot! I think it is a clear and easy read overall.

I left a few comments, some of them are more relevant. Feel free to ignore the non-blocking ones :)

Comment thread DBT_GUIDELINES.md Outdated
Comment on lines +93 to +97
There should be one folder per source with:
- The related base models
- The `base_<source>_docs foder`, containing the md files of the models
- The `base_<source>_schemas folder`, containing the yml files of the models
- The `sources.yml` files for that source
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea (non-blocking): at HomeToGo/Smoobu the convention was one YAML file per model, e.g.

/models/base/base_chargebee__invoices.sql
/models/base/base_chargebee__invoices.yml

I found it much easier to work with than when it is all combined in a single schemas.yml, because no CTRL+F is needed. You simply search for the filename it you have it

I heard some other people considering this approach at Gemma, so I'm bringing it for discussion here. What are your thoughts?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yess, that's exactly what i want to do! I definitely should explain it better though haha will add a comment :)

Copy link
Copy Markdown

@elenalfo elenalfo Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lpillmann do you have a strong opinion about having folders for schemas and md files? (like 1 file x model but in a subfolder) like:
image

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no folder

Comment thread DBT_GUIDELINES.md
* If you need more complex transformations to make the source useful, consider adding an `interim_` table.

* Only `interim_` models should:
* use aggregates, window functions, joins necessary to clean data for use in upstream models.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (non-blocking): I think this statement can cause confusion, since we usually apply a QUALIFY ... ROW_NUMBER clause in the base layer to remove duplicates. Because it uses a window function, I think we could word this item under interim as "window functions (except for treating duplicates)" or similar?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this in our next meeting? I'm usually reluctant to have QUALIFY in the base, but would be interested in hearing your experience :)

Comment thread DBT_GUIDELINES.md

#### Model-level documentation

The model-level documentation should live in the `<model_name>.yml` file in the related `<path>_schemas` folder.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea (non-blocking) Referring to a previous comment, I think having it on the same folder can be easier to work with despite the larger volume of files. So instead of storing the in the <path>_schemas folder, we would have them side by side with the SQL file

Comment thread DBT_GUIDELINES.md

* Reporting core tables are categorized between facts and dimensions with a prefix that indicates either, such as: `fact_orders` or `dim_customers`

* Table names should reflect granularity e.g. `orders` should have one order per row and `daily_orders` should have one day per row.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (blocking): In previous projects I would use the rpt_ prefix for models derived from the combination of fact and dim tables. Should we be explicit and anticipate this situation here?

It would be something like rpt_daily_mrr_per_user

Comment thread DBT_GUIDELINES.md Outdated

* Non-core reporting tables should refer to specific reports or KPIs that are used for the visualisation tool

* Export tables hold data will be loaded into third-party tools via a reverse ETL process (db service users of the tools should only have access to that schema)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (blocking): Let's add here the recommended prefix for the export models as well - probably export_?

Comment thread DBT_GUIDELINES.md

* CTEs that are duplicated across models should be pulled out into their own models.

* Only `base_` models should:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (blocking): Can we add an item about treating duplicates? In my experience this is one topic that causes problems in different client setups and is relatively easy to solve (make a "defensive" programming in the base layer against spurious records that might come)

Copy link
Copy Markdown

@elenalfo elenalfo Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lpillmann What would be your suggestion here? (I personally generally try to avoid using qualify because I'm scared I'll miss some issues in the data)

Comment thread DBT_GUIDELINES.md

> Guideline: When approaching testing strategy, keep in mind the following goals:
> - Proactive Issue Detection: Catch data issues early by implementing targeted tests at key stages of your pipeline.
> - Targeted Ownership & Triage: Route alerts directly to the person or team best equipped to investigate and resolve the issue.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!

Comment thread DBT_GUIDELINES.md

* CTEs that are duplicated across models should be pulled out into their own models.

* Only `base_` models should:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion/discussion (non-blocking): I'm my experience it is also very useful to create SKs in the base layer (using dbt_utils.generate_surrogate_key), to be used downstream instead of the natural keys. This is useful to have a uniform sets of identifiers along the project (e.g. all MD5 hashes) to avoid issues like numeric IDs, among other things. E.g. instead of the user_id from the source system, we would use a user_sk in the whole project

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add section explaining this approach @lpillmann

Comment thread DBT_GUIDELINES.md
In most cases, **snapshots should be built directly on raw source tables** (`source(...)`).
This ensures you capture the original state of the data without relying on any upstream transformation logic, which could break or evolve over time.

> Cleaning or correcting a broken snapshot is very tricky, it's better to snapshot cleanly and simply from the start.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (non-blocking): Maybe for the next iteration, do we have a set of sane snapshots configurations that we would recommend? E.g. use the new_record strategy for deleted records, which columns to check, etc?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I think snapshot startegies are super relevent, probably a topic to align on :)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lpillmann to elaborate a bit on the snapshots :)

Comment thread README.md
* Rename fields if source tables do not adhere to these naming conventions
* Every table must define its primary key explicitly from the start using a descriptive name like `user_id`, `item_id`, etc.
- **id fields should be of type STRING** (always cast them in the base layer!)
- for tables with composite ids (ex. day and channel) use `CONCAT_WS` or the `dbt_utils.generate_surrogate_key` macro
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to a previous comment of mine about SKs. I would suggest always creating SKs using that macro, even when the granularity is represented by a single column e.g. user_id becomes user_sk

Comment thread DBT_GUIDELINES.md
| ├── _interim_docs
| | └── interim_customers.md
| ├── _interim_schemas
| | └── interim_customers.md
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder to change: This should be .yml

Comment thread DBT_GUIDELINES.md
| | | ├── _fact_docs
| | | | └── fact_orders.md
| | | ├── _fact_schemas
| | | | └── fact_orders.md
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder to change: This should be .yml

Comment thread DBT_GUIDELINES.md
- Modularity: Each model does one thing well.
- Debuggability: Easier to test and trace logic in isolation.

The folder structure should be:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an example

Comment thread DBT_GUIDELINES.md
- The `base_<source>_schemas folder`, containing the yml files of the models
- The `sources.yml` files for that source

### interim/
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to derived

Comment thread DBT_GUIDELINES.md
└── test_orders_tax_amount_discrepancy.sql

```
All transformation models are housed in the models/ folder and follow a layered structure that enforces a clean, directional flow of data. Models should flow from base → interim → reporting → export, with no reverse dependencies.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(add exceptions exist)

Comment thread DBT_GUIDELINES.md

It does not cover SQL & jinja style, which are covered by the [Gemma SQL Style Guide](https://github.com/Gemma-Analytics/gemma-sql-style/blob/12c7ee25428cc336423acc224a189dba47f7f002/README.md)

## Table of contents
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add (or update) dbt project yml section

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to best practice repo

- Fix typos: raw, folder, field-level, duplicate "to", "that will be"
- Fix broken ToC anchor link for In-line documentation
- Promote Documentation to top-level section (## instead of ###)
- Normalize subsection heading levels under Documentation (### instead of ####)
- Standardize all bullet points to * style
- Replace all ex. with e.g.
- Capitalize list items in Tests best practices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants