Conversation
lpillmann
left a comment
There was a problem hiding this comment.
I like the additions a lot! I think it is a clear and easy read overall.
I left a few comments, some of them are more relevant. Feel free to ignore the non-blocking ones :)
| There should be one folder per source with: | ||
| - The related base models | ||
| - The `base_<source>_docs foder`, containing the md files of the models | ||
| - The `base_<source>_schemas folder`, containing the yml files of the models | ||
| - The `sources.yml` files for that source |
There was a problem hiding this comment.
Idea (non-blocking): at HomeToGo/Smoobu the convention was one YAML file per model, e.g.
/models/base/base_chargebee__invoices.sql
/models/base/base_chargebee__invoices.yml
I found it much easier to work with than when it is all combined in a single schemas.yml, because no CTRL+F is needed. You simply search for the filename it you have it
I heard some other people considering this approach at Gemma, so I'm bringing it for discussion here. What are your thoughts?
There was a problem hiding this comment.
Yess, that's exactly what i want to do! I definitely should explain it better though haha will add a comment :)
There was a problem hiding this comment.
@lpillmann do you have a strong opinion about having folders for schemas and md files? (like 1 file x model but in a subfolder) like:

| * If you need more complex transformations to make the source useful, consider adding an `interim_` table. | ||
|
|
||
| * Only `interim_` models should: | ||
| * use aggregates, window functions, joins necessary to clean data for use in upstream models. |
There was a problem hiding this comment.
Suggestion (non-blocking): I think this statement can cause confusion, since we usually apply a QUALIFY ... ROW_NUMBER clause in the base layer to remove duplicates. Because it uses a window function, I think we could word this item under interim as "window functions (except for treating duplicates)" or similar?
There was a problem hiding this comment.
Let's discuss this in our next meeting? I'm usually reluctant to have QUALIFY in the base, but would be interested in hearing your experience :)
|
|
||
| #### Model-level documentation | ||
|
|
||
| The model-level documentation should live in the `<model_name>.yml` file in the related `<path>_schemas` folder. |
There was a problem hiding this comment.
Idea (non-blocking) Referring to a previous comment, I think having it on the same folder can be easier to work with despite the larger volume of files. So instead of storing the in the <path>_schemas folder, we would have them side by side with the SQL file
|
|
||
| * Reporting core tables are categorized between facts and dimensions with a prefix that indicates either, such as: `fact_orders` or `dim_customers` | ||
|
|
||
| * Table names should reflect granularity e.g. `orders` should have one order per row and `daily_orders` should have one day per row. |
There was a problem hiding this comment.
Suggestion (blocking): In previous projects I would use the rpt_ prefix for models derived from the combination of fact and dim tables. Should we be explicit and anticipate this situation here?
It would be something like rpt_daily_mrr_per_user
|
|
||
| * Non-core reporting tables should refer to specific reports or KPIs that are used for the visualisation tool | ||
|
|
||
| * Export tables hold data will be loaded into third-party tools via a reverse ETL process (db service users of the tools should only have access to that schema) |
There was a problem hiding this comment.
Suggestion (blocking): Let's add here the recommended prefix for the export models as well - probably export_?
|
|
||
| * CTEs that are duplicated across models should be pulled out into their own models. | ||
|
|
||
| * Only `base_` models should: |
There was a problem hiding this comment.
Suggestion (blocking): Can we add an item about treating duplicates? In my experience this is one topic that causes problems in different client setups and is relatively easy to solve (make a "defensive" programming in the base layer against spurious records that might come)
There was a problem hiding this comment.
@lpillmann What would be your suggestion here? (I personally generally try to avoid using qualify because I'm scared I'll miss some issues in the data)
|
|
||
| > Guideline: When approaching testing strategy, keep in mind the following goals: | ||
| > - Proactive Issue Detection: Catch data issues early by implementing targeted tests at key stages of your pipeline. | ||
| > - Targeted Ownership & Triage: Route alerts directly to the person or team best equipped to investigate and resolve the issue. |
|
|
||
| * CTEs that are duplicated across models should be pulled out into their own models. | ||
|
|
||
| * Only `base_` models should: |
There was a problem hiding this comment.
Suggestion/discussion (non-blocking): I'm my experience it is also very useful to create SKs in the base layer (using dbt_utils.generate_surrogate_key), to be used downstream instead of the natural keys. This is useful to have a uniform sets of identifiers along the project (e.g. all MD5 hashes) to avoid issues like numeric IDs, among other things. E.g. instead of the user_id from the source system, we would use a user_sk in the whole project
There was a problem hiding this comment.
add section explaining this approach @lpillmann
| In most cases, **snapshots should be built directly on raw source tables** (`source(...)`). | ||
| This ensures you capture the original state of the data without relying on any upstream transformation logic, which could break or evolve over time. | ||
|
|
||
| > Cleaning or correcting a broken snapshot is very tricky, it's better to snapshot cleanly and simply from the start. |
There was a problem hiding this comment.
Suggestion (non-blocking): Maybe for the next iteration, do we have a set of sane snapshots configurations that we would recommend? E.g. use the new_record strategy for deleted records, which columns to check, etc?
There was a problem hiding this comment.
Yes! I think snapshot startegies are super relevent, probably a topic to align on :)
There was a problem hiding this comment.
@lpillmann to elaborate a bit on the snapshots :)
| * Rename fields if source tables do not adhere to these naming conventions | ||
| * Every table must define its primary key explicitly from the start using a descriptive name like `user_id`, `item_id`, etc. | ||
| - **id fields should be of type STRING** (always cast them in the base layer!) | ||
| - for tables with composite ids (ex. day and channel) use `CONCAT_WS` or the `dbt_utils.generate_surrogate_key` macro |
There was a problem hiding this comment.
This relates to a previous comment of mine about SKs. I would suggest always creating SKs using that macro, even when the granularity is represented by a single column e.g. user_id becomes user_sk
| | ├── _interim_docs | ||
| | | └── interim_customers.md | ||
| | ├── _interim_schemas | ||
| | | └── interim_customers.md |
There was a problem hiding this comment.
Self-reminder to change: This should be .yml
| | | | ├── _fact_docs | ||
| | | | | └── fact_orders.md | ||
| | | | ├── _fact_schemas | ||
| | | | | └── fact_orders.md |
There was a problem hiding this comment.
Self-reminder to change: This should be .yml
| - Modularity: Each model does one thing well. | ||
| - Debuggability: Easier to test and trace logic in isolation. | ||
|
|
||
| The folder structure should be: |
| - The `base_<source>_schemas folder`, containing the yml files of the models | ||
| - The `sources.yml` files for that source | ||
|
|
||
| ### interim/ |
| └── test_orders_tax_amount_discrepancy.sql | ||
|
|
||
| ``` | ||
| All transformation models are housed in the models/ folder and follow a layered structure that enforces a clean, directional flow of data. Models should flow from base → interim → reporting → export, with no reverse dependencies. |
|
|
||
| It does not cover SQL & jinja style, which are covered by the [Gemma SQL Style Guide](https://github.com/Gemma-Analytics/gemma-sql-style/blob/12c7ee25428cc336423acc224a189dba47f7f002/README.md) | ||
|
|
||
| ## Table of contents |
There was a problem hiding this comment.
add (or update) dbt project yml section
- Fix typos: raw, folder, field-level, duplicate "to", "that will be" - Fix broken ToC anchor link for In-line documentation - Promote Documentation to top-level section (## instead of ###) - Normalize subsection heading levels under Documentation (### instead of ####) - Standardize all bullet points to * style - Replace all ex. with e.g. - Capitalize list items in Tests best practices Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No description provided.