Add dbt guidelines as separate file by soltanianalytics · Pull Request #13 · Gemma-Analytics/gemma-sql-style

soltanianalytics · 2025-07-29T14:07:46Z

No description provided.

lpillmann

I like the additions a lot! I think it is a clear and easy read overall.

I left a few comments, some of them are more relevant. Feel free to ignore the non-blocking ones :)

lpillmann · 2025-07-30T15:45:04Z

+There should be one folder per source with:
+- The related base models
+- The `base_<source>_docs foder`, containing the md files of the models
+- The `base_<source>_schemas folder`, containing the yml files of the models
+- The `sources.yml` files for that source


Idea (non-blocking): at HomeToGo/Smoobu the convention was one YAML file per model, e.g.

/models/base/base_chargebee__invoices.sql
/models/base/base_chargebee__invoices.yml

I found it much easier to work with than when everything is combined in a single schemas.yml, because no CTRL+F is needed. You simply search for the filename and you have it.

I heard some other people considering this approach at Gemma, so I'm bringing it up for discussion here. What are your thoughts?
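For illustration only, a minimal sketch of how the one-YAML-per-model convention could be checked automatically, assuming each `.yml` sits right next to its `.sql` file under the same name. The helper name `models_missing_yaml` and the `base_chargebee__customers` model are hypothetical and not part of this PR.

```python
from pathlib import Path
import tempfile


def models_missing_yaml(models_dir):
    """Return the .sql models (relative paths, sorted) that have no
    same-named .yml file in the same folder."""
    root = Path(models_dir)
    missing = []
    for sql_file in sorted(root.rglob("*.sql")):
        # Same folder, same stem, different suffix.
        if not sql_file.with_suffix(".yml").exists():
            missing.append(sql_file.relative_to(root).as_posix())
    return missing


# Demo on a throwaway folder: one documented model, one undocumented
# (hypothetical) model.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "base"
    base.mkdir()
    (base / "base_chargebee__invoices.sql").write_text("select 1")
    (base / "base_chargebee__invoices.yml").write_text("version: 2")
    (base / "base_chargebee__customers.sql").write_text("select 1")
    demo_missing = models_missing_yaml(tmp)

print(demo_missing)
```

If the team settles on `_schemas` subfolders instead, the lookup path inside the loop would need to change accordingly.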

Yes, that's exactly what I want to do! I definitely should explain it better though, haha. Will add a comment :)

@lpillmann do you have a strong opinion about having folders for schemas and md files? (like 1 file x model but in a subfolder) like:

lpillmann · 2025-07-30T15:55:10Z

+  * If you need more complex transformations to make the source useful, consider adding an `interim_` table.
+
+* Only `interim_` models should:
+  * use aggregates, window functions, joins necessary to clean data for use in upstream models.


Suggestion (non-blocking): I think this statement can cause confusion, since we usually apply a QUALIFY ... ROW_NUMBER clause in the base layer to remove duplicates. Because it uses a window function, I think we could word this item under interim as "window functions (except for treating duplicates)" or similar?

Let's discuss this in our next meeting? I'm usually reluctant to have QUALIFY in the base, but would be interested in hearing your experience :)
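To make the trade-off concrete, here is a minimal sketch of what a `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <timestamp> DESC) = 1` filter does to the data, written in plain Python. The column names `order_id`, `status` and `updated_at` are assumptions for the example; the thread does not specify which columns would be used for ordering.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each value of `key`, the row with the greatest `order_by`.

    Rows that tie on `order_by` keep the first one seen. In SQL the winner
    of such a tie is not deterministic unless the ORDER BY is extended.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical raw records: order "1" was extracted twice.
rows = [
    {"order_id": "1", "status": "pending", "updated_at": "2025-07-01"},
    {"order_id": "2", "status": "paid", "updated_at": "2025-07-02"},
    {"order_id": "1", "status": "paid", "updated_at": "2025-07-03"},
]

deduped = latest_per_key(rows, "order_id", "updated_at")
dropped = len(rows) - len(deduped)
print(deduped, dropped)
```

Note that the older record disappears silently, which is the concern raised above; counting the dropped rows, as the sketch does, is one way to keep that visible.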

lpillmann · 2025-07-30T15:56:48Z

+
+#### Model-level documentation
+
+The model-level documentation should live in the `<model_name>.yml` file in the related `<path>_schemas` folder.


Idea (non-blocking): Referring to a previous comment, I think having it in the same folder can be easier to work with despite the larger volume of files. So instead of storing them in the <path>_schemas folder, we would have them side by side with the SQL file.

lpillmann · 2025-07-30T15:59:53Z

+
+* Reporting core tables are categorized between facts and dimensions with a prefix that indicates either, such as: `fact_orders` or `dim_customers`
+
+* Table names should reflect granularity e.g. `orders` should have one order per row and `daily_orders` should have one day per row.


Suggestion (blocking): In previous projects I would use the rpt_ prefix for models derived from the combination of fact and dim tables. Should we be explicit and anticipate this situation here?

It would be something like rpt_daily_mrr_per_user

lpillmann · 2025-07-30T16:00:09Z

+
+* Non-core reporting tables should refer to specific reports or KPIs that are used for the visualisation tool
+
+* Export tables hold data will be loaded into third-party tools via a reverse ETL process (db service users of the tools should only have access to that schema)


Suggestion (blocking): Let's add here the recommended prefix for the export models as well - probably export_?

lpillmann · 2025-07-30T16:05:10Z

+
+* CTEs that are duplicated across models should be pulled out into their own models.
+
+* Only `base_` models should:


Suggestion (blocking): Can we add an item about treating duplicates? In my experience this is one topic that causes problems in different client setups and is relatively easy to solve (apply "defensive" programming in the base layer against spurious records that might arrive).

@lpillmann What would be your suggestion here? (I personally generally try to avoid using qualify because I'm scared I'll miss some issues in the data)

lpillmann · 2025-07-30T16:05:32Z

+
+> Guideline: When approaching testing strategy, keep in mind the following goals:
+> - Proactive Issue Detection: Catch data issues early by implementing targeted tests at key stages of your pipeline.
+> - Targeted Ownership & Triage: Route alerts directly to the person or team best equipped to investigate and resolve the issue.


lpillmann · 2025-07-30T16:09:26Z

+
+* CTEs that are duplicated across models should be pulled out into their own models.
+
+* Only `base_` models should:


Suggestion/discussion (non-blocking): In my experience it is also very useful to create SKs in the base layer (using dbt_utils.generate_surrogate_key), to be used downstream instead of the natural keys. This is useful to have a uniform set of identifiers across the project (e.g. all MD5 hashes) and to avoid issues like numeric IDs, among other things. E.g. instead of the user_id from the source system, we would use a user_sk in the whole project.

add section explaining this approach @lpillmann
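As a starting point for that section, a minimal Python sketch of the hashing idea behind a surrogate key. It approximates what `dbt_utils.generate_surrogate_key` is understood to compile to (string-cast fields joined with `-`, nulls replaced by a placeholder, then MD5); the separator and placeholder are assumptions to verify against the dbt_utils version in use, and the function name `surrogate_key` is hypothetical.

```python
import hashlib

# Assumed null placeholder; check the dbt_utils source before relying on it.
NULL_PLACEHOLDER = "_dbt_utils_surrogate_key_null_"


def surrogate_key(*values):
    """Hash one or more field values into a 32-character hex MD5 string."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


# Single natural key: a numeric user_id becomes a string user_sk.
user_sk = surrogate_key(42)

# Composite granularity, e.g. one row per day and channel.
daily_channel_sk = surrogate_key("2025-07-30", "email")

print(user_sk, daily_channel_sk)
```

Because every key ends up as a fixed-length string, a numeric ID and its string form hash to the same value, which is the uniformity argument made above. Warehouse-specific casting (for example of floats or timestamps) may produce different strings than Python's `str`, so this is not a drop-in replacement for the macro.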

lpillmann · 2025-07-30T16:11:22Z

+In most cases, **snapshots should be built directly on raw source tables** (`source(...)`).  
+This ensures you capture the original state of the data without relying on any upstream transformation logic, which could break or evolve over time.
+
+> Cleaning or correcting a broken snapshot is very tricky, it's better to snapshot cleanly and simply from the start.


Suggestion (non-blocking): Maybe for the next iteration, do we have a set of sane snapshot configurations that we would recommend? E.g. use the new_record strategy for deleted records, which columns to check, etc.?

Yes! I think snapshot strategies are super relevant, probably a topic to align on :)

@lpillmann to elaborate a bit on the snapshots :)

lpillmann · 2025-07-30T16:13:23Z

  * Rename fields if source tables do not adhere to these naming conventions
  * Every table must define its primary key explicitly from the start using a descriptive name like `user_id`, `item_id`, etc.
+    - **id fields should be of type STRING** (always cast them in the base layer!)
+    - for tables with composite ids (ex. day and channel) use `CONCAT_WS` or the `dbt_utils.generate_surrogate_key` macro


This relates to a previous comment of mine about SKs. I would suggest always creating SKs using that macro, even when the granularity is represented by a single column e.g. user_id becomes user_sk

elenalfo · 2025-07-30T17:52:55Z

+    |       ├── _interim_docs
+    |       |   └── interim_customers.md
+    |       ├── _interim_schemas
+    |       |   └── interim_customers.md


Self-reminder to change: This should be .yml

elenalfo · 2025-07-30T17:53:24Z

+    |       |   |   ├── _fact_docs
+    |       |   |   |   └── fact_orders.md
+    |       |   |   ├── _fact_schemas
+    |       |   |   |   └── fact_orders.md


Self-reminder to change: This should be .yml

elenalfo · 2025-09-23T15:17:35Z

+- Modularity: Each model does one thing well.
+- Debuggability: Easier to test and trace logic in isolation.
+
+The folder structure should be:


this should be an example

elenalfo · 2025-09-23T15:17:52Z

+- The `base_<source>_schemas folder`, containing the yml files of the models
+- The `sources.yml` files for that source
+
+### interim/


change to derived

elenalfo · 2025-09-23T15:23:46Z

+        └── test_orders_tax_amount_discrepancy.sql
+
+```
+All transformation models are housed in the models/ folder and follow a layered structure that enforces a clean, directional flow of data. Models should flow from base → interim → reporting → export, with no reverse dependencies.


(add that exceptions exist)

elenalfo · 2025-09-23T16:01:48Z

+
+It does not cover SQL & jinja style, which are covered by the [Gemma SQL Style Guide](https://github.com/Gemma-Analytics/gemma-sql-style/blob/12c7ee25428cc336423acc224a189dba47f7f002/README.md)
+
+## Table of contents


add (or update) dbt project yml section

link to best practice repo

- Fix typos: raw, folder, field-level, duplicate "to", "that will be"
- Fix broken ToC anchor link for In-line documentation
- Promote Documentation to top-level section (## instead of ###)
- Normalize subsection heading levels under Documentation (### instead of ####)
- Standardize all bullet points to * style
- Replace all ex. with e.g.
- Capitalize list items in Tests best practices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add dbt guidelines as separate file

e494190

soltanianalytics requested review from elenalfo and lpillmann July 29, 2025 14:07

lpillmann reviewed Jul 30, 2025

elenalfo reviewed Jul 30, 2025

elenalfo reviewed Sep 23, 2025
