25 changes: 25 additions & 0 deletions contrib/databricks_ingestion_monitoring/CHANGELOG.txt
@@ -0,0 +1,25 @@
DATABRICKS INGESTION MONITORING DABS
====================================

0.3.4

- Add support for pipeline discovery using pipeline tags
- Enhance AI/BI dashboards to support pipeline selection using tags

0.3.3

- Add support for monitoring expectation check results
- Extend `table_events_metrics` with a new column `num_expectation_dropped_records` that contains the number of rows dropped by expectations
- Add a table `table_events_expectation_checks` which contains the number of rows that passed or failed specific expectation checks
- Update the generic SDP dashboard to expose metrics/visualizations about expectation failures
- Bugfixes in the Datadog sink

0.3.2

- All monitoring ETL pipelines are now configured to write their event logs to the monitoring schema so that the monitoring pipelines can also be monitored. For example, the CDC Monitoring ETL pipeline will write its event log into `{monitoring_catalog}.{monitoring_schema}.cdc_connector_monitoring_etl_event_log` and the Generic SDP monitoring ETL pipeline will write its event log into `{monitoring_catalog}.{monitoring_schema}.generic_sdp_monitoring_etl_event_log`.
- Fix an issue that would cause the monitoring ETL pipelines to periodically get stuck on the `flow_targets` update.


0.3.1

- Fix an issue with the pipeline execution time graph across DABs
137 changes: 137 additions & 0 deletions contrib/databricks_ingestion_monitoring/COMMON_CONFIGURATION.md
@@ -0,0 +1,137 @@
# Common Configuration Guide

This document describes common configuration parameters shared among monitoring DABs (Databricks Asset Bundles).

Configuration is done through variables in a DAB deployment target.


## Required: Specify Monitoring Catalog and Schema

Configure `monitoring_catalog` and `monitoring_schema` to specify where the monitoring tables will be created. The catalog must already exist, but the schema will be created automatically if it doesn't exist.
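For example, a deployment target in `databricks.yml` might set these variables as follows (a minimal sketch; the target name and values are illustrative):

```
targets:
  prod:
    variables:
      monitoring_catalog: main                  # must already exist
      monitoring_schema: ingestion_monitoring   # created automatically if missing
```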


## Required: Specify Pipelines to Monitor

Configuring which pipelines to monitor involves two steps:
1. Choose the method to extract pipeline event logs
2. Identify which pipelines to monitor

### Event Log Extraction Methods

There are two methods to extract a pipeline's event logs:

**Ingesting (Preferred)**
- Extracts event logs directly from a Delta table where the pipeline writes its logs
- Available for pipelines configured with the `event_log` field ([see documentation](https://docs.databricks.com/api/workspace/pipelines/update#event_log))
- Any UC-enabled pipeline using `catalog` and `schema` fields can be configured to store its event log in a Delta table
- Lower cost and better performance than importing

**Importing (Alternative)**
- First imports the pipeline's event log into a Delta table, then extracts from there
- More expensive operation compared to ingesting
- Use only for UC pipelines that use the legacy `catalog`/`target` configuration style
- Requires configuring dedicated import jobs (see ["Optional: Configure Event Log Import Job(s)"](#optional-configure-event-log-import-jobs))

### Pipeline Identification Methods

For both ingested and imported event logs, you can identify pipelines using:

**1. Direct Pipeline IDs**
- Use `directly_monitored_pipeline_ids` for ingested event logs
- Use `imported_pipeline_ids` for imported event logs
- Format: Comma-separated list of pipeline IDs
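For example (the pipeline IDs below are placeholders):

```
directly_monitored_pipeline_ids: "11111111-2222-3333-4444-555555555555,aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
```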

**2. Pipeline Tags**
- Use `directly_monitored_pipeline_tags` for ingested event logs
- Use `imported_pipeline_tags` for imported event logs
- Format: Semicolon-separated lists of comma-separated `tag[:value]` pairs
- **Semicolons (`;`)** = OR logic - pipelines matching ANY list will be selected
- **Commas (`,`)** = AND logic - pipelines matching ALL tags in the list will be selected
- `tag` without a value is equivalent to `tag:` (empty value)

**Example:**
```
directly_monitored_pipeline_tags: "tier:T0;team:data,tier:T1"
```
This selects pipelines with either:
- Tag `tier:T0`, OR
- Tags `team:data` AND `tier:T1`

**Combining Methods:**
All pipeline identification methods can be used together. Pipelines matching any criteria will be included.
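For example, a target could set both of the following (values are illustrative); the monitored set is then the union of the pipeline with that ID and every pipeline tagged `tier:T0`:

```
directly_monitored_pipeline_ids: "11111111-2222-3333-4444-555555555555"
directly_monitored_pipeline_tags: "tier:T0"
```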

> **Performance Tip:** For workspaces with hundreds or thousands of pipelines, enable pipeline tags indexing to significantly speed up tag-based discovery. See ["Optional: Configure Pipelines Tags Indexing Job"](#optional-configure-pipelines-tags-indexing-job) for more information.


## Optional: Monitoring ETL Pipeline Configuration

**Schedule Configuration:**
- Customize the monitoring ETL pipeline schedule using the `monitoring_etl_cron_schedule` variable
- Default: Runs hourly
- Trade-off: Higher frequency increases data freshness but also increases DBU costs
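For example, to run every six hours instead of hourly (assuming the Quartz cron syntax used by Databricks job schedules):

```
monitoring_etl_cron_schedule: "0 0 */6 * * ?"   # second minute hour day-of-month month day-of-week
```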

For additional configuration options, refer to the `variables` section in the `databricks.yml` file for the DAB containing the monitoring ETL pipeline.


## Optional: Configure Event Log Import Job(s)

> **Note:** Only needed if you're using the "Importing" event log extraction method.

**Basic Configuration:**
1. Set `import_event_log_schedule_state` to `UNPAUSED`
- Default schedule: Hourly (configurable via `import_event_log_cron_schedule`)

2. Configure the `imported_event_log_tables` variable in the monitoring ETL pipeline
- Specify the table name(s) where imported logs are stored
- You can reference `${var.imported_event_logs_table_name}`
- Multiple tables can be specified as a comma-separated list
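A minimal sketch of both settings together, reusing the default import table variable:

```
import_event_log_schedule_state: UNPAUSED
imported_event_log_tables: "${var.imported_event_logs_table_name}"
```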

**Handling Pipeline Ownership:**
- If monitored pipelines have a different owner than the DAB owner:
- Edit `common/resources/import_event_logs.job.yml`
- Uncomment the `run_as` principal lines
- Specify the appropriate principal

- If multiple sets of pipelines have different owners:
- Duplicate the job definition in `common/resources/import_event_logs.job.yml`
- Give each job a unique name
- Configure the `run_as` principal for each job as needed
- All jobs can share the same target table (`imported_event_logs_table_name`)
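After uncommenting, the `run_as` block might look like the following (the service principal application ID is a placeholder):

```
run_as:
  service_principal_name: "11111111-2222-3333-4444-555555555555"   # placeholder principal
```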

See [common/vars/import_event_logs.vars.yml](common/vars/import_event_logs.vars.yml) for detailed configuration variable descriptions.


## Optional: Configure Pipelines Tags Indexing Job

> **When to use:** For large-scale deployments with hundreds or thousands of pipelines using tag-based identification.

**Why indexing matters:**
Tag-based pipeline discovery requires fetching metadata for every pipeline via the Databricks API on each event log import and monitoring ETL execution. For large deployments, this can be slow and expensive. The tags index caches this information to significantly improve performance.

**Configuration Steps:**

1. **Enable the index:**
- Set `pipeline_tags_index_enabled` to `true`

2. **Enable the index refresh job:**
- Set `pipeline_tags_index_schedule_state` to `UNPAUSED`
- This job periodically refreshes the index to keep it up-to-date

3. **Optional: Customize refresh schedule**
- Configure `pipeline_tags_index_cron_schedule` (default: daily)
- If you change the schedule, consider adjusting `pipeline_tags_index_max_age_hours` (default: 48 hours)
- When the index is older than the max age threshold, the system falls back to API-based discovery
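Putting the three steps together, a target might set the following (a sketch; the cron value is illustrative and assumes Quartz syntax):

```
pipeline_tags_index_enabled: true
pipeline_tags_index_schedule_state: UNPAUSED
pipeline_tags_index_cron_schedule: "0 0 2 * * ?"   # daily at 02:00
pipeline_tags_index_max_age_hours: 48
```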

See [common/vars/pipeline_tags_index.vars.yml](common/vars/pipeline_tags_index.vars.yml) for detailed configuration variable descriptions.

> **Notes:**
> 1. The system gracefully falls back to API-based discovery if the index is disabled, unavailable, or stale.
> 2. If a recently created or tagged pipeline is missing from the monitoring ETL output, the index may be stale. Run the corresponding `Build *** pipeline tags index` job to refresh the index, then re-run the monitoring ETL pipeline.


## Optional: Configure Third-Party Monitoring Integration

You can export monitoring data to third-party monitoring platforms such as Datadog, Splunk, New Relic, or Azure Monitor.

See [README-third-party-monitoring.md](README-third-party-monitoring.md) for detailed configuration instructions.

51 changes: 51 additions & 0 deletions contrib/databricks_ingestion_monitoring/LICENSE
@@ -0,0 +1,51 @@
DB license

Copyright (2022) Databricks, Inc.

Definitions.

Agreement: The agreement between Databricks, Inc., and you governing the use of the Databricks Services, which shall
be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with
respect to Databricks Community Edition, the Community Edition Terms of Service located at
www.databricks.com/ce-termsofuse, in each case unless you have entered into a separate written agreement with
Databricks governing the use of the applicable Databricks Services.

Software: The source code and object code to which this license applies.

Scope of Use. You may not use this Software except in connection with your use of the Databricks Services pursuant to
the Agreement. Your use of the Software must comply at all times with any restrictions applicable to the Databricks
Services, generally, and must be used in accordance with any applicable documentation. You may view, use, copy,
modify, publish, and/or distribute the Software solely for the purposes of using the code within or connecting to the
Databricks Services. If you do not agree to these terms, you may not view, use, copy, modify, publish, and/or
distribute the Software.

Redistribution. You may redistribute and sublicense the Software so long as all use is in compliance with these terms.
In addition:

You must give any other recipients a copy of this License;
You must cause any modified files to carry prominent notices stating that you changed the files;
You must retain, in the source code form of any derivative works that you distribute, all copyright, patent,
trademark, and attribution notices from the source code form, excluding those notices that do not pertain to any part
of the derivative works; and
If the source code form includes a "NOTICE" text file as part of its distribution, then any derivative works that you
distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those
notices that do not pertain to any part of the derivative works.
You may add your own copyright statement to your modifications and may provide additional license terms and conditions
for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided
your use, reproduction, and distribution of the Software otherwise complies with the conditions stated in this
License.

Termination. This license terminates automatically upon your breach of these terms or upon the termination of your
Agreement. Additionally, Databricks may terminate this license at any time on notice. Upon termination, you must
permanently delete the Software and all copies thereof.

DISCLAIMER; LIMITATION OF LIABILITY.

THE SOFTWARE IS PROVIDED “AS-IS” AND WITH ALL FAULTS. DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY
DISCLAIMS ALL WARRANTIES RELATING TO THE SOURCE CODE, EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED
WARRANTIES, CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR FITNESS FOR A PARTICULAR PURPOSE,
AND NON-INFRINGEMENT. DATABRICKS AND ITS LICENSORS' TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF YOUR USE OF
OR DATABRICKS’ PROVISIONING OF THE SOURCE CODE SHALL BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
17 changes: 17 additions & 0 deletions contrib/databricks_ingestion_monitoring/NOTICE
@@ -0,0 +1,17 @@
Copyright (2025) Databricks, Inc.

This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.

__________
This Software contains code from the following open source projects, licensed under the Apache 2.0 license (https://www.apache.org/licenses/LICENSE-2.0):

requests - https://pypi.org/project/requests/
Copyright 2019 Kenneth Reitz

tenacity - https://pypi.org/project/tenacity/
Copyright Julien Danjou

pyspark - https://pypi.org/project/pyspark/
Copyright 2014 and onwards The Apache Software Foundation.

