dbx_ingestion_monitoring #126
Open

cbotev-databricks wants to merge 3 commits into databricks:main from cbotev-databricks:dbx_ingestion_monitoring_0.3.4
+25,897 −0
DATABRICKS INGESTION MONITORING DABS
====================================

0.3.4

- Add support for pipeline discovery using pipeline tags
- Enhance AI/BI dashboards to support pipeline selection using tags

0.3.3

- Add support for monitoring expectation check results
- Extend `table_events_metrics` with a new column `num_expectation_dropped_records` that contains the number of rows dropped by expectations
- Add a table `table_events_expectation_checks` which contains the number of rows that passed or failed specific expectation checks
- Update the generic SDP dashboard to expose metrics/visualizations about expectation failures
- Bugfixes in the Datadog sink

0.3.2

- All monitoring ETL pipelines are now configured to write their event logs to the monitoring schema so that the monitoring pipelines can themselves be monitored. For example, the CDC Monitoring ETL pipeline writes its event log to `{monitoring_catalog}.{monitoring_schema}.cdc_connector_monitoring_etl_event_log`, and the Generic SDP monitoring ETL pipeline writes its event log to `{monitoring_catalog}.{monitoring_schema}.generic_sdp_monitoring_etl_event_log`.
- Fix an issue that would cause the Monitoring ETL pipelines to periodically get stuck on the `flow_targets` update

0.3.1

- Fix an issue with the pipeline execution time graph across DABs
137 changes: 137 additions & 0 deletions
contrib/databricks_ingestion_monitoring/COMMON_CONFIGURATION.md
# Common Configuration Guide

This document describes common configuration parameters shared among the monitoring DABs (Databricks Asset Bundles).

Configuration is done through variables in a DAB deployment target.

## Required: Specify Monitoring Catalog and Schema

Configure `monitoring_catalog` and `monitoring_schema` to specify where the monitoring tables will be created. The catalog must already exist, but the schema will be created automatically if it doesn't exist.
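For example, a deployment target in `databricks.yml` might set these variables as follows (a minimal sketch; the target, catalog, and schema names are placeholders):

```yaml
# databricks.yml (deployment target excerpt; names are illustrative)
targets:
  prod:
    variables:
      monitoring_catalog: main                  # must already exist
      monitoring_schema: ingestion_monitoring   # created automatically if missing
```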
## Required: Specify Pipelines to Monitor

Configuring which pipelines to monitor involves two steps:
1. Choose the method to extract pipeline event logs
2. Identify which pipelines to monitor

### Event Log Extraction Methods

There are two methods to extract a pipeline's event logs:

**Ingesting (Preferred)**
- Extracts event logs directly from a Delta table where the pipeline writes its logs
- Available for pipelines configured with the `event_log` field ([see documentation](https://docs.databricks.com/api/workspace/pipelines/update#event_log))
- Any UC-enabled pipeline using the `catalog` and `schema` fields can be configured to store its event log in a Delta table
- Lower cost and better performance than importing

**Importing (Alternative)**
- First imports the pipeline's event log into a Delta table, then extracts from there
- More expensive than ingesting
- Use only for UC pipelines that use the legacy `catalog`/`target` configuration style
- Requires configuring dedicated import jobs (see ["Optional: Configure Event Log Import Job(s)"](#optional-configure-event-log-import-jobs))
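For the ingesting method, the monitored pipeline itself must write its event log to a Delta table. A hypothetical pipeline resource configured this way might look like the following (the `event_log` field and its `catalog`/`schema`/`name` subfields follow the pipelines API documentation linked above; all names are placeholders):

```yaml
# Pipeline resource excerpt (illustrative)
resources:
  pipelines:
    sales_ingestion:
      catalog: main
      schema: sales
      event_log:
        catalog: main
        schema: ingestion_monitoring
        name: sales_ingestion_event_log   # Delta table the monitoring ETL ingests from
```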
### Pipeline Identification Methods

For both ingested and imported event logs, you can identify pipelines using:

**1. Direct Pipeline IDs**
- Use `directly_monitored_pipeline_ids` for ingested event logs
- Use `imported_pipeline_ids` for imported event logs
- Format: comma-separated list of pipeline IDs

**2. Pipeline Tags**
- Use `directly_monitored_pipeline_tags` for ingested event logs
- Use `imported_pipeline_tags` for imported event logs
- Format: semicolon-separated lists of comma-separated `tag[:value]` pairs
- **Semicolons (`;`)** = OR logic - pipelines matching ANY list will be selected
- **Commas (`,`)** = AND logic - pipelines matching ALL tags in a list will be selected
- A `tag` without a value is equivalent to `tag:` (empty value)

**Example:**
```
directly_monitored_pipeline_tags: "tier:T0;team:data,tier:T1"
```
This selects pipelines with either:
- Tag `tier:T0`, OR
- Tags `team:data` AND `tier:T1`

**Combining Methods:**
All pipeline identification methods can be used together; pipelines matching any of the criteria will be included.
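For instance, a target combining both methods might look like this (a sketch; the pipeline IDs and tag values are placeholders):

```yaml
targets:
  prod:
    variables:
      # Monitor these two pipelines explicitly (placeholder IDs)...
      directly_monitored_pipeline_ids: "11111111-1111-1111-1111-111111111111,22222222-2222-2222-2222-222222222222"
      # ...plus any pipeline tagged tier:T0, or tagged both team:data and tier:T1
      directly_monitored_pipeline_tags: "tier:T0;team:data,tier:T1"
```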
> **Performance Tip:** For workspaces with hundreds or thousands of pipelines, enable pipeline tags indexing to significantly speed up tag-based discovery. See ["Optional: Configure Pipelines Tags Indexing Job"](#optional-configure-pipelines-tags-indexing-job) for more information.
## Optional: Monitoring ETL Pipeline Configuration

**Schedule Configuration:**
- Customize the monitoring ETL pipeline schedule using the `monitoring_etl_cron_schedule` variable
- Default: runs hourly
- Trade-off: higher frequency increases data freshness but also increases DBU costs

For additional configuration options, refer to the `variables` section of the `databricks.yml` file for the DAB containing the monitoring ETL pipeline.
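As an illustration, running the monitoring ETL every 30 minutes instead of hourly could look like this (a sketch; Databricks job schedules use Quartz cron syntax):

```yaml
variables:
  monitoring_etl_cron_schedule: "0 0/30 * * * ?"   # every 30 minutes (Quartz: sec min hr dom mon dow)
```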
## Optional: Configure Event Log Import Job(s)

> **Note:** Only needed if you're using the "Importing" event log extraction method.

**Basic Configuration:**
1. Set `import_event_log_schedule_state` to `UNPAUSED`
   - Default schedule: hourly (configurable via `import_event_log_cron_schedule`)
2. Configure the `imported_event_log_tables` variable in the monitoring ETL pipeline
   - Specify the table name(s) where imported logs are stored
   - You can reference `${var.imported_event_logs_table_name}`
   - Multiple tables can be specified as a comma-separated list
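Combining the two steps, a target that enables the hourly import job and points the monitoring ETL at the import table might look like this (a sketch; values are illustrative):

```yaml
targets:
  prod:
    variables:
      import_event_log_schedule_state: UNPAUSED
      # The monitoring ETL reads imported logs from this table
      imported_event_log_tables: "${var.imported_event_logs_table_name}"
```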
**Handling Pipeline Ownership:**
- If monitored pipelines have a different owner than the DAB owner:
  - Edit `common/resources/import_event_logs.job.yml`
  - Uncomment the `run_as` principal lines
  - Specify the appropriate principal
- If multiple sets of pipelines have different owners:
  - Duplicate the job definition in `common/resources/import_event_logs.job.yml`
  - Give each job a unique name
  - Configure the `run_as` principal for each job as needed
  - All jobs can share the same target table (`imported_event_logs_table_name`)
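Once uncommented, the `run_as` stanza might look roughly like this (a sketch; the job key and principal are placeholders, and DAB jobs also accept `user_name` in place of `service_principal_name`):

```yaml
# common/resources/import_event_logs.job.yml (excerpt; job key is illustrative)
resources:
  jobs:
    import_event_logs:
      run_as:
        service_principal_name: "00000000-0000-0000-0000-000000000000"  # placeholder application ID
```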
See [common/vars/import_event_logs.vars.yml](common/vars/import_event_logs.vars.yml) for detailed configuration variable descriptions.
## Optional: Configure Pipelines Tags Indexing Job

> **When to use:** For large-scale deployments with hundreds or thousands of pipelines using tag-based identification.

**Why indexing matters:**
Tag-based pipeline discovery requires fetching metadata for every pipeline via the Databricks API on each event log import and monitoring ETL execution. For large deployments, this can be slow and expensive. The tags index caches this information to significantly improve performance.

**Configuration Steps:**

1. **Enable the index:**
   - Set `pipeline_tags_index_enabled` to `true`

2. **Enable the index refresh job:**
   - Set `pipeline_tags_index_schedule_state` to `UNPAUSED`
   - This job periodically refreshes the index to keep it up to date

3. **Optional: Customize the refresh schedule**
   - Configure `pipeline_tags_index_cron_schedule` (default: daily)
   - If you change the schedule, consider adjusting `pipeline_tags_index_max_age_hours` (default: 48 hours)
   - When the index is older than the max-age threshold, the system falls back to API-based discovery
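Putting the three settings together, a configuration with a six-hour refresh and a tighter staleness threshold might look like this (a sketch; the cron expression assumes Quartz syntax):

```yaml
variables:
  pipeline_tags_index_enabled: true
  pipeline_tags_index_schedule_state: UNPAUSED
  pipeline_tags_index_cron_schedule: "0 0 0/6 * * ?"   # every 6 hours
  pipeline_tags_index_max_age_hours: 12                # beyond this, fall back to API-based discovery
```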
See [common/vars/pipeline_tags_index.vars.yml](common/vars/pipeline_tags_index.vars.yml) for detailed configuration variable descriptions.

> **Notes:**
> 1. The system gracefully falls back to API-based discovery if the index is disabled, unavailable, or stale.
> 2. If a recently created or tagged pipeline is missing from the monitoring ETL output, the index may be stale. Run the corresponding `Build *** pipeline tags index` job to refresh the index, then re-run the monitoring ETL pipeline.
## Optional: Configure Third-Party Monitoring Integration

You can export monitoring data to third-party monitoring platforms such as Datadog, Splunk, New Relic, or Azure Monitor.

See [README-third-party-monitoring.md](README-third-party-monitoring.md) for detailed configuration instructions.
DB license

Copyright (2022) Databricks, Inc.

Definitions.

Agreement: The agreement between Databricks, Inc., and you governing the use of the Databricks Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless you have entered into a separate written agreement with Databricks governing the use of the applicable Databricks Services.

Software: The source code and object code to which this license applies.

Scope of Use. You may not use this Software except in connection with your use of the Databricks Services pursuant to the Agreement. Your use of the Software must comply at all times with any restrictions applicable to the Databricks Services, generally, and must be used in accordance with any applicable documentation. You may view, use, copy, modify, publish, and/or distribute the Software solely for the purposes of using the code within or connecting to the Databricks Services. If you do not agree to these terms, you may not view, use, copy, modify, publish, and/or distribute the Software.

Redistribution. You may redistribute and sublicense the Software so long as all use is in compliance with these terms. In addition:

You must give any other recipients a copy of this License;
You must cause any modified files to carry prominent notices stating that you changed the files;
You must retain, in the source code form of any derivative works that you distribute, all copyright, patent, trademark, and attribution notices from the source code form, excluding those notices that do not pertain to any part of the derivative works; and
If the source code form includes a "NOTICE" text file as part of its distribution, then any derivative works that you distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the derivative works.

You may add your own copyright statement to your modifications and may provide additional license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the Software otherwise complies with the conditions stated in this License.

Termination. This license terminates automatically upon your breach of these terms or upon the termination of your Agreement. Additionally, Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all copies thereof.

DISCLAIMER; LIMITATION OF LIABILITY.

THE SOFTWARE IS PROVIDED “AS-IS” AND WITH ALL FAULTS. DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY DISCLAIMS ALL WARRANTIES RELATING TO THE SOURCE CODE, EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES, CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. DATABRICKS AND ITS LICENSORS' TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF YOUR USE OF OR DATABRICKS’ PROVISIONING OF THE SOURCE CODE SHALL BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Copyright (2025) Databricks, Inc.

This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.

__________
This Software contains code from the following open source projects, licensed under the Apache 2.0 license (https://www.apache.org/licenses/LICENSE-2.0):

requests - https://pypi.org/project/requests/
Copyright 2019 Kenneth Reitz

tenacity - https://pypi.org/project/tenacity/
Copyright Julien Danjou

pyspark - https://pypi.org/project/pyspark/
Copyright 2014 and onwards The Apache Software Foundation.