Description
I have identified a critical bug where dataset_fields are duplicated indefinitely if the field type is null (or omitted). This behavior triggers a cascading, exponential explosion of rows in the column_lineage table, leading to severe database performance degradation.
Root Cause
The issue stems from the ON CONFLICT clause of the PostgreSQL INSERT statement for dataset_fields:
INSERT INTO dataset_fields ...
ON CONFLICT(dataset_uuid, name, type) DO UPDATE ...
In SQL, NULL never compares equal to NULL, and PostgreSQL unique constraints treat NULL values as distinct by default. Therefore, if a field's type is NULL, the unique constraint behind this conflict target never matches the existing row. As a result, Marquez treats every incoming event for such fields as a "new" entry and performs an INSERT instead of an UPDATE.
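The behavior can be reproduced outside Marquez with a minimal sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for PostgreSQL, since both treat NULLs as distinct in unique constraints by default. The table is a simplified, hypothetical version of dataset_fields, and the description column is invented for the example.

```python
import sqlite3

# SQLite stands in for PostgreSQL here. The schema is a simplified,
# hypothetical version of dataset_fields, not the real Marquez table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        description  TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
    """
)

UPSERT = """
    INSERT INTO dataset_fields (dataset_uuid, name, type, description)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (dataset_uuid, name, type)
    DO UPDATE SET description = excluded.description
"""


def count_rows(name):
    """Count the rows stored for one field name."""
    query = "SELECT COUNT(*) FROM dataset_fields WHERE name = ?"
    return conn.execute(query, (name,)).fetchone()[0]


# Send the same two fields three times: one typed, one with a NULL type.
for _ in range(3):
    conn.execute(UPSERT, ("ds-1", "id", "INTEGER", "typed field"))
    conn.execute(UPSERT, ("ds-1", "modified_at", None, "untyped field"))

# NULL = NULL evaluates to NULL (unknown), which Python receives as None.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
typed_rows = count_rows("id")             # the conflict target matches: upsert
untyped_rows = count_rows("modified_at")  # never matches: one row per event

print(null_eq, typed_rows, untyped_rows)
```

The typed field stays at one row, while the field with a NULL type gains a row on every event.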
Impact: Exponential Data Explosion
This bug turns a linear data ingestion process into an exponential one.
- Field Duplication:
  - A single field (e.g., modified_at) can create thousands of duplicate rows in the dataset_fields table over time if the type is consistently null.
- Lineage Explosion (The Multiplier Effect):
  - When a job runs to link this duplicated input field to an output field, Marquez generates lineage edges for ALL duplicates.
  - Expected: 1 Input Field UUID → 1 Output Field UUID = 1 row in column_lineage.
  - Actual: N Input Field UUIDs → 1 Output Field UUID = N rows in column_lineage.
- Consequence:
  - In my environment, I observed that the column_lineage table contained over 1,000x more rows than necessary.
  - Nearly 99.9% of the lineage data was garbage caused by this duplication.
  - This massive overhead causes API timeouts when querying lineage and significantly increases database storage usage.
Steps to Reproduce
- Emit an OpenLineage event for a dataset where a field's type is explicitly null.
- Emit the same event multiple times (e.g., 10 times).
- Check the dataset_fields table; you will see 10 duplicate rows for the same field name.
- Emit a columnLineage event involving that field.
- Check the column_lineage table; you will see 10 lineage rows created for what should be a single relationship.
Suggested Fix
Modify the INSERT query in the DAO layer to handle NULL types explicitly.
Using COALESCE(:type, 'UNKNOWN') (or a similar default value) ensures that the ON CONFLICT clause correctly matches existing rows and performs an update instead of a duplicate insert.
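As a rough check of the idea, the sketch below applies COALESCE(:type, 'UNKNOWN') inside the insert. It again assumes SQLite as a stand-in for PostgreSQL and a simplified, hypothetical dataset_fields table. The statement is illustrative only and is not the actual query in the Marquez DAO.

```python
import sqlite3

# Simplified, hypothetical schema; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
    """
)

# A NULL type is replaced by a sentinel before it reaches the table, so the
# unique constraint compares 'UNKNOWN' with 'UNKNOWN' and the conflict fires.
FIXED_UPSERT = """
    INSERT INTO dataset_fields (dataset_uuid, name, type)
    VALUES (:dataset_uuid, :name, COALESCE(:type, 'UNKNOWN'))
    ON CONFLICT (dataset_uuid, name, type)
    DO UPDATE SET type = excluded.type
"""

# The same untyped field arrives 10 times.
for _ in range(10):
    conn.execute(
        FIXED_UPSERT,
        {"dataset_uuid": "ds-1", "name": "modified_at", "type": None},
    )

rows = conn.execute("SELECT name, type FROM dataset_fields").fetchall()
print(rows)
```

With the sentinel in place, the ten events collapse into a single row. Note that this only prevents new duplicates; rows already duplicated would presumably still need a separate cleanup.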
| DatasetFieldRow upsert( |