Skip to content

[Bug] dataset_fields with NULL type causes explosion in column_lineage #3083

@bsmin10010-cloud

Description

@bsmin10010-cloud

Description

I have identified a critical bug where dataset_fields are duplicated indefinitely if the field type is null (or omitted). This behavior triggers a cascading, exponential explosion of rows in the column_lineage table, leading to severe database performance degradation.

Root Cause

The issue stems from the INSERT statement for dataset_fields within the PostgreSQL ON CONFLICT clause:

INSERT INTO dataset_fields ...
ON CONFLICT(dataset_uuid, name, type) DO UPDATE ...

In SQL standards, NULL != NULL. Therefore, if a field's type is provided as NULL, the Unique Constraint fails to identify the existing row. As a result, Marquez treats every incoming event for such fields as a "new" entry and performs an INSERT instead of an UPDATE.

Impact: Exponential Data Explosion

This bug turns a linear data ingestion process into an exponential one.

  1. Field Duplication:

    • A single field (e.g., modified_at) can create thousands of duplicate rows in the dataset_fields table over time if the type is consistently null.
  2. Lineage Explosion (The Multiplier Effect):

    • When a job runs to link this duplicated input field to an output field, Marquez generates lineage edges for ALL duplicates.
    • Expected: 1 Input Field UUID → 1 Output Field UUID = 1 row in column_lineage.
    • Actual: N Input Field UUIDs → 1 Output Field UUID = N rows in column_lineage.
  3. Consequence:

    • In my environment, I observed that the column_lineage table contained over 1,000x more rows than necessary.
    • Nearly 99.9% of the lineage data was garbage caused by this duplication.
    • This massive overhead causes API timeouts when querying lineage and significantly increases database storage usage.

Steps to Reproduce

  1. Emit an OpenLineage event for a dataset where a field's type is explicitly null.
  2. Emit the same event multiple times (e.g., 10 times).
  3. Check the dataset_fields table; you will see 10 duplicate rows for the same field name.
  4. Emit a columnLineage event involving that field.
  5. Check the column_lineage table; you will see 10 lineage rows created for what should be a single relationship.

Suggested Fix

Modify the INSERT query in the DAO layer to handle NULL types explicitly.

Using COALESCE(:type, 'UNKNOWN') (or a similar default value) ensures that the ON CONFLICT clause correctly matches existing rows and performs an update instead of a duplicate insert.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions