Releases: MTSWebServices/data-rentgen

0.5.1 (2026-04-07)

07 Apr 15:28
d536bb5

Features

Add support for OpenLineage integration for StarRocks (proprietary, part of MWS Data Engine) (#424).

Improvements

  • Added indirect job dependencies inferred from lineage relations (#416).

    GET /v1/jobs/hierarchy now has additional query parameters:

    • infer_from_lineage: bool = False
    • since: datetime | None = None
    • until: datetime | None = None

    If infer_from_lineage=True and since is set, then job -> dataset -> job lineage relations are treated as indirect job -> job dependency relations.

  • Improve GET /v1/jobs/hierarchy graph with depth>1 (#423).

    Previously, only the first level of the hierarchy graph was used to get job ancestors/descendants. Now each hierarchy level includes these relations, including lineage-inferred jobs.

  • Added optional external_id and external_url fields to datasets (#432).

    These fields allow linking Data.Rentgen datasets with third-party systems, like MWS Data Cat (proprietary). They are not filled in automatically by the consumer.
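The lineage-based inference described in the first item above (#416) can be pictured as a join over lineage edges: when one job writes a dataset and another job reads it within the requested time window, the pair is treated as an indirect job -> job dependency. The following is only an illustrative sketch, not Data.Rentgen's implementation. The `LineageEdge` type, its field names, the `infer_job_dependencies` helper and the inclusive window boundaries are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEdge:
    """Hypothetical flattened lineage relation between a job and a dataset."""

    job_id: int
    dataset_id: int
    direction: str  # "output" (job -> dataset) or "input" (dataset -> job)
    at: datetime


def infer_job_dependencies(edges, since, until=None):
    """Return (upstream_job_id, downstream_job_id) pairs linked through a dataset."""
    # Keep only edges inside the time window (inclusive boundaries are assumed).
    in_window = [
        e for e in edges
        if e.at >= since and (until is None or e.at <= until)
    ]

    # dataset_id -> set of jobs that wrote to it
    writers = {}
    for e in in_window:
        if e.direction == "output":
            writers.setdefault(e.dataset_id, set()).add(e.job_id)

    # Every reader of a dataset depends on every other job that wrote it.
    inferred = set()
    for e in in_window:
        if e.direction == "input":
            for writer in writers.get(e.dataset_id, ()):
                if writer != e.job_id:  # skip self-loops
                    inferred.add((writer, e.job_id))
    return inferred


ts = datetime(2026, 4, 1, tzinfo=timezone.utc)
edges = [
    LineageEdge(job_id=1, dataset_id=10, direction="output", at=ts),
    LineageEdge(job_id=2, dataset_id=10, direction="input", at=ts),
    LineageEdge(job_id=3, dataset_id=11, direction="input", at=ts),
]
inferred = infer_job_dependencies(edges, since=datetime(2026, 1, 1, tzinfo=timezone.utc))
```

Here job 2 reads the dataset written by job 1, so the only inferred pair is `(1, 2)`; job 3 reads a dataset nobody wrote in the window and produces no dependency.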

0.5.0 (2026-03-19)

19 Mar 15:12
3ed8514

OpenLineage-related features

Extracting dataset & job tags

#367, #368, #369, #372

  • Data.Rentgen now extracts tags from OpenLineage events:

    • dataset tags (currently not reported by any integration)
    • job & run tags
  • Some tags are created based on engine versions:

    • airflow.version
    • dbt.version
    • flink.version
    • hive.version
    • spark.version
    • openlineage_adapter.version
    • openlineage_client.version (only for Python client v1.38.0 or higher)

Note that passing job & run tags depends on the integration. For example, tags can be set up for Spark, Airflow and dbt, but not for Flink or Hive. Tags are also configured differently in each integration.

Extracting nominalTime

#378

Data.Rentgen now extracts the nominalTime run facet and stores its values in the run.expected_start_at and run.expected_end_at fields.
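As a rough illustration of this mapping, the sketch below reads the `nominalStartTime` and `nominalEndTime` properties of the OpenLineage `nominalTime` run facet from a plain event dict. It is not the Data.Rentgen extractor: the helper names and the sample event are assumptions, and the returned tuple merely stands in for the expected_start_at and expected_end_at values.

```python
from datetime import datetime, timezone


def parse_openlineage_time(value):
    """Parse an ISO 8601 timestamp, mapping a trailing 'Z' to an explicit UTC offset."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def extract_nominal_time(event):
    """Return (expected_start_at, expected_end_at) from a run event dict."""
    facet = event.get("run", {}).get("facets", {}).get("nominalTime") or {}
    return (
        parse_openlineage_time(facet.get("nominalStartTime")),
        parse_openlineage_time(facet.get("nominalEndTime")),  # optional in the facet
    )


# Hypothetical, heavily trimmed run event.
event = {
    "run": {
        "runId": "01908224-8410-79a2-8de6-a769ad6944c9",
        "facets": {
            "nominalTime": {
                "nominalStartTime": "2026-03-19T00:00:00Z",
                "nominalEndTime": "2026-03-20T00:00:00Z",
            }
        },
    }
}
expected_start_at, expected_end_at = extract_nominal_time(event)
```

Events without the facet simply yield `(None, None)` in this sketch.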

Extracting jobDependencies

#402

Data.Rentgen now extracts information from the jobDependencies facet and stores it in the job_dependency table. For now this is just a simple tuple from_dataset_id, to_dataset_id, type (an arbitrary string provided by the integration, not an enum). This may change in future versions of Data.Rentgen.

Currently the only integration providing this kind of information is Airflow, and only in the most recent versions of the OpenLineage provider for Airflow (2.10 or higher). For now the provider also doesn't send the facet with information about direct task -> task dependencies; only indirect ones (declared via Asset) are included. So there is a fallback for Airflow which extracts these dependencies from the downstream_task_ids and upstream_task_ids task fields.
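The fallback idea can be sketched without Airflow itself: each task lists the ids of its upstream and downstream neighbours, and the union of both views gives the direct task -> task edges. The dict-based task shape and the `task_dependency_edges` helper below are hypothetical and only mirror the two field names mentioned above; the actual extraction logic in Data.Rentgen may differ.

```python
def task_dependency_edges(tasks):
    """Build (upstream_task_id, downstream_task_id) pairs from per-task id collections."""
    edges = set()
    for task in tasks:
        task_id = task["task_id"]
        for upstream in task.get("upstream_task_ids", ()):
            edges.add((upstream, task_id))
        for downstream in task.get("downstream_task_ids", ()):
            edges.add((task_id, downstream))
    # A set deduplicates edges that are reported from both ends.
    return edges


tasks = [
    {"task_id": "extract", "upstream_task_ids": set(), "downstream_task_ids": {"transform"}},
    {"task_id": "transform", "upstream_task_ids": {"extract"}, "downstream_task_ids": {"load"}},
    {"task_id": "load", "upstream_task_ids": {"transform"}, "downstream_task_ids": set()},
]
edges = task_dependency_edges(tasks)
```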

REST API features

Added GET /v1/jobs/hierarchy endpoint

This endpoint can be used to retrieve the job hierarchy graph (parents, dependencies) for a given job. (#407, #412)

Response example
{
    "relations": {
        "parents": [
            {
                "from": {"kind": "JOB", "id": "1"},
                "to": {"kind": "JOB", "id": "2"}
            }
        ],
        "dependencies": [
            {
                "from": {"kind": "JOB", "id": "3"},
                "to": {"kind": "JOB", "id": "1"},
                "type": "DIRECT_DEPENDENCY"
            },
            {
                "from": {"kind": "JOB", "id": "1"},
                "to": {"kind": "JOB", "id": "4"},
                "type": "DIRECT_DEPENDENCY"
            }
        ]
    },
    "nodes": {
        "jobs": {
            "1": {
                "id": 1,
                "parent_job_id": null,
                "name": "my_job",
                "type": "SPARK_APPLICATION",
                "location": {
                    "name": "my_cluster",
                    "type": "YARN"
                }
            },
            "2": {
                "id": 2,
                "parent_job_id": 1,
                "name": "my_job.child_task",
                "type": "SPARK_APPLICATION",
                "location": {
                    "name": "my_cluster",
                    "type": "YARN"
                }
            },
            "3": {
                "id": 3,
                "parent_job_id": null,
                "name": "source_job",
                "type": "SPARK_APPLICATION",
                "location": {
                    "name": "my_cluster",
                    "type": "YARN"
                }
            },
            "4": {
                "id": 4,
                "parent_job_id": null,
                "name": "target_job",
                "type": "SPARK_APPLICATION",
                "location": {
                    "name": "my_cluster",
                    "type": "YARN"
                }
            }
        }
    }
}
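A client might resolve the ids in `relations` against `nodes.jobs` to get readable edges. The sketch below works on a trimmed copy of the response above; the `describe_dependencies` helper is hypothetical and assumes that every id referenced in `relations.dependencies` is present in `nodes.jobs`.

```python
def describe_dependencies(response):
    """Render each dependency relation as 'source -> target (type)' using job names."""
    jobs = response["nodes"]["jobs"]
    lines = []
    for relation in response["relations"]["dependencies"]:
        source = jobs[relation["from"]["id"]]["name"]
        target = jobs[relation["to"]["id"]]["name"]
        lines.append(f"{source} -> {target} ({relation['type']})")
    return lines


# Trimmed copy of the response example above.
response = {
    "relations": {
        "parents": [{"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "JOB", "id": "2"}}],
        "dependencies": [
            {"from": {"kind": "JOB", "id": "3"}, "to": {"kind": "JOB", "id": "1"}, "type": "DIRECT_DEPENDENCY"},
            {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "JOB", "id": "4"}, "type": "DIRECT_DEPENDENCY"},
        ],
    },
    "nodes": {
        "jobs": {
            "1": {"id": 1, "name": "my_job"},
            "2": {"id": 2, "name": "my_job.child_task"},
            "3": {"id": 3, "name": "source_job"},
            "4": {"id": 4, "name": "target_job"},
        }
    },
}
lines = describe_dependencies(response)
```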

Added parent relation between jobs

Jobs can now reference a parent job via the parent_job_id field. (#394)

Before:

Response example
{
    "meta": { ... },
    "items": [
        {
            "id": "42",
            "data": {
                "id": "42",
                "name": "my-spark-task",
                "type": "SPARK_APPLICATION",
                "location": { ... }
            }
        }
    ]
}

After:

Response example
{
    "meta": { ... },
    "items": [
        {
            "id": "42",
            "data": {
                "id": "42",
                "name": "my-spark-task",
                "type": "SPARK_APPLICATION",
                "location": { ... },
                "parent_job_id": "10"
            }
        }
    ]
}
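With the new field, a client can group jobs under their parents without extra requests. The following sketch assumes the item shape from the example above; the `children_by_parent` helper and the second sample item are hypothetical.

```python
from collections import defaultdict


def children_by_parent(items):
    """Map parent_job_id -> list of child job ids; jobs without a parent are skipped."""
    children = defaultdict(list)
    for item in items:
        data = item["data"]
        parent_id = data.get("parent_job_id")
        if parent_id is not None:
            children[parent_id].append(data["id"])
    return dict(children)


items = [
    {"id": "42", "data": {"id": "42", "name": "my-spark-task", "parent_job_id": "10"}},
    {"id": "10", "data": {"id": "10", "name": "my-parent-job", "parent_job_id": None}},
]
grouped = children_by_parent(items)
```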

Added JOB-JOB and RUN-RUN relations to lineage API

For example, it is possible to get Airflow DAG → Airflow Task → Spark app chain from a single response. (#392, #399, #401)

Before:

Response example
{
    "relations": {
        "parents": [
            {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "RUN", "id": "parent-run-uuid"}},
            {"from": {"kind": "JOB", "id": "2"}, "to": {"kind": "RUN", "id": "run-uuid"}}
        ],
        "symlinks": [],
        "inputs": [...],
        "outputs": [...]
    },
    "nodes": {...}
}

After:

Response example
{
    "relations": {
        "parents": [
            {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "RUN", "id": "parent-run-uuid"}},
            {"from": {"kind": "JOB", "id": "2"}, "to": {"kind": "RUN", "id": "run-uuid"}},
            # NEW:
            {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "JOB", "id": "2"}},
            {"from": {"kind": "RUN", "id": "parent-run-uuid"}, "to": {"kind": "RUN", "id": "run-uuid"}}
        ],
        "symlinks": [],
        "inputs": [...],
        "outputs": [...]
    },
    "nodes": {...}
}
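To pick out only the new relations on the client side, one option is to filter `relations.parents` by the kind of both endpoints. This is a minimal sketch against the "After" example; the `same_kind_parents` helper is hypothetical.

```python
def same_kind_parents(parents, kind):
    """Return (from_id, to_id) pairs where both endpoints have the given kind."""
    return [
        (relation["from"]["id"], relation["to"]["id"])
        for relation in parents
        if relation["from"]["kind"] == kind and relation["to"]["kind"] == kind
    ]


# The "parents" list from the "After" example above.
parents = [
    {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "RUN", "id": "parent-run-uuid"}},
    {"from": {"kind": "JOB", "id": "2"}, "to": {"kind": "RUN", "id": "run-uuid"}},
    {"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "JOB", "id": "2"}},
    {"from": {"kind": "RUN", "id": "parent-run-uuid"}, "to": {"kind": "RUN", "id": "run-uuid"}},
]
job_chain = same_kind_parents(parents, "JOB")
run_chain = same_kind_parents(parents, "RUN")
```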

Include job in GET /v1/runs response

This makes it possible to show the job type & name for a specific run without sending additional requests. #411

Before:

Response example
{
    "meta": {
        "page": 1,
        "page_size": 20,
        "total_count": 1,
        "pages_count": 1,
        "has_next": false,
        "has_previous": false,
        "next_page": null,
        "previous_page": null
    },
    "items": [
        {
            "id": "01908224-8410-79a2-8de6-a769ad6944c9",
            "data": {
                "id": "01908224-8410-79a2-8de6-a769ad6944c9",
                "created_at": "2024-07-05T09:05:49.584000",
                "job_id": "123",
                ...
            },
            "statistics": { ... }
        }
    ]
}

After:

Response example
{
    "meta": {
        "page": 1,
        "page_size": 20,
        "total_count": 1,
        "pages_count": 1,
        "has_next": false,
        "has_previous": false,
        "next_page": null,
        "previous_page": null
    },
    "items": [
        {
            "id": "01908224-8410-79a2-8de6-a769ad6944c9",
            "data": {
                "id": "01908224-8410-79a2-8de6-a769ad6944c9",
                "created_at": "2024-07-05T09:05:49.584000",
                "job_id": "123",
                ...
            },
            "job": {
                "id": "123",
                "name": "myjob",
                ...
            },
            "statistics": { ... }
        }
    ]
}

Include last_run field in GET /v1/jobs response

This makes it possible to show the last start time, status and duration for each job in the list without additional requests. #387

Before:

Response example
{
    "meta": { ... },
    "items": [
        {
            "id": "42",
            "data": {
                "id": "42",
                "name": "my-spark-task",
                "type": "SPARK_APPLICATION",
                "location": { ... },
                "parent_job_id": "10"
            }
        }
    ]
}

After:

Response example
{
    "meta": { ... },
    "items": [
        {
            "id": "42",
            "data": {
                "id": "42",
                "name": "my-spark-task",
                "type": "SPARK_APPLICATION",
                "location": { ... },
                "parent_job_id": "10"
            },
            "last_run": {
                "id": "01908224-8410-79a2-8de6-a769ad6944c9",
                "created_at": "2024-07-05T09:05:49.584000",
                "job_id": "42",
                ...
            }
        }
    ]
}

0.4.8 (2026-01-26)

26 Jan 08:56
8a00eda

Fixed issue with updating Location's external_id field - server returned response code 200 but ignored the input value.

0.4.7 (2026-01-20)

20 Jan 13:52
9bb852a

Dependency-only updates.

0.4.6 (2026-01-12)

12 Jan 14:23
0047ff9

Dependency-only updates.

0.4.5 (2025-12-24)

24 Dec 15:48
02b53ee

Improvements

Allow disabling SessionMiddleware, as it is only required by KeycloakAuthProvider.

0.4.4 (2025-11-21)

21 Nov 16:51
d76fdb5

Bug Fixes

  • The 0.4.3 release broke inputs with 0-byte statistics; this is now fixed.

0.4.3 (2025-11-21)

21 Nov 15:52
04d73bb

Features

  • Disable server.session.enabled by default. It is required only by KeycloakAuthProvider which is not used by default.

Bug Fixes

  • Escape unprintable ASCII symbols in SQL queries before storing them in Postgres. Previously, saving queries containing the \x00 symbol led to exceptions.
  • Kafka topic with malformed messages doesn't have to use the same number of partitions as input topics.
  • Prevent OpenLineage from reporting events which claim to read 8 Exabytes of data; this is actually a Spark quirk.
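For the first fix in the list above, the underlying constraint is that Postgres text values cannot contain a NUL (\x00) character. One way to make such queries storable is to replace control characters with a visible escape sequence, as in the sketch below. This is an assumption about the general approach, not the actual Data.Rentgen code: the exact set of escaped characters, the output format and the `escape_unprintable` helper are all hypothetical.

```python
import re

# C0 control characters except tab, newline and carriage return, plus DEL.
# Which characters get escaped is an assumption made for this example.
_UNPRINTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def escape_unprintable(query):
    """Replace each unprintable ASCII character with a literal \\xNN sequence."""
    return _UNPRINTABLE.sub(lambda match: f"\\x{ord(match.group()):02x}", query)


raw_query = "SELECT 'a\x00b'\nFROM t"
safe_query = escape_unprintable(raw_query)
```

In this sketch newlines and tabs are left untouched so that multi-line SQL stays readable, while the NUL character becomes the four visible characters `\x00`.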

0.4.2 (2025-10-29)

29 Oct 15:32
bb01ca3

Bug fixes

  • Fix search query filter on UI Run list page.
  • Fix passing multiple filters to GET /v1/runs.

Doc only Changes

  • Document DATA_RENTGEN__UI__AUTH_PROVIDER config variable.

0.4.1 (2025-10-08)

08 Oct 14:15
c5a2ade

Features

  • Add new GET /v1/locations/types endpoint returning a list of all known location types. (#328)

  • Add new filter to GET /v1/jobs (#328):

    • location_type: list[str]
  • Add new filter to GET /v1/datasets (#328):

    • location_type: list[str]
  • Allow passing multiple location_type filters to GET /v1/locations. (#328)

  • Allow passing multiple values to GET endpoints with filters like job_id, parent_run_id, and so on. (#329)
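A client could pass several values for one filter by repeating the query parameter. The sketch below builds such a query string with the standard library. That the API expects repeated keys (rather than, say, comma-separated values) is an assumption here, and the `build_query` helper and the sample filter values are hypothetical.

```python
from urllib.parse import urlencode


def build_query(path, **filters):
    """Append filters to a path; list values become repeated query parameters."""
    query = urlencode(filters, doseq=True)
    return f"{path}?{query}" if query else path


jobs_url = build_query("/v1/jobs", location_type=["yarn", "kafka"])
runs_url = build_query("/v1/runs", job_id=[1, 2])
```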