[SPARK-54358][SDP] Checkpoint dirs collide when streaming tables in different schemas have same name #53070

sryza · 2025-11-14T17:03:51Z

What changes were proposed in this pull request?

Updates the per-streaming-table checkpoint path to use the fully qualified table name instead of just the bare table name.

Why are the changes needed?

A streaming table is a table fed by a stream. Streaming tables have checkpoint directories underneath their pipeline's storage root. These directories don't currently take the table namespace into account, so two tables with the same name in different schemas are mapped to the same checkpoint directory. Two independent streams would then read and write the same checkpoint state, which could cause data loss.
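To make the collision concrete, here is a minimal sketch that compares the two keying schemes. It is not the Spark code: `TableId` is a hypothetical stand-in for the real identifier type, the catalog and schema names are invented, and the separator is hard-coded to `"/"` (the value of Hadoop's `Path.SEPARATOR`) so the example needs no Hadoop dependency.

```scala
// Hypothetical stand-in for the table identifier; only nameParts matters here.
final case class TableId(nameParts: Seq[String]) {
  // The bare table name is the last name part.
  def table: String = nameParts.last
}

object CheckpointKeyDemo {
  // Assumed to match Hadoop's Path.SEPARATOR.
  val Separator = "/"

  // Prior behavior: the directory key is the bare table name.
  def nameOnlyKey(id: TableId): String = id.table

  // New behavior: the directory key is every name part joined by the separator.
  def qualifiedKey(id: TableId): String = id.nameParts.mkString(Separator)

  def main(args: Array[String]): Unit = {
    val bronze = TableId(Seq("my_catalog", "bronze", "events"))
    val silver = TableId(Seq("my_catalog", "silver", "events"))

    // Both tables are keyed by "events", so they would share a directory.
    println(nameOnlyKey(bronze) == nameOnlyKey(silver)) // prints true

    // The qualified keys differ in the schema component.
    println(qualifiedKey(bronze) == qualifiedKey(silver)) // prints false
  }
}
```

With the name-only key, both tables resolve to the same directory; with the qualified key, the schema component keeps them apart.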

Does this PR introduce any user-facing change?

Yes, but for an unreleased feature.

How was this patch tested?

Added a test for the collision case. Verified that it fails with the prior logic and now passes.

Was this patch authored or co-authored using generative AI tooling?

dongjoon-hyun · 2025-11-14T18:42:55Z

sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/SystemMetadata.scala

      graph.sink.contains(flow.destinationIdentifier)) {
      val checkpointRoot = new Path(context.storageRoot, "_checkpoints")
-      val flowTableName = flow.destinationIdentifier.table
+      val flowTableId = flow.destinationIdentifier.nameParts.mkString(Path.SEPARATOR)


`*.nameParts.mkString(Path.SEPARATOR)` looks a little tacky to me. Could you add a more detailed function comment about this at line 41?

Does this work for Apache Iceberg database and tables, too?

👍 will add a comment.

This is agnostic to the table format; it's just controlling the directory that we're storing streaming checkpoints in. That directory is keyed by the table name, but we're not actually putting it inside the table directory.
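As a rough illustration of why the table format does not matter, the sketch below derives a checkpoint directory from the storage root and the name parts only. It uses `java.nio.file.Path` rather than Hadoop's `Path` to stay dependency-free, and `TableSpec`, its `format` and `location` fields, and the example paths are all hypothetical. The real code may append further components under this directory.

```scala
import java.nio.file.{Path, Paths}

object CheckpointLayoutDemo {
  // Hypothetical table description. format and location are present only to
  // show that the checkpoint directory does not depend on them.
  final case class TableSpec(nameParts: Seq[String], format: String, location: String)

  // <storageRoot>/_checkpoints/<part1>/<part2>/...
  def checkpointDir(storageRoot: Path, table: TableSpec): Path =
    table.nameParts.foldLeft(storageRoot.resolve("_checkpoints")) { (dir, part) =>
      dir.resolve(part)
    }

  def main(args: Array[String]): Unit = {
    val root = Paths.get("/pipelines/storage")
    val iceberg = TableSpec(Seq("cat", "db", "events"), "iceberg", "/warehouse/db/events")
    val parquet = iceberg.copy(format = "parquet", location = "/elsewhere/events")

    // Same identifier, different format and data location: same checkpoint directory.
    println(checkpointDir(root, iceberg) == checkpointDir(root, parquet)) // prints true
  }
}
```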

We can do it later as a follow-up. Let me merge this first.

👍 here's the followup: #53089. I slotted it under the same JIRA because it's a continuation of this PR, but let me know if it's preferred to file a new JIRA.

dongjoon-hyun · 2025-11-14T18:43:37Z

sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/SystemMetadataSuite.scala

  ): Path = {
    val expectedRawCheckPointDir = tableOrSinkElement match {
      case t if t.isInstanceOf[Table] || t.isInstanceOf[Sink] =>
+        val tableId = t.identifier.nameParts.mkString(Path.SEPARATOR)


If we need this in multiple places, we had better add a utility method for this conversion to be safe, @sryza.
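One possible shape for such a utility is sketched below. This is an assumption about what a shared helper could look like, not the code from the follow-up PR: `CheckpointPaths` and `relativeCheckpointPath` are hypothetical names, and the separator check is an extra safety measure that the original change does not include. It is there because a plain join maps `Seq("a/b", "c")` and `Seq("a", "b", "c")` to the same string.

```scala
object CheckpointPaths {
  // Assumed to match Hadoop's Path.SEPARATOR.
  private val Separator = "/"

  /**
   * Hypothetical helper: builds the checkpoint path of a table, relative to the
   * checkpoint root, from the table's fully qualified name parts.
   */
  def relativeCheckpointPath(nameParts: Seq[String]): String = {
    require(nameParts.nonEmpty, "identifier must have at least one name part")
    // Reject parts that would change the directory depth once joined.
    require(
      nameParts.forall(part => part.nonEmpty && !part.contains(Separator)),
      s"name parts must be non-empty and must not contain '$Separator': $nameParts")
    nameParts.mkString(Separator)
  }
}
```

Both the production code and the test suite could then call the same method, so the two cannot drift apart.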

dongjoon-hyun

+1, LGTM. Thank you, @sryza .

Merged to master/4.1 for Apache Spark 4.1.0.

…ifferent schemas have same name

### What changes were proposed in this pull request?

Updates the per-streaming table checkpoint path to use the fully qualified table path, instead of just its name.

### Why are the changes needed?

A streaming table is a table fed by a stream. Streaming tables have checkpoint directories underneath their pipeline's storage root. These directories don't currently take the table namespace into account, which means that two tables with different schemas but the same name will be mapped to the same checkpoint directory. This could be very bad and cause data loss.

### Does this PR introduce _any_ user-facing change?

Yes, but for an unreleased feature.

### How was this patch tested?

Added a test for the collision case. Verified that it fails with the prior logic and now passes.

### Was this patch authored or co-authored using generative AI tooling?

Closes #53070 from sryza/collide.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit e09c999)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

collide

ead1159

github-actions bot added the SQL label Nov 14, 2025

dongjoon-hyun reviewed Nov 14, 2025

dongjoon-hyun approved these changes Nov 15, 2025

dongjoon-hyun closed this in e09c999 Nov 15, 2025

sryza mentioned this pull request Nov 17, 2025

[SPARK-54358][SDP][FOLLOWUP] Improve code clarity of SDP checkpoint path construction #53089

Open
