[SPARK-54358][SDP] Checkpoint dirs collide when streaming tables in different schemas have same name #53070
graph.sink.contains(flow.destinationIdentifier)) {
val checkpointRoot = new Path(context.storageRoot, "_checkpoints")
val flowTableName = flow.destinationIdentifier.table
val flowTableId = flow.destinationIdentifier.nameParts.mkString(Path.SEPARATOR)
*.nameParts.mkString(Path.SEPARATOR) looks a little tacky to me. Could you add a more detailed function comment about this at line 41?
Does this work for Apache Iceberg database and tables, too?
👍 will add a comment.
This is agnostic to the table format; it's just controlling the directory that we're storing streaming checkpoints in. That directory is keyed by the table name, but we're not actually putting it inside the table directory.
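As a rough illustration of the layout described in this reply, here is a minimal sketch. `TableId` and `CheckpointLayout` are hypothetical stand-ins, not Spark classes, and the separator is hard-coded to `/` (the value of Hadoop's `Path.SEPARATOR`) so the snippet runs without Hadoop on the classpath. The real code may append further path components.

```scala
// Hypothetical stand-in for a multi-part table identifier; not Spark's actual class.
final case class TableId(nameParts: Seq[String])

object CheckpointLayout {
  // Hadoop's Path.SEPARATOR is "/"; hard-coded here to avoid the dependency.
  private val Separator = "/"

  /** Builds <storageRoot>/_checkpoints/<part1>/<part2>/... for a table. */
  def checkpointDir(storageRoot: String, id: TableId): String = {
    val root = storageRoot.stripSuffix(Separator)
    (Seq(root, "_checkpoints") ++ id.nameParts).mkString(Separator)
  }
}
```

Only the pipeline's storage root and the identifier's name parts feed into the result. The table's own storage location never appears, which is why the table format does not matter here.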
We can do it later as a follow-up. Let me merge this first.
👍 here's the followup: #53089. I slotted it under the same JIRA because it's a continuation of this PR, but let me know if it's preferred to file a new JIRA.
): Path = {
val expectedRawCheckPointDir = tableOrSinkElement match {
case t if t.isInstanceOf[Table] || t.isInstanceOf[Sink] =>
val tableId = t.identifier.nameParts.mkString(Path.SEPARATOR)
If we need this in multiple places, we had better add a utility method for this conversion to be safe, @sryza .
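One possible shape for such a utility is sketched below. The object and method names are hypothetical, and the validation rules (no empty parts, no separator inside a part) are an assumption about what "safe" could mean here; none of this is taken from the PR or the follow-up.

```scala
object IdentifierPaths {
  // Same value as Hadoop's Path.SEPARATOR; hard-coded to keep the sketch dependency-free.
  private val Separator = "/"

  /** Joins identifier name parts into a relative path, rejecting parts that would alter the layout. */
  def toRelativePath(nameParts: Seq[String]): String = {
    require(nameParts.nonEmpty, "identifier must have at least one name part")
    require(
      nameParts.forall(p => p.nonEmpty && !p.contains(Separator)),
      s"name parts must be non-empty and must not contain the separator: ${nameParts.mkString(".")}")
    nameParts.mkString(Separator)
  }
}
```

Centralizing the conversion means the production code and the test helper derive the directory the same way, so the two cannot drift apart.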
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @sryza .
Merged to master/4.1 for Apache Spark 4.1.0.
…ifferent schemas have same name

### What changes were proposed in this pull request?
Updates the per-streaming table checkpoint path to use the fully qualified table path, instead of just its name.

### Why are the changes needed?
A streaming table is a table fed by a stream. Streaming tables have checkpoint directories underneath their pipeline's storage root. These directories don't currently take the table namespace into account, which means that two tables with different schemas but the same name will be mapped to the same checkpoint directory. This could be very bad and cause data loss.

### Does this PR introduce _any_ user-facing change?
Yes, but for an unreleased feature.

### How was this patch tested?
Added a test for the collision case. Verified that it fails with the prior logic and now passes.

### Was this patch authored or co-authored using generative AI tooling?

Closes #53070 from sryza/collide.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit e09c999)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Updates the per-streaming table checkpoint path to use the fully qualified table path, instead of just its name.
Why are the changes needed?
A streaming table is a table fed by a stream. Streaming tables have checkpoint directories underneath their pipeline's storage root. These directories don't currently take the table namespace into account, which means that two tables in different schemas but with the same name will be mapped to the same checkpoint directory. This could cause data loss.
Does this PR introduce any user-facing change?
Yes, but for an unreleased feature.
How was this patch tested?
Added a test for the collision case. Verified that it fails with the prior logic and now passes.
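The collision case can be reproduced in isolation with a small sketch like the one below. This is not the test added in the PR; `CollisionSketch`, `QualifiedName`, `nameOnlyDir`, and `qualifiedDir` are hypothetical names, and plain strings stand in for Hadoop `Path` objects.

```scala
object CollisionSketch {
  // Hypothetical stand-in for a multi-part table identifier.
  final case class QualifiedName(nameParts: Seq[String]) {
    def table: String = nameParts.last
  }

  // Prior behaviour (as described above): keyed by the bare table name only.
  def nameOnlyDir(root: String, id: QualifiedName): String =
    s"$root/_checkpoints/${id.table}"

  // Updated behaviour: keyed by every part of the identifier.
  def qualifiedDir(root: String, id: QualifiedName): String =
    s"$root/_checkpoints/${id.nameParts.mkString("/")}"

  def main(args: Array[String]): Unit = {
    val root = "/tmp/pipeline"
    val a = QualifiedName(Seq("catalog", "schema_a", "events"))
    val b = QualifiedName(Seq("catalog", "schema_b", "events"))

    // Name-only keying sends both tables to the same directory.
    assert(nameOnlyDir(root, a) == nameOnlyDir(root, b))
    // Fully qualified keying keeps them apart.
    assert(qualifiedDir(root, a) != qualifiedDir(root, b))
  }
}
```

The first assertion captures why a test of this kind fails under the prior logic; the second captures why it passes once the namespace is part of the path.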
Was this patch authored or co-authored using generative AI tooling?