Fail retention app when the columnPattern mismatch partition spec for… #552
jiang95-dev wants to merge 1 commit into linkedin:main
TableScan scan = table.newScan().filter(filter);
try (CloseableIterable<FileScanTask> filesIterable = scan.planFiles()) {
  List<FileScanTask> filesList = Lists.newArrayList(filesIterable);
  for (FileScanTask task : filesList) {
Nit: I would use the filesList.exists functional pattern here.
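A minimal sketch of what the functional form could look like in Java. Java's `List` has no `exists` method; the closest equivalent is `Stream.anyMatch`, which stops at the first match. The `Task` class below is a hypothetical stand-in for Iceberg's `FileScanTask`, and the string `"true"` stands in for `Expressions.alwaysTrue()`; neither is taken from the PR.

```java
import java.util.Arrays;
import java.util.List;

public class ResidualScan {
    // Hypothetical stand-in for FileScanTask; only the residual matters here.
    static final class Task {
        final String path;
        final String residual; // "true" models Expressions.alwaysTrue()

        Task(String path, String residual) {
            this.path = path;
            this.residual = residual;
        }
    }

    // Equivalent of an "exists" check: true if any task still carries a residual filter.
    static boolean anyResidual(List<Task> tasks) {
        return tasks.stream().anyMatch(t -> !"true".equals(t.residual));
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
            new Task("a.parquet", "true"),
            new Task("b.parquet", "ts < 100"));
        if (anyResidual(tasks)) {
            System.out.println("delete is not metadata-only");
        }
    }
}
```

One trade-off: the explicit loop can name the offending file in the exception message, whereas `anyMatch` only reports that one exists. `filter(...).findFirst()` would keep that detail.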
  List<FileScanTask> filesList = Lists.newArrayList(filesIterable);
  for (FileScanTask task : filesList) {
    if (task.residual() != Expressions.alwaysTrue()) {
      throw new IllegalStateException(
Where are we monitoring for this?
mkuchenbecker left a comment
- How do we know when this is failing?
- When this is failing, what is the impact on the customer's limited retention? What is the action item for the customer / us?
This will add toil to operations. Operations is already at peak load, so we can't choose to add more toil.
Why is it toilsome?
- The user calls the SET POLICY DDL.
- 1 out of 60,000 async jobs starts failing, which is impossible to alert on.
- Many weeks go by.
- The user raises a ticket about the mismatch between their policy and their data.
- The problem is further obscured when the retention policy comes from the UI instead of table properties.
- The on-call has to dig through the retention app logs to find this exception in the Spark driver logs.
Multiply that ticket burden across the N tables that have a retention policy on a non-partitioned column.
Instead, this needs to be fixed at the root cause:
"Sometimes users use a non-partitioned column for retention policy"
Somewhere there is a bug:
a) Either explicitly disallow the SET POLICY ... retention DDL on a non-partition column,
b) or, if that is already done, how are these tables leaking through? It must be an edge case we are not handling, e.g. a gap in server validation.
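Option (a) could be sketched roughly as follows. This is an illustration only: `RetentionPolicyValidator`, `validateRetentionColumn`, and the column names are hypothetical and do not come from the codebase. It assumes the server can look up the table's partition column names when the policy is set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RetentionPolicyValidator {
    // Hypothetical DDL-time check: reject a retention column that is not a partition column.
    static void validateRetentionColumn(String retentionColumn, Set<String> partitionColumns) {
        if (!partitionColumns.contains(retentionColumn)) {
            throw new IllegalArgumentException(
                "Retention column '" + retentionColumn
                    + "' is not a partition column; expected one of " + partitionColumns);
        }
    }

    public static void main(String[] args) {
        // Assumed example table partitioned by a single column.
        Set<String> partitionColumns = new HashSet<>(Arrays.asList("datepartition"));
        validateRetentionColumn("datepartition", partitionColumns); // accepted
        try {
            validateRetentionColumn("event_time", partitionColumns);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the policy when it is set surfaces the error to the user immediately, rather than weeks later in an async job's driver logs.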
Summary
Fail the retention app when the columnPattern does not match the partition spec for HCR tables. Sometimes users use a non-partitioned column for the retention policy, which causes the retention job to rewrite data files. If we enable rename, this will cause data loss. So we decided to fail the operation when the delete is not a metadata-only operation. We detect this by checking that no planned file has a residual expression left after the filter is evaluated against its partition data.
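The idea behind the residual check can be illustrated with a toy model. This is a deliberate simplification, not Iceberg's actual residual evaluation: it assumes a single-column predicate and identity-style partition columns, and `ResidualModel`, `residual`, and `isMetadataOnlyDelete` are hypothetical names. If the filter column is a partition column, partition metadata answers the predicate for a whole file and nothing is left over; otherwise the predicate must still be checked row by row, which forces a file rewrite.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ResidualModel {
    static final String ALWAYS_TRUE = "true"; // models Expressions.alwaysTrue()

    // Toy residual: what is left of the predicate after partition values are applied.
    static String residual(String filterColumn, String predicate, Set<String> partitionColumns) {
        return partitionColumns.contains(filterColumn) ? ALWAYS_TRUE : predicate;
    }

    // A delete is metadata-only when whole files can be dropped, i.e. no residual remains.
    static boolean isMetadataOnlyDelete(
            String filterColumn, String predicate, Set<String> partitionColumns) {
        return ALWAYS_TRUE.equals(residual(filterColumn, predicate, partitionColumns));
    }

    public static void main(String[] args) {
        Set<String> partitionColumns = new HashSet<>(Arrays.asList("datepartition"));
        System.out.println(isMetadataOnlyDelete(
            "datepartition", "datepartition < '2024-01-01'", partitionColumns));
        System.out.println(isMetadataOnlyDelete(
            "event_time", "event_time < '2024-01-01'", partitionColumns));
    }
}
```

With real partition transforms (for example day or hour transforms on a timestamp), a filter on the source column can still leave a residual for boundary partitions, so the actual check needs to inspect each planned file's residual rather than only compare column names.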