-
Notifications
You must be signed in to change notification settings - Fork 168
gc: Add libraries for HDFS support #11751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
dimas-b
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution, @vfxcode !
gc/gc-tool-inttest/build.gradle.kts
Outdated
| exclude("org.apache.zookeeper") | ||
| } | ||
| implementation(libs.hadoop.hdfs) | ||
| implementation(libs.hadoop.hdfs.client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be intTestRuntimeOnly like iceberg-aws?
gc/gc-tool/build.gradle.kts
Outdated
| exclude("org.apache.zookeeper") | ||
| } | ||
| implementation(libs.hadoop.hdfs) | ||
| implementation(libs.hadoop.hdfs.client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtimeOnly like iceberg-aws?
gc/gc-tool-inttest/build.gradle.kts
Outdated
| exclude("org.apache.hadoop") | ||
| exclude("org.apache.zookeeper") | ||
| } | ||
| implementation(libs.hadoop.hdfs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add a test similar to ITSparkIcebergNessieS3 but for HDFS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very familiar with Java let alone with Java Test frameworks.
Coudl I take a look at it but at a later time to not block this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know GC works with HDFS now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual testing is fine in this case, IMHO... but if CI-based tests are not practical, we should probably not add HDFS to "test" dependencies for the sake of clarity.
|
@dimas-b I will do another round of tests with the changes you proposed and verify that it still works |
This PR is meant to allow Nessie GC to run in self-hosted Hadoop clusters using HDFS.
The original Error when using HDFS stored warehouses is the following:
This PR was tested in a couple of our environments and it seems to work mostly correct. We could not observe any data loss and the files that remain seem to be the expected ones.
There is another way to make it work without this patch, by using the hadoop-hdfs and hadoop-hdfs-client jars directly from maven using this command line:
Not sure if I am missing anything from the bigger picture to be honest as I am not very familiar with Nessie yet.