@vfxcode vfxcode commented Dec 16, 2025

This PR is meant to allow Nessie GC to run in self-hosted Hadoop clusters using HDFS.

The original error when running GC against HDFS-stored warehouses is the following:

Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: hdfs://hadoop-namenode:9000/user/iceberg/...../metadata/v4603.metadata.json
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:57)
    at org.apache.iceberg.hadoop.HadoopInputFile.fromLocation(HadoopInputFile.java:56)
    at org.apache.iceberg.hadoop.HadoopFileIO.newInputFile(HadoopFileIO.java:87)
    at org.apache.iceberg.io.ResolvingFileIO.newInputFile(ResolvingFileIO.java:90)
    at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:294)
    at org.projectnessie.gc.iceberg.IcebergContentToFiles.extractTableFiles(IcebergContentToFiles.java:129)
    ... 27 more
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "hdfs"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3581)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3612)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:172)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3716)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3667)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:366)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:55)

This PR was tested in a couple of our environments and it seems to work mostly correctly. We did not observe any data loss, and the files that remain seem to be the expected ones.

There is another way to make it work without this patch: download the hadoop-hdfs and hadoop-hdfs-client jars directly from Maven and put them on the classpath with this command line:

java -cp "/tmp/hadoop-hdfs-3.4.2.jar:/tmp/hadoop-hdfs-client-3.4.2.jar:/nessie-gc.jar" org.projectnessie.gc.tool.cli.CLI gc
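
For anyone trying the classpath workaround, here is a minimal sketch of a check that the HDFS implementation class is actually loadable. It assumes that `org.apache.hadoop.hdfs.DistributedFileSystem` is the class Hadoop uses for the `hdfs` scheme; the class name `HdfsClasspathCheck` and its helper method are made up for this illustration and are not part of Nessie or Hadoop.

```java
public class HdfsClasspathCheck {
    // Assumption: this is the FileSystem implementation behind the "hdfs" scheme.
    static final String HDFS_FS_CLASS = "org.apache.hadoop.hdfs.DistributedFileSystem";

    // Returns true if the named class can be loaded from the current classpath.
    // The class is loaded but not initialized, so no static initializers run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, HdfsClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers the case where the class is present but one of
            // its dependencies (for example a superclass) is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean present = isOnClasspath(HDFS_FS_CLASS);
        System.out.println(HDFS_FS_CLASS + (present ? " found" : " NOT found") + " on the classpath");
    }
}
```

Compiled into the current directory, it could be run with the same classpath as the command above plus `.`, for example `java -cp "/tmp/hadoop-hdfs-3.4.2.jar:/tmp/hadoop-hdfs-client-3.4.2.jar:/nessie-gc.jar:." HdfsClasspathCheck`. A "NOT found" result would be consistent with the `No FileSystem for scheme "hdfs"` error, although it does not prove that this is the only cause.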

To be honest, I am not sure whether I am missing anything from the bigger picture, as I am not very familiar with Nessie yet.

CLAassistant commented Dec 16, 2025

CLA assistant check
All committers have signed the CLA.

Member

@dimas-b dimas-b left a comment

Thanks for your contribution, @vfxcode !

exclude("org.apache.zookeeper")
}
implementation(libs.hadoop.hdfs)
implementation(libs.hadoop.hdfs.client)
Member

should this be intTestRuntimeOnly like iceberg-aws?

exclude("org.apache.zookeeper")
}
implementation(libs.hadoop.hdfs)
implementation(libs.hadoop.hdfs.client)
Member

runtimeOnly like iceberg-aws?

exclude("org.apache.hadoop")
exclude("org.apache.zookeeper")
}
implementation(libs.hadoop.hdfs)
Member

Would it be possible to add a test similar to ITSparkIcebergNessieS3 but for HDFS?

Author

I am not very familiar with Java, let alone with Java test frameworks.
Could I take a look at it at a later time, so that it does not block this PR?

Member

How do we know GC works with HDFS now?

Member

Manual testing is fine in this case, IMHO... but if CI-based tests are not practical, we should probably not add HDFS to "test" dependencies for the sake of clarity.

@dimas-b dimas-b requested a review from snazy December 16, 2025 19:13
Author

vfxcode commented Dec 17, 2025

@dimas-b I will do another round of tests with the changes you proposed and verify that it still works.
