Description
Describe the bug, including details regarding any error messages, version, and platform.
Tested using v1.16.0 on OpenJDK 11 and 17.
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file nation.dict-malformed.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: nation.dict-malformed.parquet
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:360)
at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:2100)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1990)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1920)
at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1454)
at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1188)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:1135)
at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1380)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
... 6 more
This seems related to an issue with an older version of the Java writer: apache/arrow#42298
It has been fixed in the C++/Python implementation by apache/parquet-cpp#209 (the file loads fine with pyArrow), but maybe not in the Hadoop reader?
Links to the fix and the test in current version:
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/file_reader.cc#L199
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/reader_test.cc#L977
Note that the file can be read by the old parquet-tools (tested with v1.10.1).
A second file, fixed_length_byte_array.parquet, fails as well. pyArrow and parquet-tools also fail to read this one, so I think it's a wider problem. I'll open an issue on their repository; this is more to let you know.
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat fixed_length_byte_array.parquet
{"flba_field": [0, 0, 3, -24]}
[...]
{"flba_field": [0, 0, 3, -122]}
Unknown error
java.lang.RuntimeException: Failed on record 90 in file fixed_length_byte_array.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 92 in block 0 in file file:/home/ccleva/dev/tlabs-data/tablesaw-parquet/target/test/data/parquet-testing-master/data/fixed_length_byte_array.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:350)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [flba_field] required fixed_len_byte_array(4) flba_field at value 92 out of 1000, 92 out of 100 in currentPage. repetition level: 0, definition level: 0
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:604)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:477)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
... 7 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 364
at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:47)
at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:411)
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:579)
... 12 more
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:45)
... 14 more
I recreated the file using the script in apache/parquet-testing#31 in case it was corrupted, but got the same result (at a different offset).
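For reference, here is a rough sanity check of the numbers in the second trace, written as a small Python sketch. It assumes that the "offset 364" in the message is relative to the start of the page's value data, and that the page is PLAIN-encoded with no level data (the column is required, so no definition or repetition levels are expected). Both points are assumptions on my side, not something confirmed from the reader code, and the helper names are made up for illustration.

```python
# Hypothetical helpers, not part of parquet-java or pyArrow.

def locate_flba_value(byte_offset: int, type_length: int) -> tuple[int, int]:
    """Map a byte offset in a PLAIN fixed_len_byte_array value stream to
    (zero-based value index, bytes into that value).

    PLAIN encoding stores FLBA values back to back with no length prefix,
    so value i starts at byte i * type_length.
    """
    if type_length <= 0:
        raise ValueError("type_length must be positive")
    return divmod(byte_offset, type_length)


def plain_flba_page_bytes(num_values: int, type_length: int) -> int:
    """Bytes needed to hold num_values PLAIN-encoded FLBA values."""
    return num_values * type_length


# Numbers taken from the trace: fixed_len_byte_array(4), offset 364,
# 100 values in the current page.
index, remainder = locate_flba_value(364, 4)
print(index, remainder)               # 91 0
print(plain_flba_page_bytes(100, 4))  # 400
```

Under those assumptions, offset 364 is the start of the 92nd value (zero-based index 91), which is consistent with "92 out of 100 in currentPage", and a full page of 100 values would need 400 bytes. That suggests the reader sees less value data than the page header announces, rather than a single bad value.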
Component(s)
No response