Hadoop parquet reader (and parquet-CLI) fails on some files from parquet-testing #3336

@ccleva

Description

Describe the bug, including details regarding any error messages, version, and platform.

Tested using v1.16.0 on OpenJDK 11 and 17.

  1. nation.dict-malformed.parquet
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file nation.dict-malformed.parquet
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
	at org.apache.parquet.cli.Main.run(Main.java:169)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: nation.dict-malformed.parquet
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:360)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
	... 3 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:2100)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1990)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1920)
	at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1454)
	at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1188)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:1135)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1380)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	... 6 more
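
The same failure should reproduce outside the CLI with a plain parquet-avro read loop. A minimal sketch (assuming parquet-avro and a Hadoop client on the classpath, and the file in the working directory):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadNationDictMalformed {
  public static void main(String[] args) throws Exception {
    // Read the same file as the CLI command above; I'd expect this to hit
    // the same EOFException from the stack trace.
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("nation.dict-malformed.parquet"))
            .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}
```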

This seems related to an issue with an older version of the Java writer: apache/arrow#42298.

It was fixed on the C++/Python side by apache/parquet-cpp#209 (the file loads fine with PyArrow), but maybe not in the Hadoop reader?

Links to the fix and the test in the current version:
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/file_reader.cc#L199
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/reader_test.cc#L977
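
If I read the linked code correctly, the C++ workaround pads the number of bytes buffered for a column chunk when the writer predates the fix, because old parquet-mr left the dictionary page header out of total_compressed_size. A reader that buffers exactly total_compressed_size would then hit end-of-buffer mid-page, which matches the EOFException above. A rough Java sketch of that logic (illustrative only, names are made up and this is not parquet-java API):

```java
// Illustrative sketch of the padding applied by the linked C++ workaround.
public final class DictHeaderPadding {

  // Assumed bound for this sketch, mirroring kMaxDictHeaderSize in the C++ reader.
  private static final long MAX_DICT_HEADER_SIZE = 100;

  /**
   * Number of bytes to buffer for a column chunk starting at colStart:
   * pad past total_compressed_size (bounded by the bytes left in the file)
   * when the writer is known to under-report the chunk size.
   */
  static long chunkReadLength(long colStart, long totalCompressedSize,
                              long fileSize, boolean writerUnderReportsSize) {
    long length = totalCompressedSize;
    if (writerUnderReportsSize) {
      long bytesRemaining = fileSize - (colStart + totalCompressedSize);
      length += Math.min(MAX_DICT_HEADER_SIZE, Math.max(0, bytesRemaining));
    }
    return length;
  }
}
```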

Note that the file can be read by the old parquet-tools (tested with v1.10.1).

  2. fixed_length_byte_array.parquet

PyArrow (and parquet-tools) also fail to read this one, so I think it's a wider problem. I'll open an issue on their repository; this is more to let you know.

> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat fixed_length_byte_array.parquet
{"flba_field": [0, 0, 3, -24]}
[...]
{"flba_field": [0, 0, 3, -122]}
Unknown error
java.lang.RuntimeException: Failed on record 90 in file fixed_length_byte_array.parquet
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
	at org.apache.parquet.cli.Main.run(Main.java:169)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 92 in block 0 in file file:/home/ccleva/dev/tlabs-data/tablesaw-parquet/target/test/data/parquet-testing-master/data/fixed_length_byte_array.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:350)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
	... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [flba_field] required fixed_len_byte_array(4) flba_field at value 92 out of 1000, 92 out of 100 in currentPage. repetition level: 0, definition level: 0
	at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:604)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:477)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
	... 7 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 364
	at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:47)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:411)
	at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:579)
	... 12 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
	at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:45)
	... 14 more

I recreated the file using the script in apache/parquet-testing#31 in case it was corrupted, but got the same result (at a different offset).

Component(s)

No response
