Add Apache Parquet Adapter #207

@yunzheng

Description

Problem

Currently, flow.record data is stored in a custom record-based format. While efficient for streaming and sequential access, it presents challenges for:

  1. Data Science & Analytics: Integrating with modern data stacks (Pandas, Polars, DuckDB, Spark) requires converting data first, creating friction for analysts.
  2. Columnar Analysis: Some use cases only need a subset of fields (e.g., just the src_ip and dst_ip columns). Our current row-based formats require reading the entire record, which is inefficient for these access patterns.
  3. Storage Efficiency: While we support compression, columnar formats like Parquet often offer better compression ratios for repetitive data types.

Proposed Solution

Implement an adapter for Apache Parquet, a standard open-source columnar storage format. This would allow flow.record to natively read and write Parquet files.

Benefits

  • First-class Interoperability: flow.record datasets could be directly queried by DuckDB or loaded into Pandas/Polars DataFrames without intermediate conversion steps.
  • Improved Performance for Analytics: Users can read only the specific columns they need, significantly reducing I/O for wide records (see the sketch after this list).
  • Ecosystem Integration: Opens the door to using the vast ecosystem of tools that support Parquet (AWS Athena, BigQuery, Spark, etc.).
  • Efficient Storage: Parquet's columnar compression and encoding schemes (RLE, Dictionary, etc.) are highly effective for strictly typed telemetry data.
  • Metadata: Parquet files store the row count and other statistics in their footer metadata, which rdump could leverage (e.g. for an improved progress bar).
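
As a rough illustration of the interoperability and column projection benefits, the snippet below reads a Parquet file with pyarrow and queries it in place with DuckDB. The file name `conn.parquet` and the `src_ip`/`dst_ip` columns are assumptions for the example, not output of an existing adapter.

```python
import duckdb
import pyarrow.parquet as pq

# Column projection: read only the columns we need instead of whole records.
# "conn.parquet" is a hypothetical file produced by the proposed ParquetWriter.
table = pq.read_table("conn.parquet", columns=["src_ip", "dst_ip"])
df = table.to_pandas()

# Or query the file in place with DuckDB, without an intermediate conversion step.
top_talkers = duckdb.sql(
    "SELECT src_ip, count(*) AS hits FROM 'conn.parquet' "
    "GROUP BY src_ip ORDER BY hits DESC LIMIT 10"
).fetchall()
```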

Implementation Details

  • Add a ParquetWriter adapter built on pyarrow.
  • Add a ParquetReader adapter with column projection support, and expose column selection in rdump.
  • Map flow.record types to Arrow types, handling complex types like digest and path (a sketch follows after this list).
  • Parquet supports storing custom key/value metadata, which can be used to store and restore the RecordDescriptor.
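
A minimal sketch of the type mapping and descriptor metadata, assuming pyarrow as the backend. The contents of FIELD_TYPE_MAP and the arrow_schema helper are illustrative assumptions, not an existing flow.record API, and complex types like digest and path will need more careful treatment than shown here.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed mapping of simple flow.record field types to Arrow types.
FIELD_TYPE_MAP = {
    "string": pa.string(),
    "varint": pa.int64(),
    "boolean": pa.bool_(),
    "datetime": pa.timestamp("us", tz="UTC"),
    "net.ipaddress": pa.string(),  # could also be stored as raw bytes
    "path": pa.string(),
    # digest is a composite of hashes, so an Arrow struct is a natural fit
    "digest": pa.struct(
        [("md5", pa.string()), ("sha1", pa.string()), ("sha256", pa.string())]
    ),
}


def arrow_schema(field_tuples, descriptor_definition):
    """Build an Arrow schema from (type, name) tuples and embed the
    RecordDescriptor definition in the Parquet key/value metadata."""
    fields = [
        pa.field(name, FIELD_TYPE_MAP.get(typename, pa.string()))
        for typename, name in field_tuples
    ]
    metadata = {b"flow.record.descriptor": descriptor_definition.encode()}
    return pa.schema(fields, metadata=metadata)


# Example: write a single batch with the derived schema.
# The descriptor definition string here is only a placeholder.
schema = arrow_schema([("string", "src_ip"), ("string", "dst_ip")], "test/conn")
with pq.ParquetWriter("conn.parquet", schema) as writer:
    batch = pa.record_batch(
        [pa.array(["10.0.0.1"]), pa.array(["10.0.0.2"])], schema=schema
    )
    writer.write_batch(batch)
```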

Some things to take into account

  • A Parquet file supports only one schema, so a workable approach is needed for sources that yield mixed RecordDescriptors (the same problem as: csv adapter repeats header #190).
  • pyarrow is a fairly large dependency (~42 MB), so Parquet support should be completely optional (see the sketch below).
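
On the optional-dependency point, one possible pattern (a sketch, not a decided design) is to import pyarrow lazily inside the adapter and raise a helpful error only when the Parquet adapter is actually used. The `flow.record[parquet]` extra named below is hypothetical.

```python
# Lazy import so that flow.record itself does not require pyarrow.
try:
    import pyarrow.parquet as pq
except ImportError:
    pq = None


class ParquetWriter:
    """Sketch of an adapter entry point that only needs pyarrow at use time."""

    def __init__(self, path, **kwargs):
        if pq is None:
            # "flow.record[parquet]" is a hypothetical extras name.
            raise ImportError(
                "pyarrow is required for Parquet support: "
                "pip install flow.record[parquet]"
            )
        self.path = path
        self.writer = None  # created lazily once the first record's schema is known
```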
