Problem
Currently, flow.record data is stored in a custom record-based format. While efficient for streaming and sequential access, it presents challenges for:
- Data Science & Analytics: Integrating with modern data stacks (Pandas, Polars, DuckDB, Spark) requires converting data first, creating friction for analysts.
- Columnar Analysis: Some use cases only need a subset of fields (e.g., just the `src_ip` and `dst_ip` columns). Our current row-based formats require reading the entire record, which is inefficient for these access patterns.
- Storage Efficiency: While we support compression, columnar formats like Parquet often offer better compression ratios for repetitive data types.
Proposed Solution
Implement an adapter for Apache Parquet, a standard open-source columnar storage format. This would allow flow.record to natively read and write Parquet files.
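As a rough illustration of the intended developer experience, the snippet below shows how reading and writing could look if the adapter registers under a hypothetical `parquet://` scheme. The scheme name, descriptor, and file name are placeholders, not a decided API:

```python
from flow.record import RecordDescriptor, RecordReader, RecordWriter

# Placeholder descriptor for the examples in this issue.
FlowRecord = RecordDescriptor(
    "network/flow",
    [
        ("net.ipaddress", "src_ip"),
        ("net.ipaddress", "dst_ip"),
        ("uint16", "dst_port"),
    ],
)

# Writing records through the proposed adapter (hypothetical "parquet://" scheme).
with RecordWriter("parquet://flows.parquet") as writer:
    writer.write(FlowRecord(src_ip="10.0.0.1", dst_ip="10.0.0.2", dst_port=443))

# Reading them back, with pyarrow doing the work under the hood.
with RecordReader("parquet://flows.parquet") as reader:
    for record in reader:
        print(record.src_ip, record.dst_ip, record.dst_port)
```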
Benefits
- First-class Interoperability: `flow.record` datasets could be directly queried by DuckDB or loaded into Pandas/Polars DataFrames without intermediate conversion steps (see the consumption sketch after this list).
- Improved Performance for Analytics: Users can read only the specific columns they need, significantly reducing I/O for wide records.
- Ecosystem Integration: Opens the door to using the vast ecosystem of tools that support Parquet (AWS Athena, BigQuery, Spark, etc.).
- Efficient Storage: Parquet's columnar compression and encoding schemes (RLE, Dictionary, etc.) are highly effective for strictly typed telemetry data.
- Metadata: The number of rows and other metadata are stored in the Parquet file metadata. This information could be leveraged by `rdump` (e.g., an improved progress bar).
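To make these benefits concrete, here is a sketch of how a Parquet file produced by the proposed adapter (the hypothetical `flows.parquet` from the earlier snippet) could be consumed downstream with standard tooling:

```python
import duckdb
import polars as pl
import pyarrow.parquet as pq

# Query the Parquet output directly with DuckDB, no conversion step needed.
top_talkers = duckdb.sql(
    "SELECT src_ip, count(*) AS hits FROM 'flows.parquet' GROUP BY src_ip ORDER BY hits DESC"
).fetchall()

# Or load it straight into a Polars DataFrame.
df = pl.read_parquet("flows.parquet")

# Column projection: read only the columns needed, skipping the rest of the file.
table = pq.read_table("flows.parquet", columns=["src_ip", "dst_ip"])

# The row count is available from the footer metadata without scanning the data,
# which is the kind of information rdump could use for a progress bar.
num_rows = pq.ParquetFile("flows.parquet").metadata.num_rows
```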
Implementation Details
- Add a `ParquetWriter` adapter utilizing pyarrow (a minimal writer sketch follows this list).
- Add a `ParquetReader` adapter with column projection support, and also expose column selection in `rdump`.
- Map `flow.record` field types to Arrow types (handling complex types like `digest` and `path`).
- Parquet supports storing custom metadata, which can be used to store and read the `RecordDescriptor`.
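The following is a minimal sketch of the writer side, not a finished implementation. The type mapping is intentionally incomplete, `get_field_tuples()` is assumed to be the way to enumerate a descriptor's (type, name) pairs, and per-fieldtype value conversion is only hinted at in comments:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative mapping from flow.record field types to Arrow types; a real
# adapter would need to cover the full type registry (net.*, digest, path, ...).
FIELD_TYPE_MAP = {
    "string": pa.string(),
    "uint16": pa.uint16(),
    "uint32": pa.uint32(),
    "varint": pa.int64(),
    "boolean": pa.bool_(),
    "datetime": pa.timestamp("us", tz="UTC"),
    "net.ipaddress": pa.string(),  # could also be fixed-size binary or an extension type
    "path": pa.string(),
    "digest": pa.struct([("md5", pa.string()), ("sha1", pa.string()), ("sha256", pa.string())]),
}


def schema_for(descriptor):
    """Build an Arrow schema from a RecordDescriptor and embed the descriptor
    definition in the Parquet custom metadata so readers can reconstruct it."""
    # Assumption: get_field_tuples() yields (typename, fieldname) pairs.
    field_tuples = descriptor.get_field_tuples()
    fields = [pa.field(name, FIELD_TYPE_MAP.get(typename, pa.string())) for typename, name in field_tuples]
    metadata = {b"flow.record.descriptor": json.dumps([descriptor.name, list(field_tuples)]).encode()}
    return pa.schema(fields, metadata=metadata)


class ParquetWriterSketch:
    """Minimal writer sketch: buffer records and flush them as row groups
    through pyarrow.parquet.ParquetWriter."""

    def __init__(self, path, descriptor, batch_size=10_000):
        self.schema = schema_for(descriptor)
        self.fieldnames = [name for _, name in descriptor.get_field_tuples()]
        self.writer = pq.ParquetWriter(path, self.schema)
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        # NOTE: a real adapter needs per-fieldtype value conversion here
        # (e.g. str() for net.ipaddress/path, a dict for digest); this sketch
        # assumes plain Python values.
        self.buffer.append({name: getattr(record, name) for name in self.fieldnames})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.writer.write_table(pa.Table.from_pylist(self.buffer, schema=self.schema))
            self.buffer = []

    def close(self):
        self.flush()
        self.writer.close()
```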
Some things to take into account
- A Parquet file supports only one schema. A workable solution is needed for sources that mix multiple RecordDescriptors (same problem as: csv adapter repeats header #190); one possible approach is sketched below.
- pyarrow is a fairly large dependency (~42 MB), so Parquet support should be completely optional.
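For the mixed-descriptor problem, one possible (undecided) approach is to fan records out to one Parquet file per descriptor. The sketch below reuses the hypothetical `ParquetWriterSketch` from the previous snippet and assumes records expose their descriptor via `_desc`:

```python
class MultiDescriptorParquetWriter:
    """One possible approach: write one Parquet file per RecordDescriptor,
    since a single Parquet file holds exactly one schema."""

    def __init__(self, base_path):
        self.base_path = base_path
        self.writers = {}  # descriptor name -> ParquetWriterSketch

    def write(self, record):
        descriptor = record._desc  # assumption: records expose their descriptor as `_desc`
        writer = self.writers.get(descriptor.name)
        if writer is None:
            # e.g. "output/network_flow.parquet" for the "network/flow" descriptor
            path = f"{self.base_path}/{descriptor.name.replace('/', '_')}.parquet"
            writer = self.writers[descriptor.name] = ParquetWriterSketch(path, descriptor)
        writer.write(record)

    def close(self):
        for writer in self.writers.values():
            writer.close()
```

Alternatives worth weighing are a union schema across all observed descriptors (with nulls for absent fields) or a partitioned dataset directory; the per-descriptor split above is just the simplest to reason about.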