This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Reading data chunk by chunk #987

@ratal

Description


I have a working crate that extracts data from a file into ndarray, later exposed with PyO3, for big data. I would like to migrate it to arrow instead, to benefit from the performance and the arrow ecosystem.
The data blocks being read are row-based. I currently read several columns in parallel (using rayon) into already-allocated ndarrays, working from a chunk of data in memory rather than the whole data block, in order to avoid consuming too much memory. The types are mainly primitives, but also utf8, complex, and multi-dimensional arrays of primitives or complex. The file bytes can be little or big endian.
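The chunk-by-chunk decoding described above, including the little/big endian handling, can be sketched with the standard library alone (`decode_f64_chunk` is a hypothetical helper; a real reader would first slice the row-based block into per-column byte runs):

```rust
// Decode one in-memory chunk of raw file bytes into a typed column,
// selecting the byte order at runtime.
fn decode_f64_chunk(bytes: &[u8], little_endian: bool) -> Vec<f64> {
    bytes
        .chunks_exact(8) // one f64 per 8 bytes; a trailing remainder is ignored
        .map(|b| {
            let arr: [u8; 8] = b.try_into().unwrap(); // chunks_exact guarantees length 8
            if little_endian {
                f64::from_le_bytes(arr)
            } else {
                f64::from_be_bytes(arr)
            }
        })
        .collect()
}

fn main() {
    // Round-trip a little-endian chunk...
    let le: Vec<u8> = [1.0f64, 2.0].iter().flat_map(|v| v.to_le_bytes()).collect();
    assert_eq!(decode_f64_chunk(&le, true), vec![1.0, 2.0]);
    // ...and a big-endian one.
    let be: Vec<u8> = [3.5f64].iter().flat_map(|v| v.to_be_bytes()).collect();
    assert_eq!(decode_f64_chunk(&be, false), vec![3.5]);
}
```

Each per-column closure like this is independent, which is what makes the rayon parallelisation across columns straightforward.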

Can I have your advice on the best method to read this data with arrow2? A few thoughts and directions I have considered so far:

  1. I noticed there is Buffer for creating a PrimitiveArray, but compared to the official arrow crate, there is no from_bytes() implementation, and PrimitiveArray does not vectorise Buffer. How do I feed data chunk by chunk into a Buffer? Is there a way to input bytes along with a DataType?
  2. It seems I could use MutablePrimitiveArray like a Vec with a defined capacity, but I am not sure how well this performs. I am also a bit afraid of the cost of converting it to a PrimitiveArray.
  3. Is it still acceptable to simply keep my ndarray implementation and convert everything to arrow2 at the end? I did not see any zero-copy path from ndarray.
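On point 2, the usual pattern is to accumulate decoded chunks into one growable buffer per column, with the capacity reserved up front so no reallocation happens. A minimal stdlib-only sketch (the arrow2 handoff in the comment is an assumption, not verified against a specific arrow2 version):

```rust
// Accumulate decoded chunks into a single Vec per column. Reserving the
// full capacity up front means each chunk costs one memcpy and the
// backing allocation never moves.
fn collect_column(chunks: &[Vec<f64>], total_len: usize) -> Vec<f64> {
    let mut values = Vec::with_capacity(total_len); // one allocation, done once
    for chunk in chunks {
        values.extend_from_slice(chunk); // memcpy per chunk, no realloc
    }
    values
}

fn main() {
    let chunks = vec![vec![1.0, 2.0], vec![3.0]];
    let col = collect_column(&chunks, 3);
    assert_eq!(col, vec![1.0, 2.0, 3.0]);
    // Assumed arrow2 handoff, which should wrap the Vec without copying:
    // let array = arrow2::array::PrimitiveArray::from_vec(col);
}
```

If the final conversion into an arrow array takes ownership of the Vec (rather than copying it), the cost feared in point 2 would be O(1) regardless of column length.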

Labels: question (Further information is requested)