This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Reading data chunk by chunk #987

@ratal

Description


I have a working crate that extracts data from a file into ndarray, later exposed with PyO3, for big data. I would like to migrate it to arrow instead, to benefit from the performance and the arrow ecosystem.
The data blocks being read are row-based. I currently read several columns in parallel (using rayon) into already-allocated ndarrays, working from a chunk of data in memory rather than the whole data block, in order to avoid consuming too much memory. The types are mainly primitives, but also utf8, complex, and multi-dimensional arrays of primitives or complex. The file bytes can be little or big endian.
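The chunk-by-chunk decoding described above, including the little/big endian handling, can be sketched with the standard library alone (`decode_f64_chunk` is a hypothetical helper; a real reader would first slice the row-based block into per-column byte runs):

```rust
// Decode one in-memory chunk of raw file bytes into a typed column,
// selecting the byte order at runtime.
fn decode_f64_chunk(bytes: &[u8], little_endian: bool) -> Vec<f64> {
    bytes
        .chunks_exact(8) // one f64 per 8 bytes; a trailing remainder is ignored
        .map(|b| {
            let arr: [u8; 8] = b.try_into().unwrap(); // chunks_exact guarantees length 8
            if little_endian {
                f64::from_le_bytes(arr)
            } else {
                f64::from_be_bytes(arr)
            }
        })
        .collect()
}

fn main() {
    // Round-trip a little-endian chunk...
    let le: Vec<u8> = [1.0f64, 2.0].iter().flat_map(|v| v.to_le_bytes()).collect();
    assert_eq!(decode_f64_chunk(&le, true), vec![1.0, 2.0]);
    // ...and a big-endian one.
    let be: Vec<u8> = [3.5f64].iter().flat_map(|v| v.to_be_bytes()).collect();
    assert_eq!(decode_f64_chunk(&be, false), vec![3.5]);
}
```

Each per-column closure like this is independent, which is what makes the rayon parallelisation across columns straightforward.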

Can I have your advice on the best method to read this data with arrow2? A few thoughts and directions I have considered so far:

  1. I noticed there is Buffer for creating a PrimitiveArray, but compared to the official arrow crate, there is no from_bytes() implementation, and PrimitiveArray does not vectorise Buffer. How do I feed data chunk by chunk into a Buffer? Is there a way to input bytes along with a DataType?
  2. It seems I could use MutablePrimitiveArray like a Vec with a defined capacity, but I am not sure how well this performs. I am also a bit afraid of the cost of converting it to a PrimitiveArray.
  3. Is it still acceptable to simply keep my ndarray implementation and convert everything to arrow2 at the end? I did not see any zero-copy path from ndarray.
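On point 2, the usual pattern is to accumulate decoded chunks into one growable buffer per column, with the capacity reserved up front so no reallocation happens. A minimal stdlib-only sketch (the arrow2 handoff in the comment is an assumption, not verified against a specific arrow2 version):

```rust
// Accumulate decoded chunks into a single Vec per column. Reserving the
// full capacity up front means each chunk costs one memcpy and the
// backing allocation never moves.
fn collect_column(chunks: &[Vec<f64>], total_len: usize) -> Vec<f64> {
    let mut values = Vec::with_capacity(total_len); // one allocation, done once
    for chunk in chunks {
        values.extend_from_slice(chunk); // memcpy per chunk, no realloc
    }
    values
}

fn main() {
    let chunks = vec![vec![1.0, 2.0], vec![3.0]];
    let col = collect_column(&chunks, 3);
    assert_eq!(col, vec![1.0, 2.0, 3.0]);
    // Assumed arrow2 handoff, which should wrap the Vec without copying:
    // let array = arrow2::array::PrimitiveArray::from_vec(col);
}
```

If the final conversion into an arrow array takes ownership of the Vec (rather than copying it), the cost feared in point 2 would be O(1) regardless of column length.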

Labels: question (Further information is requested)