This repository was archived by the owner on Feb 18, 2024. It is now read-only.
I have a working crate that extracts data from a file into ndarray arrays, which are later exposed to Python with PyO3 for big-data use. I would like to migrate it to arrow to benefit from its performance and ecosystem.
The data blocks are row-based. I currently read several columns in parallel with rayon into preallocated ndarrays, working from a chunk of data held in memory rather than the whole data block, to avoid consuming too much memory. The types are mainly primitives, but also utf8, complex numbers, and multidimensional arrays of primitive or complex values. The file bytes can be little- or big-endian.
Can you advise on the best way to read this data with arrow2? A few thoughts and directions I have considered so far:
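To make the per-chunk decoding concrete, here is a minimal std-only sketch of what I do for a single `f64` column. The names `Endian` and `append_f64_column`, and the fixed record layout (a record length plus a byte offset for the field), are hypothetical simplifications for illustration, not my actual code. The output is a plain `Vec<f64>` that grows chunk by chunk; as far as I understand, such a `Vec` could then be handed to arrow2 without copying, but that is exactly the part I would like to confirm.

```rust
/// Byte order of the values stored in the file (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Endian {
    Little,
    Big,
}

/// Decodes one f64 column from a row-based chunk and appends it to `out`.
///
/// `record_len` is the size of one row in bytes and `offset` is the byte
/// position of the field inside each row. Trailing bytes that do not form
/// a complete record are ignored.
pub fn append_f64_column(
    chunk: &[u8],
    record_len: usize,
    offset: usize,
    endian: Endian,
    out: &mut Vec<f64>,
) {
    assert!(
        record_len > 0 && offset + 8 <= record_len,
        "field must fit inside a record"
    );
    // Reserve once per chunk so the pushes below do not reallocate.
    out.reserve(chunk.len() / record_len);
    for record in chunk.chunks_exact(record_len) {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&record[offset..offset + 8]);
        out.push(match endian {
            Endian::Little => f64::from_le_bytes(bytes),
            Endian::Big => f64::from_be_bytes(bytes),
        });
    }
}
```

Each column gets its own output vector, so several columns can be decoded from the same chunk in parallel (with rayon in my case) without sharing mutable state.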
I noticed that a Buffer can be used to create a PrimitiveArray, but compared to the official arrow crate there is no from_bytes() implementation, and PrimitiveArray does not vectorise Buffer. How can I fill a Buffer chunk by chunk? Is there a way to pass in bytes along with a DataType?
It seems I could use MutablePrimitiveArray like a Vec with a defined capacity, but I am not sure it performs well. I am also a bit concerned about the cost of converting it to a PrimitiveArray.
Is it still acceptable to simply keep my ndarray implementation and convert everything to arrow2 at the end? I did not find any zero-copy conversion from ndarray.