feat: fix schema mismatch between native and python #28
Conversation
Add basic tests that at least assert the functionality works.
@mrpowers-wb cc
@zhuqi-lucas unfortunately I don't think anyone will review it fast enough (as usual here), so I can bypass review, merge, and release it. Sorry, it was my fault (to be honest, I didn't expect Arrow to be SO strict about schemas that it cannot merge non-null values into a nullable column; why not? non-null < nullable, as far as I'm concerned). Of course I should have added more tests after adding joins, but I didn't. The schemas are now synced between lib.rs and the Python CLI wrapper, and it should work. At least I hope so. If not, I will try to fix it ASAP; just ping me.
Thank you @SemyonSinchenko for the quick response and fix! And I agree, Arrow should support merging non-null values into a nullable column; it may also be a bug on the pyarrow side.
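The widening behavior the comments argue for can be sketched in plain Python. Note this is an illustration of the desired semantics, not pyarrow's actual API; the `Field` and `merge_field` names here are hypothetical. The idea: a non-null column fits anywhere a nullable one is expected, so a schema merge could simply take the union of nullability instead of rejecting the mismatch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow schema field."""
    name: str
    dtype: str
    nullable: bool

def merge_field(target: Field, incoming: Field) -> Field:
    # Names and types must still match exactly.
    if (target.name, target.dtype) != (incoming.name, incoming.dtype):
        raise ValueError(f"incompatible fields: {target} vs {incoming}")
    # Widening: non-null data can always be written into a nullable slot,
    # so the merged field is nullable if either side is.
    return Field(target.name, target.dtype,
                 target.nullable or incoming.nullable)

# Merging a non-null id1 into a nullable id1 succeeds and stays nullable.
merged = merge_field(Field("id1", "int64", nullable=True),
                     Field("id1", "int64", nullable=False))
print(merged.nullable)  # True
```

Arrow instead treats nullability as part of strict schema equality, which is why the mismatched schemas between lib.rs and the Python wrapper surfaced as a hard error rather than a silent widening.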
I will merge it and make a release; ping me if it doesn't work!
Thank you @SemyonSinchenko, I will try it after the release.
It works: ./bench.sh data h2o_small_join_parquet
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: h2o_small_join_parquet
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Found Python version 3.13, which is suitable.
Using Python command: /opt/homebrew/bin/python3
Installing falsa...
Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
10 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e1_0.parquet
10000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e4_0.parquet
10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e7_NA.parquet
An SMALL data schema is the following:
id1: int64 not null
id4: string not null
v2: double not null
An output format is PARQUET
Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
An MEDIUM data schema is the following:
id1: int64 not null
id2: int64 not null
id4: string not null
id5: string not null
v2: double not null
An output format is PARQUET
Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
An BIG data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v2: double not null
An output format is PARQUET
Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:02
An LSH data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v1: double not null
An output format is PARQUET
Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00