This repository was archived by the owner on May 7, 2021. It is now read-only.
I wrote a comparison between this package and pandas / pyarrow_ops on a task that joins, groups, and aggregates two tables of ~300K rows each. The DataFusion package is about 3–5 times slower than the alternatives.
What is causing this performance hit? Is it the serialization across the native / Python boundary, or the performance of DataFusion itself?
```python
import time

import datafusion
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow_ops import head, join, groupby

f1 = "path to local parquet file"
f2 = "path to local parquet file"

# Pandas
time1 = time.time()
t1 = pq.read_table(f1, columns=['sku_key', 'option_key']).to_pandas()
t2 = pq.read_table(f2, columns=['sku_key', 'economical']).to_pandas()
r = t1.merge(t2, on=['sku_key']).groupby(['option_key']).agg({'economical': 'sum'})
print("Query in Pandas took:", time.time() - time1)
print(r.head())

# Pyarrow ops
time2 = time.time()
t1 = pq.read_table(f1, columns=['sku_key', 'option_key'])
t2 = pq.read_table(f2, columns=['sku_key', 'economical'])
r = groupby(join(t1, t2, on=['sku_key']), by=['option_key']).agg({'economical': 'sum'})
print("\nQuery in Pyarrow ops took:", time.time() - time2)
head(r)

# DataFusion
time3 = time.time()
f = datafusion.functions
ctx = datafusion.ExecutionContext()
ctx.register_parquet("skus", f1)
ctx.register_parquet("stock_current", f2)
result = ctx.sql(
    "SELECT option_key, SUM(economical) as stock "
    "FROM stock_current as sc JOIN skus as sk USING (sku_key) "
    "GROUP BY option_key"
).collect()
r = pa.Table.from_batches(result)
print("\nQuery in DataFusion took:", time.time() - time3)
head(r)
```
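One way to narrow down where the time goes is to time each phase of the DataFusion path separately, rather than the whole pipeline at once: the `collect()` call (query execution plus crossing into Python) versus the `pa.Table.from_batches()` conversion. A minimal sketch of that pattern, using a generic helper (the commented usage lines reuse the names `ctx`, `query`, and `pa` from the benchmark above; they are assumptions, not verified API behavior):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

# Hypothetical usage against the DataFusion benchmark above:
#   batches = timed("execute + collect", ctx.sql(query).collect)
#   table   = timed("batches -> pyarrow.Table", pa.Table.from_batches, batches)
# If the first phase dominates, the cost is in query execution rather than
# in converting the resulting record batches on the Python side.

# Self-contained demonstration with a stand-in workload:
total = timed("sum phase", sum, range(1_000_000))
```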