https://duckdb.org/2024/11/14/optimizers.html#filter-pull-up--filter-pushdown has a nice description of filter pull up, an optimization in DuckDB that I'd like to implement in dask-expr as a learning exercise. dask-expr currently implements predicate push down, where a read_parquet followed by a filter on the values of a column is translated into a read_parquet with the filters set appropriately:
df = dd.read_parquet("file.parquet")
df = df[df["name"] == "Alice"]
is optimized to
df = dd.read_parquet("file.parquet", filters=[("name", "=", "Alice")])
The idea of a predicate pull-up is similar, but applied to the other side of an join / merge:
left = pd.read_parquet("left.parquet")
right = pd.read_parquet("right.parquet")
left = left[left["name"] == "Alice"]
# Both left and right should get the `filters=` pushed down to the read_parquet
result = left.merge(right, on="name", how="inner")
We know that the result of the inner join will only have name=="Alice" since the left DataFrame will only have name=="Alice" thanks to the preceding filter. Because the filter column is also the join column, dask-expr should be able to pull that filter all the way up through the merge and then push it down to the right = pd.read_parquet, just like it does for the left side.
https://github.com/dask/dask-expr/compare/main...TomAugspurger:dask-expr:tom/predicate-pull-up?expand=1 has an initial cut at this that I'll turn into a PR soon. The basic idea is to implement Merge._simplify_down to check for the prerequisites for this optimization (I've only implemented a very specific case, but I think the main / exclusive requirement is that there's a join and filter on the same column). I think that _simplify_down is the appropriate place to do this. IIUC, the Merge is "above" the read_parquet. We want to take the Filter that's on some side of the Merge, pull it up to the `Merge, and push it down the other side.
(I don't know whether this is an important optimization in practice, but it seemed like a pretty good problem to learn a bit about how dask-expr works).
https://duckdb.org/2024/11/14/optimizers.html#filter-pull-up--filter-pushdown has a nice description of filter pull up, an optimization in DuckDB that I'd like to implement in dask-expr as a learning exercise. dask-expr currently implements predicate push down, where a
read_parquetfollowed by a filter on the values of a column is translated into aread_parquetwith thefiltersset appropriately:is optimized to
The idea of a predicate pull-up is similar, but applied to the other side of an join / merge:
We know that the result of the inner join will only have
name=="Alice"since theleftDataFrame will only havename=="Alice"thanks to the preceding filter. Because the filter column is also the join column, dask-expr should be able to pull that filter all the way up through the merge and then push it down to theright = pd.read_parquet, just like it does for theleftside.https://github.com/dask/dask-expr/compare/main...TomAugspurger:dask-expr:tom/predicate-pull-up?expand=1 has an initial cut at this that I'll turn into a PR soon. The basic idea is to implement
Merge._simplify_downto check for the prerequisites for this optimization (I've only implemented a very specific case, but I think the main / exclusive requirement is that there's a join and filter on the same column). I think that_simplify_downis the appropriate place to do this. IIUC, theMergeis "above" theread_parquet. We want to take theFilterthat's on some side of theMerge, pull it up to the `Merge, and push it down the other side.(I don't know whether this is an important optimization in practice, but it seemed like a pretty good problem to learn a bit about how dask-expr works).