Processing on a laptop with polars streaming API (untested on full dataset) #4

@marius-mather

Description

I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) those 10 billion rows process in 85 seconds. As far as I can tell, the time scales more or less linearly with size, which would put 1 trillion rows at roughly 2.5 hours. More cores could probably bring that down further, but it makes processing the dataset viable on a single laptop.

If anyone is able to give it a try on the full dataset, please do!
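If the runtime really does scale linearly, the 1-trillion-row estimate follows from a simple extrapolation (a back-of-the-envelope sketch using the timings above; the numbers are from this post, not a full-dataset run):

```python
# Linear extrapolation from the measured 10-billion-row run
rows_tested = 10_000_000_000       # rows actually benchmarked
seconds_tested = 85                # measured wall-clock time
target_rows = 1_000_000_000_000    # 1 trillion rows

estimated_seconds = seconds_tested * target_rows / rows_tested
print(f"{estimated_seconds / 3600:.1f} hours")  # -> 2.4 hours
```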

import glob
import time
import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    # Per-station min/max/mean over the "measure" column
    return (
        df.group_by("station")
        .agg(
            pl.col("measure").min().alias("min"),
            pl.col("measure").max().alias("max"),
            pl.col("measure").mean().alias("mean"),
        )
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    # Lazily scan all parquet files; nothing is read until collect()
    files = glob.glob("data/*.parquet")
    data = pl.scan_parquet(files, rechunk=True)
    query = aggregate(data)
    # streaming=True runs the query on the out-of-core streaming engine,
    # so the full dataset never has to fit in memory at once
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)
