I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) I can process those 10 billion rows in 85 seconds. As far as I can tell, the runtime scales more or less linearly with input size, which would put 1 trillion rows at roughly 2.5 hours. That could probably be brought down further with more cores, but even as-is it makes processing the dataset viable on a single laptop.
If anyone is able to give it a try on the full dataset, please do!
import glob
import time

import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    # Per-station min/max/mean, sorted by station name.
    return (
        df.group_by("station")
        .agg(
            pl.col("measure").min().alias("min"),
            pl.col("measure").max().alias("max"),
            pl.col("measure").mean().alias("mean"),
        )
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    files = glob.glob("data/*.parquet")  # glob already returns a list
    data = pl.scan_parquet(files, rechunk=True)
    query = aggregate(data)
    # Streaming collection keeps memory bounded instead of
    # materializing all rows at once.
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)