I've only had time (and disk space) to try this on 10 billion rows, but on my M1 MacBook (with 10 cores) I can process those 10 billion rows in 85 seconds. As far as I can tell, the runtime scales more or less linearly with input size, which would put 1 trillion rows at roughly 2.5 hours. That could probably be brought down further with more cores, but even as-is it makes processing the dataset viable on a single laptop.
If anyone is able to give it a try on the full dataset, please do!
import glob
import time

import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    # Per-station min/max/mean, sorted by station name.
    return (
        df.group_by("station")
        .agg(
            pl.col("measure").min().alias("min"),
            pl.col("measure").max().alias("max"),
            pl.col("measure").mean().alias("mean"),
        )
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    files = glob.glob("data/*.parquet")  # glob already returns a list
    data = pl.scan_parquet(files, rechunk=True)
    query = aggregate(data)
    # Streaming collection keeps memory bounded instead of
    # materializing all rows at once.
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)