Shuffle buffer for controlling sample correlation #797

## 🚀 Feature
Add a sample-level shuffle buffer, like the one in `webdataset`, to `StreamingDataLoader`, so that shuffling is more random when a dataset is built with correlation between neighboring samples.

### Motivation
I have data in which neighboring samples are correlated at build time, and I observe that with a small number of workers this correlation shows up in the samples drawn during training. This is because litdata only shuffles within a chunk, so when the chunk is large and many of its samples are correlated, the randomly drawn samples are still correlated with each other. Exact shuffling is, in general, not tractable, but a good heuristic is to maintain an explicit shuffle buffer, as `webdataset` does. As long as the buffer is larger than a chunk, the correlation between consecutive samples should be reduced.
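To make the request concrete, here is a minimal sketch of the kind of sample-level shuffle buffer I have in mind. It is plain Python and does not use any litdata or `webdataset` API; the function name `shuffle_buffer` and its `buffer_size` and `seed` parameters are hypothetical, chosen only for illustration.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_buffer(
    samples: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer.

    Hypothetical helper for illustration, not part of litdata.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    rng = random.Random(seed)
    buffer: List[T] = []

    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
            continue
        # Emit a random buffered sample and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample

    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Wrapping the per-worker sample stream as `shuffle_buffer(stream, buffer_size=...)` would mix samples across chunk boundaries once `buffer_size` exceeds the number of samples in a chunk. With several workers, each worker would presumably keep its own buffer, so memory use grows with both the buffer size and the worker count.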

### My case
Here's an example:
<img width="447" height="263" alt="Image" src="https://github.com/user-attachments/assets/8f80af47-f59c-456a-aafb-188b5f5bbef7" />

The logged value is the sparsity of the input. You can see that distinct clusters of sparsity persist for quite a while, likely because the samples are being drawn from the same chunk.
