
ParallelStreamingDataset does not support StreamingDataset returning more than one sample per iteration #801

@albertoveneri

Description


🚀 Feature

Hi litdata maintainers. Are there any plans to make the StreamingDataset wrappers, such as ParallelStreamingDataset, able to handle StreamingDatasets that return more than one sample per iteration? This would be useful when performing aggregation operations on the fly, such as sequence packing. More details below.

Motivation

If we want to perform an aggregation operation on the fly, such as packing samples from a StreamingDataset, a single `next()` call may account for more than one underlying sample. However, certain wrapper iterators, such as the _ParallelDatasetIterator, increment the number of samples yielded by only 1 at each iteration:

```python
self._num_samples_yielded[i] = 1 if _reset else self._num_samples_yielded[i] + 1
```
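To make the mismatch concrete, here is a minimal, hypothetical sketch (the class and key names are illustrative, not litdata API) of a packing dataset whose each yielded item consumes a variable number of underlying samples and reports that count under a reserved key:

```python
# Hypothetical sketch: a wrapper that packs token lists from an inner iterable
# into fixed-size blocks. Each packed item records, under an assumed reserved
# key, how many raw samples it consumed since the previous yield -- exactly the
# information a wrapper iterator would need to keep its counters accurate.

NUM_SAMPLES_YIELDED_KEY = "__NUM_SAMPLES_YIELDED_KEY__"  # assumed reserved key


class PackingDataset:
    """Packs samples (lists of tokens) from `inner` into blocks of `block_size`."""

    def __init__(self, inner, block_size):
        self.inner = inner
        self.block_size = block_size

    def __iter__(self):
        buffer, consumed = [], 0
        for sample in self.inner:
            buffer.extend(sample)
            consumed += 1
            # One raw sample may complete zero, one, or several blocks.
            while len(buffer) >= self.block_size:
                block, buffer = buffer[: self.block_size], buffer[self.block_size :]
                yield {"tokens": block, NUM_SAMPLES_YIELDED_KEY: consumed}
                consumed = 0


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(PackingDataset(samples, block_size=4))
```

Here the first packed block consumes two raw samples, so a wrapper that always adds 1 per iteration would undercount and, for example, resume from the wrong position after a checkpoint.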

Pitch

Just as _ParallelDatasetIterator returns to callers of next() a list containing the number of samples yielded for each dataset:

```python
__NUM_SAMPLES_YIELDED_KEY__: self._num_samples_yielded,
```

it would be useful to use a similar reserved key to let a StreamingDataset pass its own sample count up to the _ParallelDatasetIterator, by replacing

```python
self._num_samples_yielded[i] = 1 if _reset else self._num_samples_yielded[i] + 1
```

with something like:

```python
count = sample.get("__NUM_SAMPLES_YIELDED_KEY__", 1)
self._num_samples_yielded[i] = count if _reset else self._num_samples_yielded[i] + count
```
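As a self-contained illustration of the proposed update (the helper function and its name are hypothetical; the real logic lives inside the iterator), the counter advance could behave like this:

```python
# Hypothetical sketch of the proposed counter update. Variable names mirror
# the issue; this is not the actual litdata implementation.

NUM_SAMPLES_YIELDED_KEY = "__NUM_SAMPLES_YIELDED_KEY__"  # assumed reserved key


def update_num_samples_yielded(num_samples_yielded, i, sample, reset=False):
    """Advance the per-dataset counter by however many underlying samples
    the yielded item represents, defaulting to 1 for plain samples."""
    step = sample.get(NUM_SAMPLES_YIELDED_KEY, 1) if isinstance(sample, dict) else 1
    num_samples_yielded[i] = step if reset else num_samples_yielded[i] + step
    return num_samples_yielded


counters = [0, 0]
update_num_samples_yielded(counters, 0, {"x": 1})                           # plain sample: +1
update_num_samples_yielded(counters, 0, {NUM_SAMPLES_YIELDED_KEY: 3})       # packed sample: +3
```

Because the key defaults to 1 when absent, existing datasets that yield one sample per iteration would keep their current behavior unchanged.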

Alternatives

The aggregation can also be done during the optimization step, at the cost of rebuilding the dataset each time we want to change the aggregation.

Thank you in advance.

Labels: enhancement (New feature or request)