
Add Global max_cache_size to CombinedStreamingDataset to Enforce a Total Cache Budget Across Child Datasets #790

@hvkoops

Description

🚀 Feature

Add an optional max_cache_size at the CombinedStreamingDataset level that enforces a single total cache budget across all child StreamingDatasets, instead of only per-dataset cache limits.

Today, max_cache_size is only defined on StreamingDataset and passed into Cache -> BinaryReader -> PrepareChunksThread, where eviction is triggered based on the size of that dataset's cache directory.

Motivation

When composing many streaming datasets (e.g. 50+), each StreamingDataset can independently grow its cache up to its own max_cache_size (default is "100GB").

In a CombinedStreamingDataset, this can easily lead to runaway disk usage because eviction is enforced per dataset cache dir, not across the combined set:

  • CombinedStreamingDataset simply holds a list of StreamingDatasets and instantiates iterators for each
    (self._dataset_iters = [iter(dataset) for dataset in datasets]).
  • Each StreamingDataset lazily creates its own Cache(...) with max_cache_size=self.max_cache_size.
  • Eviction happens when _get_folder_size(self._config._cache_dir, ...) >= self._max_cache_size inside PrepareChunksThread.

So with N datasets you can effectively consume ~N × max_cache_size of local disk. With 50 datasets and the defaults, that is an upper bound of roughly 5TB, even if you only meant to budget ~100GB in total.
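As a quick sanity check on that bound, the arithmetic is purely illustrative (only the "100GB" default comes from litdata):

```python
# Worst case: every child cache grows to its own limit independently,
# so the combined footprint is the sum of the per-dataset limits.
num_datasets = 50
per_dataset_limit_gb = 100  # StreamingDataset's default max_cache_size ("100GB")

upper_bound_gb = num_datasets * per_dataset_limit_gb
print(upper_bound_gb)  # 5000 GB, i.e. ~5 TB of local scratch
```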

This is especially painful in multi-node / shared environments where local scratch space is limited, and it's easy to miss because each dataset looks "correctly configured" in isolation.

Pitch

Introduce max_cache_size (and optionally an allocation strategy) on CombinedStreamingDataset, and have it enforce a global cache budget by distributing that budget across the child datasets before their caches are instantiated.

Proposed API:

CombinedStreamingDataset(
    datasets=[...],
    seed=42,
    iterate_over_all=True,
    batching_method="stratified",
    max_cache_size="200GB",                # NEW: total budget across all datasets
    cache_allocation="proportional",       # optional: "equal" | "proportional"
)

Behavior:

  1. If max_cache_size is not provided: keep current behavior (backward compatible).
  2. If provided: compute a per-dataset budget and apply it to each child dataset's StreamingDataset.max_cache_size before calling iter(dataset) (since iterator construction triggers cache creation).

Allocation strategies:

  • equal: per_ds = total_budget / num_datasets
  • proportional (recommended default): allocate by combined sampling weights (already computed in CombinedStreamingDataset.__init__).
    • e.g. per_ds_i = total_budget * weight_i
    • If a dataset is sampled more often, it benefits more from cache headroom.
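A minimal sketch of the two strategies; `parse_size` and `allocate_budgets` are hypothetical helper names (not litdata APIs), and litdata's real size parsing may differ:

```python
# Illustrative helpers for splitting a combined cache budget.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def parse_size(size):
    """Convert a "200GB"-style string (or a raw byte count) to bytes."""
    if isinstance(size, int):
        return size
    for suffix, factor in _UNITS.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)

def allocate_budgets(total, weights, strategy="proportional"):
    """Split a total budget across datasets; returns per-dataset bytes."""
    total_bytes = parse_size(total)
    if strategy == "equal":
        # per_ds = total_budget / num_datasets
        return [total_bytes // len(weights)] * len(weights)
    # per_ds_i = total_budget * weight_i (weights normalized defensively)
    norm = sum(weights)
    return [int(total_bytes * (w / norm)) for w in weights]

budgets = allocate_budgets("200GB", [0.5, 0.3, 0.2])
```

Note the proportional split never exceeds the total, since each share is rounded down.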

Implementation details (minimal-intrusion approach):

  • Add max_cache_size: int | str | None = None and cache_allocation: Literal["equal","proportional"] = "proportional" to CombinedStreamingDataset.__init__.
  • In CombinedStreamingDataset.__iter__, before creating _CombinedDatasetIterator, apply the computed per-dataset budget:
    • Set dataset.max_cache_size = per_ds_budget for each StreamingDataset child.
    • StreamingDataset uses self.max_cache_size when constructing Cache(...), which passes it down to BinaryReader(..., max_cache_size=...), the value PrepareChunksThread uses for eviction.
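The ordering above can be sketched with a stand-in class; everything here except the max_cache_size attribute name is illustrative, and the stand-in only mimics the one property that matters: the cache is built lazily when an iterator is created, using whatever self.max_cache_size holds at that moment.

```python
class FakeStreamingDataset:
    """Stand-in for StreamingDataset (illustrative, not litdata's class)."""

    def __init__(self, items, max_cache_size="100GB"):
        self.items = items
        self.max_cache_size = max_cache_size
        self.cache_budget_at_iter = None  # records what Cache(...) would see

    def __iter__(self):
        # In litdata, this is where Cache(max_cache_size=...) gets built.
        self.cache_budget_at_iter = self.max_cache_size
        return iter(self.items)


def apply_budgets_then_iter(datasets, per_ds_budgets):
    """Apply per-dataset budgets first, then construct the iterators."""
    for ds, budget in zip(datasets, per_ds_budgets):
        ds.max_cache_size = budget
    # Mirrors CombinedStreamingDataset's existing
    # `[iter(dataset) for dataset in datasets]`.
    return [iter(ds) for ds in datasets]
```

Because the budget is applied before iter(), each child's Cache is constructed with the combined-level allocation rather than its own default.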

Edge cases / notes:

  • If a child dataset already has a user-specified max_cache_size, you can either:
    • override it when combined-level budget is set, or
    • allow optional cache_allocation="cap_only" semantics (combined-level acts as an upper cap but doesn't override smaller values).
  • If iterate_over_all=True and datasets are removed/re-added during iteration, you could keep it simple and allocate once based on initial set; or recompute allocation when active datasets change.
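The override-vs-cap choice for user-specified child budgets could look like the following sketch (resolve_budget is a hypothetical helper; all values are bytes):

```python
def resolve_budget(user_bytes, cap_bytes, mode="override"):
    """Resolve one child's budget against its combined-level allocation.

    mode="override": the combined-level allocation always wins.
    mode="cap_only": a smaller user-specified value is preserved; the
                     allocation only lowers larger (or unset) values.
    """
    if mode == "cap_only" and user_bytes is not None:
        return min(user_bytes, cap_bytes)
    return cap_bytes
```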

Eviction is already implemented and robust at the per-directory level in PrepareChunksThread; this proposal only ensures the sum across directories stays bounded by controlling each directory's budget.

Alternatives

  1. Manually set max_cache_size on every StreamingDataset
    • Becomes brittle and tedious with large mixtures.
    • Users still need to manually update configs when the number of datasets changes.
  2. Use a single shared cache_dir
    • Risky: different datasets may clash in the same directory (and even if they don't today, it's not an advertised contract).
    • Still doesn't solve budgeting unless eviction becomes global over that directory.
  3. Implement a true global eviction policy across dataset cache dirs
    • More correct but significantly more complex:
      • need global accounting of chunk sizes across dirs
      • need a consistent definition of "oldest" or "least needed" across datasets
      • need safe deletion under concurrent workers/processes (locks already exist per-chunk, but cross-dataset coordination is non-trivial).
    • The proposed budget-distribution approach solves the practical disk-exhaustion problem with minimal churn.

Additional context

Relevant code paths showing where the current per-dataset-only limit is
enforced:

  • CombinedStreamingDataset constructs iterators for each dataset (no
    cache budgeting logic at this level today).
  • StreamingDataset accepts max_cache_size (default "100GB") and
    stores it.
  • StreamingDataset._create_cache() passes
    max_cache_size=self.max_cache_size into Cache.
  • Cache passes max_cache_size into BinaryReader, which spawns
    PrepareChunksThread(max_cache_size=...).
  • PrepareChunksThread triggers deletion based on
    _get_folder_size(self._config._cache_dir) >= self._max_cache_size
    (per cache dir).
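The last point is the crux, so here is a simplified restatement of that per-directory trigger; the real logic lives in litdata's PrepareChunksThread and also decides which chunks to delete, which is omitted here:

```python
import os

def folder_size(path: str) -> int:
    """Total size in bytes of all files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def should_evict(cache_dir: str, max_cache_size: int) -> bool:
    """Eviction fires when one directory reaches its own limit --
    nothing accounts for the sum across sibling cache dirs."""
    return folder_size(cache_dir) >= max_cache_size
```

With one cache dir per dataset, each should_evict check is blind to its siblings, which is exactly why N directories can together hold N × max_cache_size.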

This feature would prevent accidental disk exhaustion in large multi-dataset training setups while remaining fully backward compatible and minimally invasive to the existing cache/eviction architecture.
