index_path is ignored in StreamingDataset when input_dir is a local directory #800

@albertoveneri

🐛 Bug

It is not possible to use a custom index_path with StreamingDataset when input_dir is a local path. This appears to be caused by a check in subsample_streaming_dataset that determines whether input_dir is a URL:

if not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str):
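To illustrate why a local input_dir would fall through this check, here is a minimal sketch. The `Dir` class and the `branch_is_taken` helper are hypothetical stand-ins, not litdata code; the sketch assumes that a plain local path is resolved with `url` left as `None`, which is what the quoted condition suggests. What the guarded branch actually does is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dir:
    """Hypothetical stand-in for the resolved input_dir object (assumption)."""

    path: Optional[str] = None
    url: Optional[str] = None


def branch_is_taken(input_dir: Dir, cache_index_filepath: str) -> bool:
    # Same condition as the line quoted above: the branch only runs when the
    # cached index is missing AND input_dir carries a string URL.
    return not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str)


# A local directory has no URL, so the branch is skipped even though
# the cached index file does not exist.
local_dir = Dir(path="tiny_stories_data/data")
remote_dir = Dir(url="s3://some-bucket/data")  # hypothetical URL

print(branch_is_taken(local_dir, "/nonexistent-cache/index.json"))
print(branch_is_taken(remote_dir, "/nonexistent-cache/index.json"))
```

Under these assumptions, any index handling that lives inside that branch never runs for a local path, which would match the error shown in the reproduction below.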

To Reproduce

As a simple example, download a small dataset from Hugging Face as Parquet files and index it with index_parquet_dataset:

from huggingface_hub import snapshot_download
import litdata as ld

def main():
    snapshot_download(repo_id="roneneldan/TinyStories", local_dir="tiny_stories_data", repo_type="dataset")
    ld.index_parquet_dataset("tiny_stories_data/data", cache_dir="my-custom-cache")


if __name__ == "__main__":
    main()

Then create a StreamingDataset with:

from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader

dataset = StreamingDataset("tiny_stories_data/data", shuffle=True, index_path="my-custom-cache/index.json", item_loader=ParquetLoader())

This raises:

ValueError: The provided dataset `tiny_stories_data/data` doesn't contain any index.json file.
 HINT: Did you successfully optimize a dataset to the provided `input_dir`?

Expected behavior

The index_path docstring states:

Path to index.json for the Parquet dataset. If index_path is a directory, the function will look for index.json within it. If index_path is a full file path, it will use that directly.

Therefore, to my understanding, StreamingDataset should accept an index_path that points to an index.json located outside input_dir (for example, in a separate folder) and use that index regardless of whether input_dir is a local path or a URL.
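The resolution rule described by the docstring can be sketched as follows. `resolve_index_path` is a hypothetical helper written only for illustration, not a litdata function; it shows the behavior I would expect to apply the same way for local and remote input_dir values.

```python
import os


def resolve_index_path(index_path: str) -> str:
    """Hypothetical helper mirroring the docstring's rule (not litdata code)."""
    # If index_path is a directory, look for index.json inside it.
    if os.path.isdir(index_path):
        return os.path.join(index_path, "index.json")
    # Otherwise treat index_path as the full path to the index file.
    return index_path
```

With this rule, both `index_path="my-custom-cache"` and `index_path="my-custom-cache/index.json"` would point at the same file once the cache directory exists.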

Additional context

The issue can be worked around by prefixing the path with local:, but the documentation does not make it clear whether that prefix should be required. Thank you in advance!

Labels: bug, help wanted