Description
🐛 Bug
It is not possible to use a custom index_path with StreamingDataset when input_dir is a local path. This appears to be caused by a check in subsample_streaming_dataset that determines whether input_dir is a URL:
if not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str):
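To illustrate why a local input_dir would skip this branch, here is a minimal sketch. The `Dir` dataclass and the `passes_url_check` helper are hypothetical stand-ins, not litdata's actual code, and the sketch assumes that a local input_dir resolves to an object whose `url` is `None`.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dir:
    """Hypothetical stand-in for a resolved input directory."""
    path: Optional[str] = None
    url: Optional[str] = None


def passes_url_check(input_dir: Dir, cache_index_filepath: str) -> bool:
    # Same condition as the line quoted above.
    return not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str)


# A path inside a fresh temporary directory, so the index file does not exist yet.
missing_index = os.path.join(tempfile.mkdtemp(), "index.json")

# Assumption: a local input_dir has no URL, a remote one does.
local_dir = Dir(path="tiny_stories_data/data", url=None)
remote_dir = Dir(path=None, url="s3://hypothetical-bucket/data")

local_result = passes_url_check(local_dir, missing_index)
remote_result = passes_url_check(remote_dir, missing_index)
```

Under that assumption the condition is false for the local directory and true for the remote one, which would match the behavior reported here.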
To Reproduce
As a simple example, download a small dataset from Hugging Face as Parquet files and index it with index_parquet_dataset:
from huggingface_hub import snapshot_download
import litdata as ld
def main():
    snapshot_download(repo_id="roneneldan/TinyStories", local_dir="tiny_stories_data", repo_type="dataset")
    ld.index_parquet_dataset("tiny_stories_data/data", cache_dir="my-custom-cache")

if __name__ == "__main__":
    main()
Then create a StreamingDataset with:
dataset = StreamingDataset("tiny_stories_data/data", shuffle=True, index_path="my-custom-cache/index.json", item_loader=ParquetLoader())
This raises:
ValueError: The provided dataset `tiny_stories_data/data` doesn't contain any index.json file.
HINT: Did you successfully optimize a dataset to the provided `input_dir`?
Expected behavior
The index_path docstring states:
Path to index.json for the Parquet dataset. If index_path is a directory, the function will look for index.json within it. If index_path is a full file path, it will use that directly.
Therefore, to my understanding, StreamingDataset should accept an index_path that points to an index.json located outside input_dir (for example, in a separate folder) and use that index regardless of whether input_dir is a local path or a URL.
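The resolution rule described in the docstring could be sketched roughly as follows. `resolve_index_path` is a hypothetical helper written for this issue, not a litdata function; it only shows the directory-versus-file behavior I would expect, independent of what input_dir is.

```python
import os


def resolve_index_path(index_path: str) -> str:
    """Hypothetical helper mirroring the index_path docstring."""
    # A directory means "look for index.json inside it".
    if os.path.isdir(index_path):
        return os.path.join(index_path, "index.json")
    # Anything else is treated as the full path to the index file.
    return index_path
```

With this rule, both "my-custom-cache" and "my-custom-cache/index.json" would point at the same index file, whether input_dir is a local path or a URL.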
Additional context
The issue can be worked around by prefixing the path with local:, but the documentation does not make it clear whether that prefix should be required. Thank you in advance!