Hello, currently all the logic to load datasets is hard-coded in pipelinerl/domains/math/load_datasets.py. It would be nice if one could specify the train and test datasets to point directly to ones on the Hugging Face Hub, like this:
train_dataset_names:
- {ORG}/{TRAIN_DATASET_NAME}
test_dataset_names:
- {ORG}/{TEST_DATASET_NAME}
As long as the datasets are preprocessed in the expected format, we could skip the hard-coded logic in load_datasets.py altogether and make the framework a lot more flexible / user friendly :)