# pyspark_huggingface

<p align="center">
  <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
  <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets directly from Hugging Face to your Spark application
- Select subsets and splits
- Apply projection and predicate filters for Parquet datasets
- Push Spark DataFrames as Parquet files to the Hugging Face Dataset Hub
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```

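Since the `filters` and `columns` options are passed as strings, one way to build them from Python objects instead of hand-writing literals is with `str()`. This is a sketch under the assumption that the data source parses the option values as Python literals, consistent with the format shown above:

```python
# Build the option strings from Python objects rather than writing the
# literals by hand. Assumption: the data source accepts Python literal
# syntax (single or double quotes), matching the format used above.
filters = [("language_score", ">", 0.99)]
columns = ["text", "language_score"]

filters_option = str(filters)
columns_option = str(columns)

print(filters_option)  # [('language_score', '>', 0.99)]
print(columns_option)  # ['text', 'language_score']
```

This keeps the predicate and projection definitions as ordinary Python data, which is easier to compose programmatically than string literals.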
## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
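
The rule above reduces to a simple version check: import the package explicitly on Spark 3.x, do nothing on Spark 4+. A tiny helper (hypothetical, not part of the package) that encodes this:

```python
# Hypothetical helper (not part of pyspark_huggingface): decide whether the
# explicit import is needed for a given PySpark version string, per the rule
# described above (Spark 3.x needs it, Spark 4+ auto-imports).
def needs_backport(pyspark_version: str) -> bool:
    major = int(pyspark_version.split(".")[0])
    return major < 4

print(needs_backport("3.5.1"))  # True
print(needs_backport("4.0.0"))  # False
```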