Commit 7741854

update readme
1 parent 66fce87 commit 7741854


README.md

Lines changed: 86 additions & 1 deletion
# pyspark_huggingface

PySpark custom data source for Hugging Face Datasets

<p align="center">
<a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
<a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>
A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets directly from Hugging Face to your Spark application
- Select subsets and splits
- Apply projection and predicate filters for Parquet datasets
- Push Spark DataFrames as Parquet files to the Hugging Face Dataset Hub
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```

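The `filters` and `columns` option values are plain strings containing Python-style literals. As an illustrative sketch (this local parsing is an assumption for clarity, not the library's internal code), each string decodes into a list of `(column, op, value)` triples, resembling PyArrow's filter convention, or a list of column names:

```python
import ast

# Parse the option strings from the example above (illustration only,
# not how pyspark_huggingface processes them internally).
filters = ast.literal_eval('[("language_score", ">", 0.99)]')
columns = ast.literal_eval('["text", "language_score"]')

print(filters)  # [('language_score', '>', 0.99)]
print(columns)  # ['text', 'language_score']
```

Passing the literals as strings keeps them valid Spark option values, which must be scalar.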
## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
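The version rule above can be sketched as a small helper (hypothetical, for illustration; not part of the package):

```python
def needs_backport_import(spark_version: str) -> bool:
    """Return True when the explicit `import pyspark_huggingface` is needed.

    Per the note above: Spark 3.3-3.5 require the import to patch the
    reader/writer, while Spark 4 auto-imports the package once installed.
    """
    major = int(spark_version.split(".")[0])
    return major < 4

print(needs_backport_import("3.5.1"))  # True
print(needs_backport_import("4.0.0"))  # False
```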
