
Commit 01e3d45

Merge branch 'main' into patch-1 (2 parents: 0f49456 + 861fec3)
14 files changed (+2594, −134 lines)

14 files changed

+2594
-134
lines changed

.github/workflows/ci.yml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.13']
        packages: [['pyspark>=4.0.0'], ['pyspark==3.5.6', 'numpy<2.0.0']]
        exclude:
          - python-version: '3.13'
            packages: ['pyspark==3.5.6', 'numpy<2.0.0']
      fail-fast: false

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: |
          echo "${{ matrix.python-version }}" > .python-version
          uv add --dev "${{ join(matrix.packages, '" "') }}"
          uv sync

      - name: Run tests
        run: |
          uv run pytest
```
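A single cell of the test matrix above can be reproduced locally by running the same commands the workflow runs. This is a sketch, assuming `uv` is already installed and you are in the project root; the pins mirror the PySpark 3.5 / Python 3.9 job:

```shell
# Mirror the pyspark==3.5.6 / Python 3.9 matrix cell from ci.yml
# (assumes uv is installed and the current directory is the project root)
echo "3.9" > .python-version
uv add --dev "pyspark==3.5.6" "numpy<2.0.0"
uv sync
uv run pytest
```

The Python 3.13 + PySpark 3.5 combination is excluded in the matrix, so there is no corresponding local setup for it.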

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.9

README.md

Lines changed: 104 additions & 2 deletions
@@ -1,2 +1,104 @@
Removed (old README):

```
# pyspark_huggingface
PySpark custom data source for Hugging Face Datasets
```

Added (new README):

<p align="center">
  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;">
  <br/>
  <br/>
</p>

<p align="center">
  <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
  <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

# Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
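Note that the `filters` and `columns` option values in the example above are strings containing list literals, not Python lists. A quick illustration of the expected shape, parsing the same strings with `ast.literal_eval` (this helper is our own illustration, not part of the package; the `(column, op, value)` triples appear to follow the PyArrow-style filter convention):

```python
import ast

# The option strings are Python-style literals; parsing them back
# shows the structure the reader is given.
filters = ast.literal_eval('[("language_score", ">", 0.99)]')
columns = ast.literal_eval('["text", "language_score"]')

print(filters)  # [('language_score', '>', 0.99)]
print(columns)  # ['text', 'language_score']
```

Each filter is a `(column, operator, value)` triple; multiple triples in the list are combined as a conjunction.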

## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed and registers the "huggingface" data source.

## Development

[Install uv](https://docs.astral.sh/uv/getting-started/installation/) if not already done.

Then, from the project root directory, sync dependencies and run tests:

```
uv sync
uv run pytest
```
