
Commit 01e3d45

Merge branch 'main' into patch-1 (2 parents: 0f49456 + 861fec3)
14 files changed (+2594, −134 lines)

14 files changed

+2594
-134
lines changed

.github/workflows/ci.yml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.13']
        packages: [['pyspark>=4.0.0'], ['pyspark==3.5.6', 'numpy<2.0.0']]
        exclude:
          - python-version: '3.13'
            packages: ['pyspark==3.5.6', 'numpy<2.0.0']
      fail-fast: false

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: |
          echo "${{ matrix.python-version }}" > .python-version
          uv add --dev "${{ join(matrix.packages, '" "') }}"
          uv sync

      - name: Run tests
        run: |
          uv run pytest
```
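A single cell of the test matrix above can be reproduced locally by running the same commands the workflow runs. This is a sketch, assuming `uv` is already installed and you are in the project root; the pins mirror the PySpark 3.5 / Python 3.9 job:

```shell
# Mirror the pyspark==3.5.6 / Python 3.9 matrix cell from ci.yml
# (assumes uv is installed and the current directory is the project root)
echo "3.9" > .python-version
uv add --dev "pyspark==3.5.6" "numpy<2.0.0"
uv sync
uv run pytest
```

The Python 3.13 + PySpark 3.5 combination is excluded in the matrix, so there is no corresponding local setup for it.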

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.9

README.md

Lines changed: 104 additions & 2 deletions
@@ -1,2 +1,104 @@
Removed (old README):

```
# pyspark_huggingface
PySpark custom data source for Hugging Face Datasets
```

Added (new README):

<p align="center">
  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;">
  <br/>
  <br/>
</p>

<p align="center">
  <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
  <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

# Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```

## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
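Note that the `filters` and `columns` option values in the example above are strings containing list literals, not Python lists. A quick illustration of the expected shape, parsing the same strings with `ast.literal_eval` (this helper is our own illustration, not part of the package; the `(column, op, value)` triples appear to follow the PyArrow-style filter convention):

```python
import ast

# The option strings are Python-style literals; parsing them back
# shows the structure the reader is given.
filters = ast.literal_eval('[("language_score", ">", 0.99)]')
columns = ast.literal_eval('["text", "language_score"]')

print(filters)  # [('language_score', '>', 0.99)]
print(columns)  # ['text', 'language_score']
```

Each filter is a `(column, operator, value)` triple; multiple triples in the list are combined as a conjunction.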

## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed and registers the "huggingface" data source.

## Development

[Install uv](https://docs.astral.sh/uv/getting-started/installation/) if not already done.

Then, from the project root directory, sync dependencies and run tests:

```
uv sync
uv run pytest
```
