A complete, self-contained Extralit deployment bundle designed for easy deployment on HuggingFace Spaces. This package includes everything needed to run Extralit with PDF text extraction capabilities, including bundled Elasticsearch, Redis, and PyMuPDF-powered OCR processing.
This is the recommended way to get started with Extralit: you can be up and running in under 5 minutes without maintaining servers or running commands.
Click the "Deploy to Spaces" button above to create your own Extralit instance. You can use the default values, but for persistent data, you'll need to configure:
- Persistent Storage: Set to SMALL (otherwise data is lost on Space restart)
- Database: EXTRALIT_DATABASE_URL - PostgreSQL connection string
- File Storage: S3-compatible storage credentials: S3_ENDPOINT, S3_ACCESS_KEY, S3_SECRET_KEY
- OAuth: OAUTH2_HUGGINGFACE_CLIENT_ID, OAUTH2_HUGGINGFACE_CLIENT_SECRET
Leave ADMIN_USERNAME and ADMIN_PASSWORD empty - you'll sign in with your HF account as the Space owner.
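Before restarting a Space, it can help to confirm that the settings listed above are all present. The following is a minimal sketch and not part of Extralit: the helper name `missing_persistence_settings` and the grouping labels are hypothetical, and it only checks that the variables named in this README are set and non-empty.

```python
import os
from typing import Dict, List, Mapping

# Variable names are taken from this README; the grouping is for readability only.
PERSISTENCE_SETTINGS = {
    "database": ["EXTRALIT_DATABASE_URL"],
    "file storage": ["S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY"],
    "oauth": ["OAUTH2_HUGGINGFACE_CLIENT_ID", "OAUTH2_HUGGINGFACE_CLIENT_SECRET"],
}


def missing_persistence_settings(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Return the unset or empty variables, grouped by the feature they configure."""
    missing: Dict[str, List[str]] = {}
    for group, names in PERSISTENCE_SETTINGS.items():
        absent = [name for name in names if not env.get(name, "").strip()]
        if absent:
            missing[group] = absent
    return missing


if __name__ == "__main__":
    # Report against the current process environment.
    for group, names in missing_persistence_settings(os.environ).items():
        print(f"{group}: missing {', '.join(names)}")
```

An empty result means every variable from the list is set; it does not verify that the credentials themselves are valid.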
Alternatively, deploy programmatically:
import extralit as ex
# Automatically creates and configures your HF Space
authenticated_client = ex.Extralit.deploy_on_spaces(
    api_key="your_hf_token"
)

This method automatically:
- Creates a Space at https://<your-username>-extralit.hf.space
- Sets up OAuth authentication
- Creates a default workspace
- Returns an authenticated client ready to use
This HF Space package includes a complete Extralit stack:
- Extralit Server: Full annotation and dataset management platform
- PDF Text Extraction: PyMuPDF-powered hierarchical markdown extraction
- Search & Analytics: Elasticsearch 8.x for full-text search
- Background Processing: Redis + RQ workers for async tasks
- Authentication: HuggingFace OAuth integration
extralit-hf-space/
├── extralit_ocr/ # PDF extraction service
│ ├── extract.py # PyMuPDF markdown extraction
│ ├── jobs.py # RQ worker jobs
│ └── schemas.py # API schemas
├── Dockerfile # Multi-service container
├── Procfile # Process orchestration
├── scripts/start.sh # HF Space startup script
└── config/
    └── elasticsearch.yml # Elasticsearch configuration
The Space automatically configures itself, but you can customize:
- OAUTH2_HUGGINGFACE_CLIENT_ID - HF OAuth app ID
- OAUTH2_HUGGINGFACE_CLIENT_SECRET - HF OAuth secret
- OAUTH2_HUGGINGFACE_SCOPE - OAuth permissions
- EXTRALIT_DATABASE_URL - PostgreSQL connection string
- S3_ENDPOINT - S3-compatible storage endpoint
- S3_ACCESS_KEY - Storage access key
- S3_SECRET_KEY - Storage secret key
- PDF_MARKDOWN_WRITE_DIR - Directory for extracted markdown files
- PDF_MARKDOWN_WRITE_MODE - overwrite or skip existing files
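The two write modes can be pictured as follows. This is a minimal sketch of the overwrite/skip behaviour, not the code the bundled service runs: the function name `write_markdown`, the `.md` file naming, and the default of `overwrite` are assumptions made for illustration.

```python
from pathlib import Path


def write_markdown(out_dir: str, stem: str, markdown: str, mode: str = "overwrite") -> bool:
    """Write `markdown` to `<out_dir>/<stem>.md`; return True if a file was written."""
    if mode not in ("overwrite", "skip"):
        raise ValueError(f"unsupported write mode: {mode!r}")
    target = Path(out_dir) / f"{stem}.md"
    if mode == "skip" and target.exists():
        return False  # keep the existing extraction untouched
    # Create the output directory on first use, then (re)write the file.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return True
```

In this sketch, `out_dir` and `mode` would be read from PDF_MARKDOWN_WRITE_DIR and PDF_MARKDOWN_WRITE_MODE. Using `skip` avoids redoing work for PDFs that were already extracted, while `overwrite` refreshes the output on every run.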
- Navigate to your Space URL: https://<username>-extralit.hf.space
- Click "Sign in with Hugging Face"
- Authorize the application - you'll be logged in as the Space owner
Import from Hugging Face Hub:
- In the Home page, click "Import dataset from Hugging Face"
- Choose a sample dataset or enter a repo ID (e.g., stanfordnlp/imdb)
- Configure fields and questions as needed
- Give your dataset a name and start importing
Using the Python SDK:
import extralit as ex
# Connect to your Space
client = ex.client(
    api_url="https://<username>-extralit.hf.space",
    api_key="your_api_key"  # Found in My Settings
)
# Verify connection
print(client.me)
# Create a dataset
dataset = client.datasets.create(
    name="my_dataset",
    schema=my_schema
)

The bundled OCR service automatically processes PDF uploads:
- Hierarchical Extraction: Uses PyMuPDF to extract structured markdown
- Header Detection: Automatically identifies document structure
- Background Processing: Large files processed asynchronously via RQ workers
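One common way to recover document structure from a PDF is to compare each text span's font size with the body font size. The sketch below illustrates that idea on plain (text, font_size) pairs. It is not the code in extralit_ocr/extract.py: the function name `spans_to_markdown` and the ranking rule are assumptions. In a real pipeline the pairs could come from PyMuPDF's `page.get_text("dict")`, whose spans carry `text` and `size` fields.

```python
from collections import Counter
from typing import List, Tuple


def spans_to_markdown(spans: List[Tuple[str, float]]) -> str:
    """Convert (text, font_size) pairs to markdown, mapping larger fonts to headings."""
    cleaned = [(text.strip(), round(size, 1)) for text, size in spans if text.strip()]
    if not cleaned:
        return ""
    # Assume the body font is the size that covers the most characters.
    weight: Counter = Counter()
    for text, size in cleaned:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    # Rank the distinct larger sizes: largest -> "#", next -> "##", capped at level 6.
    heading_sizes = sorted({size for _, size in cleaned if size > body_size}, reverse=True)
    level = {size: min(rank + 1, 6) for rank, size in enumerate(heading_sizes)}
    lines = []
    for text, size in cleaned:
        if size in level:
            lines.append(f"{'#' * level[size]} {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)
```

Font size alone is a rough signal. Bold weight, numbering patterns, or a PDF's embedded outline may be needed for documents that set headings in the body size.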
Export your annotated datasets back to the Hub:
# Load your dataset
dataset = client.datasets(name="my_dataset")
# Export to HuggingFace Hub
dataset.to_hub(repo_id="username/my-annotated-dataset")

For local development or custom deployments:
# Clone this repository
git clone https://github.com/extralit/extralit-hf-space.git
cd extralit-hf-space
# Build the container
docker build -t extralit-hf-space .
# Run with docker-compose or standalone
docker run -p 80:80 extralit-hf-space

- Learn More: Extralit Documentation
- Tutorials: Hands-on Examples
- Advanced Setup: HF Spaces Configuration Guide
This repository is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) due to the inclusion of PyMuPDF. The AGPL-licensed components are fully isolated in this package, allowing the main Extralit server to remain Apache-2.0 licensed.