A local reranker service with a Jina-compatible API.
This project provides a FastAPI-based web service that implements a reranking API endpoint (/v1/rerank) compatible with the Jina AI Rerank API. It allows you to host a reranking model entirely on your own infrastructure for enhanced privacy and performance.
- Jina-Compatible API: Implements the `/v1/rerank` endpoint structure
- Local Hosting: Run the reranker model entirely on your own infrastructure
- Multiple Backends: Supports both PyTorch and MLX backends for optimal performance
- Apple Silicon Optimization: MLX backend provides optimized performance for M1/M2/M3 chips
- MLX Fallback Reranker: Automatically wraps MLX-converted Hugging Face models that do not ship a `rerank.py` helper
- Sentence Transformers: Uses the powerful `sentence-transformers` library for the PyTorch backend
- Configurable Model: Easily switch between different reranker models and backends
- Modern FastAPI: Built using modern FastAPI features like `lifespan` for resource management
- Modern Dependencies: Updated to latest stable versions with sensible minimum requirements
- Python 3.12+
- uv (recommended, for installation and package management)
- Sufficient RAM and compute resources (CPU or GPU) depending on the chosen reranker model
PyTorch Backend:
- PyTorch 2.5+ (automatically installed)
- CUDA/MPS support for GPU acceleration (optional)
MLX Backend (Apple Silicon only):
- Apple Silicon (M1/M2/M3) Mac
- MLX and MLX-LM libraries (automatically installed)
- Optimized for memory efficiency and performance on Apple chips
```shell
# Clone the repository
git clone https://github.com/olafgeibig/local-reranker.git
cd local-reranker

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e ".[dev]"
```

The CLI supports both modern subcommands and legacy arguments for backward compatibility.
```shell
# Start server with subcommand
cli serve --backend <backend_type> [options]

# Show configuration
cli config show

# Old-style arguments still work
cli --backend <backend_type> --model <model> --host <host> --port <port>
```

Backends:

- `pytorch`: PyTorch-based reranker (default, cross-platform)
- `mlx`: MLX-based reranker (Apple Silicon optimized)

Options:

- `--backend`: Backend type to use (default: `pytorch`)
- `--model`: Model name to use (overrides the reranker default)
- `--host`: Host to bind the server to (default: `0.0.0.0`)
- `--port`: Port to bind the server to (default: `8010`)
- `--log-level`: Uvicorn log level (`debug`, `info`, `warning`, `error`, `critical`; default: `info`)
- `--reload`: Enable auto-reload for development
PyTorch Backend (default):

```shell
cli serve --backend pytorch --model jinaai/jina-reranker-v2-base-multilingual
```

MLX Backend (Apple Silicon):

```shell
cli serve --backend mlx --model jinaai/jina-reranker-v3-mlx
```

Development Mode:

```shell
cli serve --backend pytorch --reload --log-level debug
```

Configuration Management:

```shell
cli config show
```

Once the server is running, you can send requests to the `/v1/rerank` endpoint. Here's an example using curl:
```shell
curl -X POST "http://localhost:8010/v1/rerank" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jina-reranker-v2-base-multilingual",
    "query": "What are the benefits of using FastAPI?",
    "documents": [
      "FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints.",
      "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.",
      "The key features are: Fast, Fast to code, Fewer dependencies, Intuitive, Easy, Short, Robust, Standards-based.",
      "Flask is a micro web framework written in Python."
    ],
    "top_n": 3,
    "return_documents": true
  }'
```

Request fields:

- `model`: The name of the reranker model (currently ignored by the API; the configured default is used)
- `query`: The search query string
- `documents`: A list of strings or dictionaries (`{"text": "..."}`) to be reranked against the query
- `top_n`: (Optional) The maximum number of results to return
- `return_documents`: (Optional, default `false`) Whether to include the document text in results
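The same request can be made from Python with only the standard library. The helper names below are illustrative (not part of the project); the URL and port match the server defaults above:

```python
import json
import urllib.request

def build_payload(query, documents, model="jina-reranker-v2-base-multilingual",
                  top_n=None, return_documents=False):
    """Assemble the JSON body for a /v1/rerank request."""
    payload = {"model": model, "query": query, "documents": documents,
               "return_documents": return_documents}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload

def rerank_request(query, documents, url="http://localhost:8010/v1/rerank",
                   **kwargs):
    """POST the payload and return the parsed JSON response."""
    body = json.dumps(build_payload(query, documents, **kwargs)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For production use you would likely swap `urllib` for an HTTP client with timeouts and retries, but the payload shape stays the same.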
1. Clone the repository:

   ```shell
   git clone https://github.com/olafgeibig/local-reranker.git
   cd local-reranker
   ```

2. Create a virtual environment:

   ```shell
   uv venv
   source .venv/bin/activate
   ```

3. Install development dependencies:

   ```shell
   uv pip install -e ".[dev]"
   ```

4. Verify installation:

   ```shell
   # Test CLI works
   cli config show
   # Test server starts
   cli serve --backend pytorch --help
   ```
Tests are implemented using pytest. To run tests:
```shell
# Ensure virtual environment is active
python -m pytest

# Or using uv run
uv run pytest

# Run specific test categories
uv run pytest -m "not integration"  # Skip integration tests
uv run pytest -m "integration"      # Only integration tests
uv run pytest -m "slow"             # Only slow tests
```

The project uses modern development tools:
```shell
# Run linting
uv run ruff check

# Run type checking
uv run mypy src/

# Run both
uv run ruff check && uv run mypy src/
```

MLX not found:
```shell
# Ensure you're on Apple Silicon
uname -m  # Should show arm64

# Install MLX dependencies
uv add mlx mlx-lm safetensors
```

Model download fails:

```shell
# Check internet connection
# Try manual download
huggingface-cli download jinaai/jina-reranker-v3-mlx
```

Performance issues:

```shell
# Check MLX is using GPU (if available)
python -c "import mlx; print(mlx.metal.is_available())"

# Monitor memory usage
top -o mem | grep python
```

The MLX backend includes an internal cross-encoder reranker that automatically loads any MLX-converted Hugging Face repository, even when the repo does not provide a `rerank.py` helper. When `rerank.py` is missing or cannot be imported, the server logs a message similar to "Using internal cross-encoder fallback" so you can confirm which path is active. If your model ships a `projector.safetensors` file, place it next to the weights; the fallback reranker will load it to project hidden states into the correct embedding space. When the projector file is absent, the reranker falls back to raw hidden states, so you can still experiment with newly converted models without extra work.
If you're having trouble with MLX backend configuration, try these explicit CLI commands:

```shell
# Force MLX backend with explicit model
cli serve --backend mlx --model jinaai/jina-reranker-v3-mlx

# Check current configuration
cli config show

# Use development mode for debugging
cli serve --backend mlx --reload --log-level debug
```

The application uses pydantic-settings for configuration management. You can set the following environment variables to override the defaults:
```shell
# Force MLX backend
export RERANKER_RERANKER_TYPE=mlx

# Custom model name
export RERANKER_MODEL_NAME=custom-mlx-model

# Custom host and port
export RERANKER_HOST=0.0.0.0
export RERANKER_PORT=8080

# Enable debug logging
export RERANKER_LOG_LEVEL=debug

# Enable auto-reload
export RERANKER_RELOAD=true
```

Note: Using the CLI command-line options is recommended over environment variables for clarity.
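The `RERANKER_` prefix convention can be illustrated with a small stdlib-only parser. The real service uses pydantic-settings for this; the sketch below only demonstrates the prefix-and-coerce mapping, with defaults matching those documented above:

```python
import os

# Defaults mirroring the documented CLI defaults.
DEFAULTS = {
    "reranker_type": "pytorch",
    "model_name": None,
    "host": "0.0.0.0",
    "port": 8010,
    "log_level": "info",
    "reload": False,
}

def load_settings(environ=os.environ) -> dict:
    """Overlay RERANKER_-prefixed environment variables on the defaults,
    coercing each value to the type of its default."""
    settings = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(f"RERANKER_{key.upper()}")
        if raw is None:
            continue
        if isinstance(default, bool):  # check bool before int (bool is an int)
            settings[key] = raw.lower() in ("1", "true", "yes")
        elif isinstance(default, int):
            settings[key] = int(raw)
        else:
            settings[key] = raw
    return settings
```

pydantic-settings does the same work declaratively (via an `env_prefix` and typed fields), with validation errors instead of silent coercion failures.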
```
local-reranker/
├── src/local_reranker/
│   ├── __init__.py
│   ├── api.py                  # FastAPI application
│   ├── cli.py                  # Command line interface
│   ├── config.py               # Configuration management
│   ├── models.py               # Pydantic models
│   ├── reranker.py             # Base reranker interface (protocol)
│   ├── reranker_pytorch.py     # PyTorch implementation
│   ├── reranker_mlx.py         # MLX implementation
│   ├── batch_manager.py        # Batch size and document management
│   ├── batch_processor.py      # Shared batch processing logic
│   ├── result_aggregator.py    # Result aggregation and ordering
│   ├── streaming_processor.py  # Streaming result processing
│   ├── tokenization_cache.py   # Tokenization caching
│   └── utils.py                # Utility functions
├── tests/                      # Test suite
├── pyproject.toml              # Project configuration
└── README.md                   # This file
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.