This directory contains interactive Jupyter notebooks demonstrating Loclean's features.
We strongly recommend using Jupyter notebooks to run these examples:
- Interactive: Run cells individually and see results immediately
- Explorable: Modify code and experiment with different inputs
- Educational: See outputs, errors, and intermediate results
- Shareable: Easy to share with others
Loclean uses Ollama for local inference. Install it once:
```bash
# Linux / WSL
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# All platforms → https://ollama.com/download
```

Note: You do not need to manually start the daemon or pull models. Loclean handles both automatically on first use:
- Auto-start — if the Ollama daemon is not running, Loclean launches it for you.
- Auto-pull — if the requested model is missing, Loclean downloads it with a progress bar.
```bash
# Install Jupyter
pip install jupyter

# Start Jupyter
jupyter notebook

# Or use JupyterLab
pip install jupyterlab
jupyter lab
```

Then open any `.ipynb` file in the browser.
VS Code has built-in Jupyter notebook support. Just open any .ipynb file.
Upload any .ipynb file to Google Colab and run it there.
Start here! Core features and basic usage:
- Structured extraction with Pydantic
- Data cleaning with dataframes
- Privacy scrubbing
- Working with Pandas/Polars
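The Pydantic-driven extraction covered in the quick-start notebook can be sketched roughly as follows. The `Invoice` schema, its field names, and the commented `loclean.extract(...)` call are all illustrative assumptions, not Loclean's actual API:

```python
from pydantic import BaseModel

# Hypothetical extraction schema -- field names are illustrative only.
class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

# A schema like this is typically handed to the extraction call so the
# LLM output is validated into typed fields, e.g. (illustrative only):
#   loclean.extract("ACME Corp billed $1,200.50", schema=Invoice)

# Pydantic coerces the string "1200.50" into a float during validation.
record = Invoice(vendor="ACME Corp", total="1200.50", currency="USD")
print(record.vendor, record.total)
```

The value of the schema is that malformed LLM output fails validation loudly instead of silently producing untyped data.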
Comprehensive data cleaning examples:
- Basic usage and custom instructions
- Working with different backends
- Batch and parallel processing
- Handling missing values
- Model selection
Privacy-first PII scrubbing:
- Mask and replace modes
- Selective scrubbing strategies
- Locale support
- Before/after examples
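To give a feel for what mask mode produces, here is a stdlib-only sketch. Loclean's actual scrubbing is LLM-assisted and locale-aware; the regexes below are hand-written stand-ins for illustration:

```python
import re

def mask_pii(text: str) -> str:
    """Toy mask-mode scrubber: replace emails and US-style phone
    numbers with placeholder tokens. Conceptual sketch only."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", "[PHONE]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or 555-123-4567"))
# Contact [EMAIL] or [PHONE]
```

Replace mode would substitute realistic fake values (hence the `loclean[privacy]` extra) instead of fixed tokens.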
For more advanced features like model selection, caching strategies, and performance optimization, check out the full documentation.
Advanced structured extraction:
- Complex nested schemas
- Union types
- Error handling and retries
- Performance optimization
- Caching demonstrations
Debugging and detailed logging:
- Enabling verbose mode
- Seeing raw LLM prompts and outputs
- Debugging Pydantic validation issues
- Global configuration via environment variables
Entity resolution — canonicalize messy string variations:
- Merge company-name typos, abbreviations, casing
- Configurable similarity threshold
- Before/after comparison
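The similarity-threshold idea can be sketched with stdlib `difflib`. Loclean's resolver is LLM-assisted; this toy version only shows how a threshold decides which variants merge:

```python
from difflib import SequenceMatcher

def canonicalize(values, threshold=0.8):
    """Toy entity resolution: map each string to the first previously
    seen value whose similarity ratio clears the threshold."""
    canon = []
    mapping = {}
    for v in values:
        key = v.strip().lower()
        match = next(
            (c for c in canon
             if SequenceMatcher(None, key, c.lower()).ratio() >= threshold),
            None,
        )
        if match is None:
            canon.append(v.strip())
            match = v.strip()
        mapping[v] = match
    return mapping

names = ["Acme Corp", "acme corp.", "ACME Corporation", "Globex"]
resolved = canonicalize(names)
print(resolved)
```

At the default 0.8 threshold, `"acme corp."` merges into `"Acme Corp"` while `"ACME Corporation"` (ratio ≈ 0.72) stays separate; lowering the threshold to 0.7 would merge it too.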
Semantic oversampling for imbalanced datasets:
- Pydantic-schema-driven synthetic record generation
- Minority-class augmentation
- Class distribution balancing
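The balancing arithmetic behind oversampling is simple to show: count how many synthetic records each minority class needs to match the majority class. Loclean generates the records themselves via an LLM against a Pydantic schema; this sketch only computes the deficits:

```python
from collections import Counter

# Toy class labels for an imbalanced dataset (illustrative data).
labels = ["ok"] * 8 + ["fraud"] * 2 + ["chargeback"] * 1

counts = Counter(labels)
target = max(counts.values())
# Records to synthesize per minority class to reach the majority count.
deficit = {cls: target - n for cls, n in counts.items() if n < target}
print(deficit)  # {'fraud': 6, 'chargeback': 7}
```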
Log shredding — parse unstructured logs into relational tables:
- Mixed log format parsing (auth, API, payment, inventory, ML)
- Automatic schema inference
- One column → multiple normalized DataFrames
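A toy version of shredding, with hand-written patterns instead of LLM-inferred schemas, shows the one-column-to-many-tables shape:

```python
import re
from collections import defaultdict

# Hand-written patterns standing in for Loclean's inferred schemas.
patterns = {
    "auth": re.compile(r"LOGIN user=(?P<user>\w+) ok=(?P<ok>\w+)"),
    "api": re.compile(r"GET (?P<path>/\S+) (?P<status>\d{3})"),
}

# Route each mixed log line into its category's record list
# (one list per eventual normalized table).
tables = defaultdict(list)
for line in [
    "LOGIN user=alice ok=true",
    "GET /orders 200",
    "LOGIN user=bob ok=false",
]:
    for name, pat in patterns.items():
        m = pat.search(line)
        if m:
            tables[name].append(m.groupdict())

print(dict(tables))
```

Each record list would then become one relational DataFrame.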
Automated feature discovery:
- LLM-proposed mathematical transformations
- Housing price dataset example
- Mutual information maximization with the target variable
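The score being maximized, mutual information between a candidate feature and the target, can be computed directly for discrete variables (illustrative computation only; the notebook uses an LLM to propose the transformations):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Discrete mutual information I(X;Y) in bits."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# A perfectly informative binary feature carries 1 bit about the target;
# an unrelated one carries 0 bits.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```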
Data quality validation with natural-language rules:
- Plain-English constraint definitions
- Structured compliance reports
- Multi-rule evaluation
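The shape of a structured compliance report can be sketched with hand-written predicates. In Loclean the rules are stated in plain English and interpreted by the LLM; here each rule is a Python lambda for illustration:

```python
# Illustrative rows and rules -- not Loclean's report format.
rows = [
    {"age": 34, "email": "a@x.com"},
    {"age": -2, "email": ""},
]
rules = {
    "age must be non-negative": lambda r: r["age"] >= 0,
    "email must be present": lambda r: bool(r["email"]),
}

# Evaluate every rule against every row and collect violating indices.
report = {
    name: {"violations": [i for i, r in enumerate(rows) if not check(r)]}
    for name, check in rules.items()
}
for result in report.values():
    result["passed"] = not result["violations"]
print(report)
```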
🏠 Data Science — Kaggle-style housing prediction workflow:
- Clean messy strings → entity resolution → feature discovery
- Minority-class oversampling → quality validation → PII scrubbing
- Full pipeline with `qwen2.5-coder:1.5b`
🔧 Data Engineering — log processing and warehouse loading:
- Structured extraction with Pydantic schemas
- Compiled extraction for high-performance parsing
- Log shredding into relational tables → quality gates → PII masking
Trap feature detection and removal:
- Statistical profiling of numeric columns
- LLM-verified Gaussian noise detection
- Before/after column comparison with verdicts
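One piece of the statistical profiling step can be shown with plain Pearson correlation: a numeric column that carries no signal about the target is a trap candidate. The LLM verification stage is not reproduced here, and the data is illustrative:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

target = [1.0, 2.0, 3.0, 4.0, 5.0]
signal = [1.1, 2.0, 2.9, 4.2, 5.0]   # tracks the target
noise = [3.0, -1.0, 2.5, 0.5, 1.0]   # pure noise, no relation to target

print(round(pearson(signal, target), 3))
print(round(pearson(noise, target), 3))
```

A near-zero correlation alone is not proof of a trap feature (relations can be nonlinear), which is why the notebook adds LLM verification on top of the profile.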
Missing Not At Random (MNAR) pattern detection:
- Detect informative missingness patterns
- Automatic boolean feature flag encoding
- Clinical dataset example (income ↔ employment)
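The boolean flag encoding is the simple part of MNAR handling and can be sketched directly. Column names are illustrative; the point is that missing income is itself informative (tied to employment), so it is kept as a feature rather than silently imputed:

```python
# Illustrative clinical-style rows where income is missing
# precisely when the person is unemployed (MNAR pattern).
rows = [
    {"employed": True, "income": 52_000},
    {"employed": False, "income": None},
    {"employed": False, "income": None},
]

# Encode the missingness pattern as an explicit boolean feature.
for row in rows:
    row["income_missing"] = row["income"] is None

flags = [r["income_missing"] for r in rows]
print(flags)  # [False, True, True]
```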
Target leakage detection and removal:
- Semantic timeline evaluation per column
- Domain-aware reasoning (loan approval example)
- Automatic removal of leaked features
Reward-driven prompt optimization:
- Generates structural instruction variations
- Scores each against validation sample (field-level F1)
- Returns the best-performing extraction instruction
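The field-level F1 used to score candidate instructions can be illustrated as follows. This is an assumption about the scoring shape (exact-match per field), not Loclean's internal metric:

```python
def field_f1(pred: dict, gold: dict) -> float:
    """Field-level F1 between a predicted and a gold extraction:
    a field is a true positive when its value matches exactly."""
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"name": "Ada", "city": "London", "age": 36}
pred = {"name": "Ada", "city": "Paris", "age": 36}
print(field_f1(pred, gold))  # 2 of 3 fields match -> F1 = 2/3
```

The optimizer would run each instruction variant over a validation sample, average this score, and keep the best-scoring instruction.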
| Script | Description |
|---|---|
| `benchmark.py` | Performance benchmark: vectorized dedup + cache speedup on 100K rows |
| `eval_demo.py` | Evaluation framework demo with optional Langfuse tracking |
This directory contains:
- `*.ipynb`: Jupyter notebooks demonstrating specific features. Numbered prefixes indicate recommended reading order.
- `benchmark.py`: Performance benchmarking script.
- `eval_demo.py`: Evaluation framework demo.
- `README.md`: This file.
```bash
# Install Loclean (Ollama daemon is started automatically)
pip install loclean

# For privacy scrubbing with fake data replacement
pip install loclean[privacy]

# For Jupyter notebooks
pip install jupyter

# Optional: For better performance
pip install polars pandas
```

Loclean auto-pulls models on first use. You can also manage models explicitly via the CLI:
```bash
# Check daemon status and list local models
loclean model status

# Pull a specific model ahead of time
loclean model pull phi3
loclean model pull llama3
```

- Start Jupyter: `jupyter notebook` or `jupyter lab`
- Open a notebook: Click on any `.ipynb` file
- Run cells: Press `Shift+Enter` to run a cell
- Experiment: Modify code and see results

- First time? Start with `01-quick-start.ipynb`
- Need help? Check the full documentation
- Model auto-pull: First run auto-downloads the default model (one-time, ~2 GB). Change models with `loclean.clean(..., model="llama3")` or set `LOCLEAN_MODEL=llama3`.
- Caching: Results are cached, so re-running cells is fast
- Errors? Check that you have Ollama installed (`ollama --version`) and the required Python dependencies
- Full Documentation: https://nxank4.github.io/loclean
- GitHub Repository: https://github.com/nxank4/loclean
- PyPI Package: https://pypi.org/project/loclean
Found a bug or want to add an example? Please open an issue or pull request on GitHub!
When adding a new example notebook:
- Naming convention: Use numbered prefixes (e.g., `06-new-feature.ipynb`) to maintain order
- Structure: Follow the pattern of existing notebooks:
  - Start with a clear title and description
  - Include installation/setup cells
  - Provide clear explanations in markdown cells
  - Show expected outputs
- Dependencies: Document any special dependencies in the notebook's first cell
- Testing: Ensure all cells run successfully before submitting
- Documentation: Update this README to include your new notebook in the "Available Notebooks" section
- Keep examples simple: Focus on demonstrating one feature or concept per notebook
- Use real-world scenarios: Make examples relatable and practical
- Document assumptions: Clearly state any prerequisites or assumptions
- Test thoroughly: Ensure all cells execute without errors
- Follow code style: Use type hints and follow PEP 8 (enforced by `ruff`)
- Update this README: When adding new notebooks, update the "Available Notebooks" section above
The `benchmark.py` script is used for performance testing. When modifying it:
- Keep it focused on performance metrics
- Document what is being benchmarked
- Ensure it runs without errors
- Update this README if the script's purpose changes significantly