HonestRoles, developed by Hypertrial, is a Python package designed to transform raw job posting data into structured, scored, and searchable datasets.
- 🧹 Clean: HTML stripping, location normalization (city/region/country), salary parsing, and record deduplication.
- 🔍 Filter: High-performance `FilterChain` with predicates for location, salary, skills, and keyword matching.
- 🏷️ Label: Automated seniority detection, role categorization, and tech stack extraction.
- ⭐️ Rate: Comprehensive job description scoring for completeness and quality.
- 🤖 LLM Integration: Seamless use of local Ollama models (e.g., Llama 3) for deep semantic analysis.
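The quick start below walks through each stage in detail; as a compact sketch, the same pipeline can be run end to end with the top-level helpers shown later in this README (assuming a Parquet file of raw postings that matches the source data contract):

```python
import honestroles as hr

df = hr.read_parquet("jobs_current.parquet")                   # raw postings
df = hr.clean_jobs(df)                                         # strip HTML, normalize, dedupe
df = hr.filter_jobs(df, remote_only=True, min_salary=100_000)  # functional filter API
df = hr.label_jobs(df, use_llm=False)                          # heuristics only
df = hr.rate_jobs(df)                                          # completeness/quality scores
hr.write_parquet(df, "jobs_scored.parquet")
```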
Install from PyPI:

```bash
pip install honestroles
```

For development:

```bash
git clone https://github.com/hypertrial/honestroles.git
cd honestroles
pip install -e ".[dev]"
```

Quick start:

```python
import honestroles as hr
from honestroles import schema
# Load raw job data (Parquet or DuckDB)
df = hr.read_parquet("jobs_current.parquet")
# 1. Clean and normalize data
df = hr.clean_jobs(df)
# 2. Apply complex filtering
chain = hr.FilterChain()
chain.add(hr.filter.by_location, regions=["California", "New York"])
chain.add(hr.filter.by_salary, min_salary=120_000, currency="USD")
chain.add(hr.filter.by_skills, required=["Python", "React"])
df = chain.apply(df)
# 3. Label roles (Heuristics + LLM)
df = hr.label_jobs(df, use_llm=True, model="llama3")
# 4. Rate job quality
df = hr.rate_jobs(df)
# Access data using schema constants
print(df[[schema.TITLE, schema.CITY, schema.COUNTRY]].head())
# Save structured results
hr.write_parquet(df, "jobs_scored.parquet")
```
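The quick start reads Parquet; per the project layout, the io module also handles DuckDB. If you prefer to load a DuckDB table yourself with the `duckdb` package, a minimal sketch looks like this — the `jobs.duckdb` file and `jobs` table are hypothetical placeholders, and it assumes the pipeline operates on pandas DataFrames:

```python
import duckdb
import honestroles as hr

# Hypothetical database file and table name; adjust to your source.
con = duckdb.connect("jobs.duckdb")
df = con.execute("SELECT * FROM jobs").df()  # fetch the result as a pandas DataFrame
con.close()

# From here the pipeline is the same as the Parquet-based quick start.
df = hr.clean_jobs(df)
```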
For source data, apply contract normalization and validation before processing:

```python
import honestroles as hr
df = hr.read_parquet("jobs_current.parquet", validate=False)
df = hr.normalize_source_data_contract(df)
df = hr.validate_source_data_contract(df)
df = hr.clean_jobs(df)
df = hr.filter_jobs(df, remote_only=True)
df = hr.label_jobs(df, use_llm=False)
df = hr.rate_jobs(df, use_llm=False)
```

See /docs/start/quickstart.md and /docs/reference/source_data_contract_v1.md.
Documentation index: /docs/index.md.
Docs stack: /docs/maintainers/docs_stack.md.
Build docs locally:
```bash
pip install -e ".[docs]"
mkdocs serve
```

Deploy docs on GitHub Pages:
- Ensure repository Settings -> Pages -> Build and deployment -> Source is set to GitHub Actions.
- Push to `main` to trigger `.github/workflows/docs-pages.yml`.
Always use honestroles.schema for consistent column referencing:
```python
from honestroles import schema
# Available constants:
# schema.TITLE, schema.DESCRIPTION_TEXT, schema.COMPANY
# schema.CITY, schema.REGION, schema.COUNTRY
# schema.SALARY_MIN, schema.SALARY_MAX, etc.
```
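As a small usage sketch (assuming `read_parquet` returns a pandas DataFrame and that salaries are numeric after cleaning), the constants drop in anywhere a column name is expected:

```python
import honestroles as hr
from honestroles import schema

df = hr.clean_jobs(hr.read_parquet("jobs_current.parquet"))

# Median advertised minimum salary per country, referenced via schema
# constants rather than hard-coded column-name strings.
print(
    df.groupby(schema.COUNTRY)[schema.SALARY_MIN]
    .median()
    .sort_values(ascending=False)
    .head(10)
)
```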
The `FilterChain` allows you to compose multiple filtering rules efficiently:

```python
import honestroles as hr
from honestroles import FilterChain, filter_jobs, schema

df = hr.clean_jobs(hr.read_parquet("jobs_current.parquet"))
# Functional approach:
df = filter_jobs(df, remote_only=True, min_salary=100_000)
# Composable approach:
chain = FilterChain()
chain.add(hr.filter.by_keywords, include=["Engineer"], exclude=["Manager"])
chain.add(hr.filter.by_completeness, required_fields=[schema.DESCRIPTION_TEXT, schema.APPLY_URL])
filtered_df = chain.apply(df)
```
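Custom predicates can, in principle, be composed the same way. The sketch below is an assumption-laden illustration: it presumes a predicate receives the DataFrame plus any keyword arguments passed to `add` and returns the filtered DataFrame — check the filter module's documentation for the exact contract.

```python
import honestroles as hr
from honestroles import FilterChain, schema

# Hypothetical custom predicate: keep postings with a reasonably long
# description. Assumes predicates take (df, **kwargs) and return the
# filtered DataFrame, mirroring how the built-in predicates are used above.
def by_min_description_length(df, min_chars=200):
    lengths = df[schema.DESCRIPTION_TEXT].str.len().fillna(0)
    return df[lengths >= min_chars]

df = hr.clean_jobs(hr.read_parquet("jobs_current.parquet"))

chain = FilterChain()
chain.add(by_min_description_length, min_chars=300)
df = chain.apply(df)
```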
Ensure Ollama is running locally:

```bash
ollama serve
ollama pull llama3
```

Then enable LLM-based labeling or quality rating:

```python
df = hr.label_jobs(df, use_llm=True, model="llama3")
df = hr.rate_jobs(df, use_llm=True, model="llama3")
```

The package is organized as follows:

```text
src/honestroles/
├── clean/      # HTML stripping, normalization, and dedup
├── filter/     # Composed FilterChain and predicates
├── io/         # Parquet and DuckDB I/O with validation
├── label/      # Seniority, category, and tech stack labeling
├── llm/        # Ollama client and prompt templates
├── rate/       # Completeness, quality, and composite ratings
└── schema.py   # Centralized column name constants
```
Run the test suite with pytest:
```bash
pytest
```

Run all CI-equivalent quality checks automatically before each local commit:

```bash
pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files
```

This installs a Git pre-commit hook that runs ruff, mypy, and `pytest -m "not performance" -q`.
- Changelog: /CHANGELOG.md
- Performance guardrails: /docs/maintainers/performance.md
- Docs index: /docs/index.md