piedomains: Classify website content using ML Models or LLMs

🚀 What's New in v0.6.0

Streamlined JSON API: Simple, consistent JSON responses for easy integration with any workflow
Enhanced LLM Support: Built-in support for OpenAI, Anthropic, and Google AI models with custom category definitions
Advanced Archive Analysis: Analyze historical website versions from archive.org with intelligent rate limiting
Separated Data Collection: Collect website content once, run multiple classification approaches (ML + LLM + ensemble)
41 Content Categories: Comprehensive classification including news, shopping, social media, education, finance, and more

Installation

pip install piedomains

Requires Python 3.11+

Basic Usage

from piedomains import DomainClassifier, DataCollector

classifier = DomainClassifier()
results = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])

for result in results:
    print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")

# Output:
# cnn.com: news (0.876)
# amazon.com: shopping (0.923)
# wikipedia.org: education (0.891)
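Because each result carries a confidence score, a common follow-up step is to separate confident predictions from ones that deserve a manual look. The sketch below assumes only the three keys shown above (`domain`, `category`, `confidence`). The helper name `split_by_confidence`, the 0.5 cutoff, and the sample values are illustrative and not part of piedomains.

```python
# Sample data in the shape shown above; values are illustrative, not real output.
results = [
    {"domain": "cnn.com", "category": "news", "confidence": 0.876},
    {"domain": "amazon.com", "category": "shopping", "confidence": 0.923},
    {"domain": "example.org", "category": "education", "confidence": 0.412},
]


def split_by_confidence(results, threshold=0.5):
    """Return (accepted, needs_review) lists based on a confidence cutoff."""
    accepted = [r for r in results if r["confidence"] >= threshold]
    needs_review = [r for r in results if r["confidence"] < threshold]
    return accepted, needs_review


accepted, needs_review = split_by_confidence(results, threshold=0.5)
print([r["domain"] for r in accepted])      # ['cnn.com', 'amazon.com']
print([r["domain"] for r in needs_review])  # ['example.org']
```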

Classification Methods

# Combined text + image analysis (most accurate)
result = classifier.classify(["github.com"])

# Text-only classification (faster)
result = classifier.classify_by_text(["news.google.com"])

# Image-only classification
result = classifier.classify_by_images(["instagram.com"])

# Batch processing with separated workflow
domains = ["cnn.com", "amazon.com", "wikipedia.org"]
collector = DataCollector()
collection = collector.collect_batch(domains, batch_size=50)
results = classifier.classify_from_collection(collection, method="text")
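Once content has been collected, results from several methods can be compared for the same domain. The following is a minimal sketch of one simple way to combine them: keep the prediction with the highest confidence per domain. It is not the library's own ensemble logic, which is not described here. The function `pick_most_confident` and the sample values are hypothetical, and the result shape follows the Basic Usage example.

```python
def pick_most_confident(*result_lists):
    """For each domain, keep the prediction with the highest confidence."""
    best = {}
    for results in result_lists:
        for r in results:
            current = best.get(r["domain"])
            if current is None or r["confidence"] > current["confidence"]:
                best[r["domain"]] = r
    return list(best.values())


# Illustrative values standing in for text-based and image-based results.
text_results = [{"domain": "github.com", "category": "tech", "confidence": 0.71}]
image_results = [{"domain": "github.com", "category": "social", "confidence": 0.55}]

combined = pick_most_confident(text_results, image_results)
print(combined[0]["category"])  # tech
```

Taking the maximum is easy to reason about but ignores agreement between methods; averaging per-category scores would be another option if the methods expose them.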

Historical Analysis

# Analyze archived versions from archive.org
old_result = classifier.classify(["facebook.com"], archive_date="20100101")

# Batch processing with archive.org (respects rate limits)
domains = ["google.com", "wikipedia.org", "cnn.com"]
collector = DataCollector(archive_date="20050101")
collection = collector.collect_batch(domains, batch_size=10)  # Archive.org uses conservative defaults
historical_results = classifier.classify_from_collection(collection, method="text")
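The `archive_date` values above are eight-digit `YYYYMMDD` strings. A small helper can build them from Python date objects and reject malformed strings before any network request is made. The helper name `to_archive_date` is hypothetical and not part of piedomains.

```python
from datetime import date, datetime


def to_archive_date(value):
    """Return a YYYYMMDD string from a date/datetime or a YYYYMMDD string."""
    if isinstance(value, date):  # datetime is a subclass of date
        return value.strftime("%Y%m%d")
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    # Parsing also rejects impossible dates such as 20101340.
    return datetime.strptime(value, "%Y%m%d").strftime("%Y%m%d")


print(to_archive_date(date(2010, 1, 1)))  # 20100101
```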

Archive.org Rate Limits & Best Practices

The library automatically respects archive.org's rate limits:

CDX API: 1 request per second for snapshot lookups
Page fetching: Default 2 parallel contexts (vs 4 for live sites)
Auto-retry: Handles HTTP 429 responses with 60-second backoff
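To make the behavior in this list concrete, here is a minimal sketch of a one-request-per-second limiter combined with a fixed 60-second backoff on HTTP 429. It illustrates the general technique only and is not the library's implementation. The names `MinIntervalLimiter`, `fetch_with_backoff`, and `FakeClock` are hypothetical, and the retry cap of 3 is an assumption. A fake clock is used in the demo so the example runs instantly.

```python
import time


class MinIntervalLimiter:
    """Allow at most one call per `interval` seconds."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_backoff(do_request, limiter, backoff=60.0, max_retries=3,
                       sleep=time.sleep):
    """Call do_request() -> (status, body); on HTTP 429 wait `backoff` seconds and retry."""
    for attempt in range(max_retries + 1):
        limiter.wait()
        status, body = do_request()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(backoff)
    return status, body  # still 429 after all retries


class FakeClock:
    """Stand-in for time.monotonic/time.sleep that records sleeps."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


fake = FakeClock()
limiter = MinIntervalLimiter(interval=1.0, clock=fake.time, sleep=fake.sleep)
responses = iter([(429, ""), (200, "ok")])  # first call is throttled
result = fetch_with_backoff(lambda: next(responses), limiter, sleep=fake.sleep)
print(result, fake.sleeps)  # (200, 'ok') [60.0]
```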

Configure archive-specific settings:

from piedomains.fetchers import ArchiveFetcher

# Conservative settings for large batches
fetcher = ArchiveFetcher("20100101", max_parallel=1)

# More aggressive (use carefully)
fetcher = ArchiveFetcher("20100101", max_parallel=3)
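A setting like `max_parallel` typically caps how many fetches are in flight at once. The sketch below shows that idea with `asyncio.Semaphore`. It is an assumption about what the parameter means, not a description of how `ArchiveFetcher` works internally. `run_bounded` and `fake_fetch` are hypothetical names, and no network access takes place.

```python
import asyncio


async def run_bounded(items, worker, max_parallel=2):
    """Run worker(item) for every item with at most max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


in_flight = 0
peak = 0


async def fake_fetch(domain):
    """Pretend to fetch a page while tracking peak concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0)  # yield control to other tasks
    in_flight -= 1
    return f"fetched {domain}"


pages = asyncio.run(
    run_bounded(["google.com", "wikipedia.org", "cnn.com"], fake_fetch, max_parallel=2)
)
print(pages)
print("peak concurrency:", peak)
```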

LLM Classification

# Configure LLM provider
classifier.configure_llm(
    provider="openai",
    model="gpt-4o",
    api_key="sk-...",
    categories=["news", "shopping", "social", "tech"]
)

# LLM-powered classification
result = classifier.classify_by_llm(["example.com"])

# With custom instructions
result = classifier.classify_by_llm(
    ["site.com"],
    custom_instructions="Classify by educational value"
)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
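Reading the key from the environment keeps secrets out of source code. The following sketch looks up the variable for a given provider and fails early with a clear message when it is missing. The variable names come from the export lines above. The provider strings `"anthropic"` and `"google"` and the helper `get_api_key` are assumptions for illustration.

```python
import os

# Variable names are taken from the export lines above; the provider keys
# "anthropic" and "google" are assumed spellings.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def get_api_key(provider, environ=os.environ):
    """Return the API key for `provider`, or raise if it is not configured."""
    try:
        var = ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before using the {provider} provider")
    return key


# Demo with an explicit mapping so the example does not depend on your shell.
print(get_api_key("openai", environ={"OPENAI_API_KEY": "sk-test"}))  # sk-test
```

The returned value could then be passed as the `api_key` argument shown in the `configure_llm` example.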

Security & Docker

Since v0.5.0, piedomains includes production-ready Docker containerization for secure domain analysis:

# Build secure sandbox container
docker build -t piedomains-sandbox .

# Run with security constraints (2GB RAM, 2 CPU, read-only filesystem)
docker run --rm --memory=2g --cpus=2 --read-only \
  --tmpfs /tmp --tmpfs /var/tmp \
  piedomains-sandbox python -c "
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify(['example.com'])
for r in result:
    print(r['domain'], r['category'])
"

Batch Processing in Container:

# Use the included secure classification script
cd examples/sandbox
echo -e "wikipedia.org\ngithub.com\ncnn.com" > domains.txt
python3 secure_classify.py --file domains.txt

For testing, use known-safe domains: ["wikipedia.org", "github.com", "cnn.com"]
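When building a domains file like the one above from messier input, it can help to normalize it first. This is a minimal sketch that strips whitespace, lowercases entries, skips blank lines and `#` comments, and removes duplicates while preserving order. `parse_domains` is a hypothetical helper; how `secure_classify.py` parses its input is not documented here.

```python
def parse_domains(lines):
    """Return unique, lowercased domains in first-seen order."""
    seen = set()
    domains = []
    for line in lines:
        domain = line.strip().lower()
        if not domain or domain.startswith("#"):
            continue  # skip blanks and comment lines
        if domain not in seen:
            seen.add(domain)
            domains.append(domain)
    return domains


raw = "wikipedia.org\n\nGitHub.com\n# a comment\ncnn.com\nwikipedia.org\n"
domains = parse_domains(raw.splitlines())
print(domains)  # ['wikipedia.org', 'github.com', 'cnn.com']
```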

Documentation

Development

git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Citation

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}
