Classify the kind of content hosted by the domain using the domain name, and text and screenshot of the homepage.

themains/piedomains

piedomains: Classify website content using ML Models or LLMs

🚀 What's New in v0.6.0

  • Streamlined JSON API: Simple, consistent JSON responses for easy integration with any workflow
  • Enhanced LLM Support: Built-in support for OpenAI, Anthropic, and Google AI models with custom category definitions
  • Advanced Archive Analysis: Analyze historical website versions from archive.org with intelligent rate limiting
  • Separated Data Collection: Collect website content once, run multiple classification approaches (ML + LLM + ensemble)
  • 41 Content Categories: Comprehensive classification including news, shopping, social media, education, finance, and more

Installation

pip install piedomains

Requires Python 3.11+

Basic Usage

from piedomains import DomainClassifier, DataCollector

classifier = DomainClassifier()
results = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])

for result in results:
    print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")

# Output:
# cnn.com: news (0.876)
# amazon.com: shopping (0.923)
# wikipedia.org: education (0.891)
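Because each result is a plain dictionary, downstream filtering needs no special tooling. The sketch below assumes only the three keys used in the example above (`domain`, `category`, `confidence`); the helper name `confident_labels` and the 0.9 threshold are illustrative and not part of piedomains.

```python
# Sample results in the shape printed above (values copied from the example output).
results = [
    {"domain": "cnn.com", "category": "news", "confidence": 0.876},
    {"domain": "amazon.com", "category": "shopping", "confidence": 0.923},
    {"domain": "wikipedia.org", "category": "education", "confidence": 0.891},
]


def confident_labels(results, threshold=0.9):
    """Map domain -> category for results at or above the threshold.

    Hypothetical helper, not a piedomains function.
    """
    return {
        r["domain"]: r["category"]
        for r in results
        if r["confidence"] >= threshold
    }


print(confident_labels(results))
```

Lowering the threshold trades precision for coverage; a sensible cut-off depends on the downstream use.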

Classification Methods

# Combined text + image analysis (most accurate)
result = classifier.classify(["github.com"])

# Text-only classification (faster)
result = classifier.classify_by_text(["news.google.com"])

# Image-only classification
result = classifier.classify_by_images(["instagram.com"])

# Batch processing with separated workflow
domains = ["cnn.com", "amazon.com", "wikipedia.org"]
collector = DataCollector()
collection = collector.collect_batch(domains, batch_size=50)
results = classifier.classify_from_collection(collection, method="text")
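Once content has been collected, several methods can be run on the same collection and their outputs combined. The following is a minimal sketch of one possible combination rule (keep the most confident result per domain). It assumes every method returns dictionaries with `domain`, `category`, and `confidence` keys, as in Basic Usage; `merge_by_confidence` is a hypothetical helper, the sample values are made up, and this is not necessarily how the library's own ensemble works.

```python
def merge_by_confidence(*result_lists):
    """Keep, for each domain, the single result with the highest confidence.

    Hypothetical helper; the library's ensemble logic may differ.
    """
    best = {}
    for results in result_lists:
        for r in results:
            current = best.get(r["domain"])
            if current is None or r["confidence"] > current["confidence"]:
                best[r["domain"]] = r
    return list(best.values())


# Made-up outputs standing in for two classification passes.
text_results = [{"domain": "github.com", "category": "tech", "confidence": 0.71}]
image_results = [{"domain": "github.com", "category": "tech", "confidence": 0.64}]

merged = merge_by_confidence(text_results, image_results)
print(merged)
```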

Historical Analysis

# Analyze archived versions from archive.org
old_result = classifier.classify(["facebook.com"], archive_date="20100101")

# Batch processing with archive.org (respects rate limits)
domains = ["google.com", "wikipedia.org", "cnn.com"]
collector = DataCollector(archive_date="20050101")
collection = collector.collect_batch(domains, batch_size=10)  # Archive.org uses conservative defaults
historical_results = classifier.classify_from_collection(collection, method="text")
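The `archive_date` values above are eight-digit `YYYYMMDD` strings. When dates come from `datetime` objects, a small formatter avoids hand-typed strings. The helper names below are illustrative and not part of piedomains.

```python
from datetime import date


def to_archive_date(d: date) -> str:
    """Format a date as the YYYYMMDD string used in the examples above."""
    return d.strftime("%Y%m%d")


def yearly_dates(start_year: int, end_year: int) -> list[str]:
    """One snapshot date (January 1) per year, inclusive of both ends."""
    return [to_archive_date(date(y, 1, 1)) for y in range(start_year, end_year + 1)]


print(to_archive_date(date(2010, 1, 1)))
print(yearly_dates(2005, 2007))
```

Each string from `yearly_dates` could then be passed as `archive_date` to build a year-by-year view of a domain.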

Archive.org Rate Limits & Best Practices

The library automatically respects archive.org's rate limits:

  • CDX API: 1 request per second for snapshot lookups
  • Page fetching: Default 2 parallel contexts (vs 4 for live sites)
  • Auto-retry: Handles HTTP 429 responses with 60-second backoff
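To make the behavior above concrete, here is a generic sketch of the two techniques: spacing requests at least one second apart, and waiting 60 seconds before retrying after an HTTP 429. It is not the library's implementation; `TooManyRequests`, `Throttle`, and `fetch_with_backoff` are hypothetical names, and the `fetch` callable stands in for whatever actually issues the request.

```python
import time


class TooManyRequests(Exception):
    """Stands in for an HTTP 429 response (hypothetical)."""


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_backoff(fetch, url, max_retries=3, backoff_seconds=60, sleep=time.sleep):
    """Call fetch(url); on TooManyRequests, sleep and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except TooManyRequests:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(backoff_seconds)
```

The `clock` and `sleep` parameters are injectable so the timing logic can be exercised without real delays.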

Configure archive-specific settings:

from piedomains.fetchers import ArchiveFetcher

# Conservative settings for large batches
fetcher = ArchiveFetcher("20100101", max_parallel=1)

# More aggressive (use carefully)
fetcher = ArchiveFetcher("20100101", max_parallel=3)

LLM Classification

# Configure LLM provider
classifier.configure_llm(
    provider="openai",
    model="gpt-4o",
    api_key="sk-...",
    categories=["news", "shopping", "social", "tech"]
)

# LLM-powered classification
result = classifier.classify_by_llm(["example.com"])

# With custom instructions
result = classifier.classify_by_llm(
    ["site.com"],
    custom_instructions="Classify by educational value"
)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
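If you prefer to read the key yourself and pass it as `api_key`, a lookup like the one below fails early when a variable is missing. The variable names come from the export lines above; the provider strings `"anthropic"` and `"google"` are assumptions (only `"openai"` appears in this README), and `resolve_api_key` is a hypothetical helper.

```python
import os

# Environment variable names as listed above. The "anthropic" and "google"
# provider strings are assumed; only "openai" is shown in this README.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(provider, env=None):
    """Return the API key for a provider, or raise with a clear message."""
    env = os.environ if env is None else env
    if provider not in ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    var = ENV_VARS[provider]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

The result can be passed as the `api_key` argument of `configure_llm`, which keeps secrets out of source files.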

Categories

The classifier assigns one of 41 categories, including news, finance, shopping, education, government, adult content, gambling, social networks, and search engines. The categories are based on the Shallalist taxonomy.

Security & Docker

Since v0.5.0, piedomains includes production-ready Docker containerization for secure domain analysis:

# Build secure sandbox container
docker build -t piedomains-sandbox .

# Run with security constraints (2GB RAM, 2 CPU, read-only filesystem)
docker run --rm --memory=2g --cpus=2 --read-only \
  --tmpfs /tmp --tmpfs /var/tmp \
  piedomains-sandbox python -c "
from piedomains import DomainClassifier
classifier = DomainClassifier()
result = classifier.classify(['example.com'])
for r in result:
    print(r['domain'], r['category'])
"

Batch Processing in Container:

# Use the included secure classification script
cd examples/sandbox
echo -e "wikipedia.org\ngithub.com\ncnn.com" > domains.txt
python3 secure_classify.py --file domains.txt

For testing, use known-safe domains: ["wikipedia.org", "github.com", "cnn.com"]
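A domain list file like `domains.txt` above is worth cleaning before classification. The sketch below normalizes case, skips blank lines and `#` comments, and removes duplicates while preserving order. `parse_domains` is a hypothetical helper; whether `secure_classify.py` performs similar cleaning is not stated here.

```python
def parse_domains(text):
    """Return unique, lower-cased domains from newline-separated text.

    Blank lines and lines starting with '#' are ignored; first-seen order is kept.
    """
    seen = set()
    domains = []
    for line in text.splitlines():
        domain = line.strip().lower()
        if not domain or domain.startswith("#"):
            continue
        if domain not in seen:
            seen.add(domain)
            domains.append(domain)
    return domains


sample = "wikipedia.org\ngithub.com\n\nCNN.com\ngithub.com\n"
print(parse_domains(sample))
```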

Development

git clone https://github.com/themains/piedomains
cd piedomains
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Citation

@software{piedomains,
  title={piedomains: AI-powered domain content classification},
  author={Chintalapati, Rajashekar and Sood, Gaurav},
  year={2024},
  url={https://github.com/themains/piedomains}
}
