fraudcrawler

Fraudcrawler is an intelligent market monitoring tool that searches the web for products, extracts product details, and classifies them using LLMs. It combines search APIs, web scraping, and AI to automate product discovery and relevance assessment.

Features

Asynchronous pipeline - Products move through search, extraction, and classification stages independently
Multiple search engines - Google Search, Google Shopping, and configured website sources
Search term enrichment - Automatically find related terms and expand your search
Product extraction - Get structured product data via Zyte API
LLM classification - Assess product relevance using OpenAI API with custom prompts
Marketplace filtering - Focus searches on specific domains
Deduplication - Avoid reprocessing previously collected URLs
CSV export - Results saved with timestamps for easy tracking

Prerequisites

Python 3.11 or higher
API keys for:
- SerpAPI - Google search results
- Zyte API - Product data extraction
- OpenAI API - Product classification
- DataForSEO (optional) - Search term enrichment

Installation

python3.11 -m venv .venv
source .venv/bin/activate
pip install fraudcrawler

Alternatively, when working from a clone of the repository, install the dependencies with Poetry:

poetry install

Configuration

Create a .env file with your API credentials (see .env.example for template):

SERPAPI_KEY=your_serpapi_key
ZYTEAPI_KEY=your_zyte_key
OPENAIAPI_KEY=your_openai_key
DATAFORSEO_USER=your_user  # optional
DATAFORSEO_PWD=your_pwd    # optional
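
Before launching a run, it can help to confirm that the required keys are actually set. The following is a minimal sketch using only the standard library. The helper name missing_keys is hypothetical and not part of fraudcrawler, and the sketch assumes the variables from the .env file have already been loaded into the process environment (for example by a dotenv loader).

```python
import os

# Variable names taken from the .env template above.
REQUIRED_KEYS = ("SERPAPI_KEY", "ZYTEAPI_KEY", "OPENAIAPI_KEY")
OPTIONAL_KEYS = ("DATAFORSEO_USER", "DATAFORSEO_PWD")


def missing_keys(env=None, keys=REQUIRED_KEYS):
    """Return the names in `keys` that are unset or blank in `env`.

    Hypothetical helper for illustration; not a fraudcrawler function.
    """
    env = os.environ if env is None else env
    return [key for key in keys if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required credentials:", ", ".join(missing))
    else:
        print("All required credentials are set.")
```

Passing keys=OPTIONAL_KEYS performs the same check for the DataForSEO credentials, which are only needed for search term enrichment.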

Usage

Basic Configuration

For a complete working example, see fraudcrawler/launch_demo_pipeline.py. Once the necessary parameters are set up, you can run the pipeline and load the results with:

# Run pipeline
await client.run(
    search_term=search_term,
    search_engines=search_engines,
    language=language,
    location=location,
    deepness=deepness,
    excluded_urls=excluded_urls,
)

# Load results
df = client.load_results()
print(df.head())
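
The await call above only works inside a coroutine or an async-aware shell such as a notebook. In a plain script, one common pattern is to wrap the call in asyncio.run. The sketch below uses a hypothetical StubClient in place of the real client, whose construction is shown in fraudcrawler/launch_demo_pipeline.py, and placeholder string arguments. Only the wrapping pattern is the point.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for the real client; it only records each call."""

    def __init__(self):
        self.calls = []

    async def run(self, **kwargs):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        self.calls.append(kwargs)


async def main(client):
    # Placeholder values; real arguments are built as in the demo script.
    await client.run(search_term="sildenafil", language="de", location="ch")
    return client.calls


# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
calls = asyncio.run(main(StubClient()))
print(calls)
```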

Advanced Configuration

Search term enrichment - Find and search related terms:

from fraudcrawler import Enrichment

deepness.enrichment = Enrichment(
    additional_terms=5,
    additional_urls_per_term=10
)

Marketplace filtering - Focus on specific domains:

from fraudcrawler import Host

marketplaces = [
    Host(name="International", domains="zavamed.com,apomeds.com"),
    Host(name="National", domains="netdoktor.ch,nobelpharma.ch"),
]

await client.run(..., marketplaces=marketplaces)

Exclude domains - Exclude specific domains from your results:

excluded_urls = [
    Host(name="Compendium", domains="compendium.ch"),
]

await client.run(..., excluded_urls=excluded_urls)

Skip previously collected URLs:

previously_collected_urls = [
    "https://example.com/product1",
    "https://example.com/product2",
]

await client.run(..., previously_collected_urls=previously_collected_urls)
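
One way to build this list is to read the URLs back from earlier result files. The sketch below is an illustration under assumptions: collect_urls is a hypothetical helper, and the column name "url" is a guess that should be adjusted to the actual CSV header.

```python
import csv
from pathlib import Path


def collect_urls(results_dir, column="url"):
    """Gather unique, non-empty URLs from every CSV file in `results_dir`.

    The column name "url" is an assumption; change it to match the real header.
    """
    seen = {}
    for path in sorted(Path(results_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                url = (row.get(column) or "").strip()
                if url:
                    seen.setdefault(url, None)  # dict keys keep first-seen order
    return list(seen)
```

The returned list can then be passed as previously_collected_urls, for example collect_urls("data/results").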

Website source search - Ingest product listings from configured website templates:

from fraudcrawler import SearchEngineName
from fraudcrawler.scraping.utils import build_website_source_profile

source = build_website_source_profile(
    name="My Shop",
    base_url="https://shop.example/",
    searchable_urls=[
        {
            "filterUrl": "search?q={search_term}",
            "includeSubstrings": ["/p/"],
            "excludeSubstrings": [],
        }
    ],
    render_options={
        "javascript": True,
        "includeIframes": False,
        "actions": [],
        "networkCapture": [],
    },
)

await client.run(
    ...,
    search_engines=[SearchEngineName.WEBSITE_SOURCE],
    website_source_sources=[source],
)

Notes:

- Website-source jobs run for the initial search term only (enrichment terms are not used for website-source ingestion).
- URL results still pass the regular country-code filtering used by the scraping pipeline.
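
To make the template fields above more concrete, the following sketch shows one plausible reading of filterUrl, includeSubstrings, and excludeSubstrings. It is not fraudcrawler's implementation. Both helper names are hypothetical, and treating an empty include list as "keep everything" is an assumption.

```python
from urllib.parse import quote_plus, urljoin


def build_search_url(base_url, filter_url, search_term):
    """Fill the {search_term} placeholder and resolve it against the base URL."""
    relative = filter_url.replace("{search_term}", quote_plus(search_term))
    return urljoin(base_url, relative)


def keep_link(url, include_substrings, exclude_substrings):
    """Keep a link that matches any include substring and no exclude substring.

    Assumption: an empty include list means every link is a candidate.
    """
    included = not include_substrings or any(s in url for s in include_substrings)
    excluded = any(s in url for s in exclude_substrings)
    return included and not excluded


print(build_search_url("https://shop.example/", "search?q={search_term}", "blue pill"))
print(keep_link("https://shop.example/p/123", ["/p/"], []))
```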

Redis cache - Set REDIS_USE_CACHE=true and run a Redis server to cache API and scrape calls (Searcher, Enricher, Zyte, Workflow).

View all results from a client instance:

client.print_available_results()

Output

Results are saved as CSV files in data/results/ with the naming pattern:

<search_term>_<language_code>_<location_code>_<timestamp>.csv

Example: sildenafil_de_ch_20250115143022.csv
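
When post-processing many result files, the naming pattern can be split back into its parts. The sketch below is a hypothetical helper, not part of fraudcrawler. It assumes the timestamp has the form YYYYMMDDHHMMSS, as the example filename suggests, and that the language and location codes contain no underscores.

```python
from datetime import datetime
from pathlib import Path


def parse_result_name(filename):
    """Split '<search_term>_<language>_<location>_<timestamp>.csv' into parts."""
    stem = Path(filename).stem
    # Split from the right so underscores inside the search term survive.
    search_term, language, location, stamp = stem.rsplit("_", 3)
    return {
        "search_term": search_term,
        "language": language,
        "location": location,
        # Assumed format: YYYYMMDDHHMMSS
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
    }


info = parse_result_name("sildenafil_de_ch_20250115143022.csv")
print(info["search_term"], info["language"], info["location"], info["timestamp"])
```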

The CSV includes product details, URLs, and classification scores from your workflows. Raw page HTML is intentionally excluded from CSV exports to keep result files smaller.

Development

For detailed contribution guidelines, see CONTRIBUTING.md.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Architecture

Fraudcrawler uses an asynchronous pipeline where products can be at different processing stages simultaneously. Product A might be in classification while Product B is still being scraped. This is enabled by async workers for each stage (Search, Context Extraction, Processing) using httpx.AsyncClient.
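
The staged design described above can be illustrated with plain asyncio queues. This is a simplified sketch, not fraudcrawler's actual code: the stage names, the STOP sentinel, and the run_pipeline function are all hypothetical, and asyncio.sleep(0) stands in for real network calls.

```python
import asyncio

STOP = object()  # sentinel that tells a stage to shut down


async def stage(name, inbox, outbox, log):
    """Take items from `inbox`, 'process' them, and forward them to `outbox`."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # pass the shutdown signal downstream
            return
        await asyncio.sleep(0)  # placeholder for real I/O (HTTP call, LLM request)
        log.append((name, item))
        await outbox.put(item)


async def run_pipeline(urls):
    search_q, extract_q, classify_q, done_q = (asyncio.Queue() for _ in range(4))
    log = []
    # One worker per stage; each runs concurrently with the others.
    workers = [
        asyncio.create_task(stage("search", search_q, extract_q, log)),
        asyncio.create_task(stage("extract", extract_q, classify_q, log)),
        asyncio.create_task(stage("classify", classify_q, done_q, log)),
    ]
    for url in urls:
        await search_q.put(url)
    await search_q.put(STOP)
    await asyncio.gather(*workers)

    # Drain the final queue up to the sentinel.
    results = []
    while True:
        item = done_q.get_nowait()
        if item is STOP:
            break
        results.append(item)
    return results, log


results, log = asyncio.run(run_pipeline(["product-a", "product-b"]))
print(results)
print(log)
```

Because each stage only waits on its own inbox, a later item can still be in an early stage while an earlier item has already moved on, which is the behaviour the paragraph above describes.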

For more details on the async design, see the httpx documentation.
