cleaning-web-corpus

Domain-specific web crawling and data ingestion pipeline for household dirt, dust, stains, and cleaning knowledge.
The goal is to build a research-grade corpus and pipeline that can support LLM agents and cleaning robots with structured cleaning knowledge.

Overview

- Multi-domain web crawler (Scrapy) targeting cleaning-related content (pillows, clothes, carpets, sofas).
- Processing pipeline to extract main article text (trafilatura), apply quality filters, and store structured JSONL.
- Domain-aware tagging (surface_type, dirt_type, cleaning_method) using heuristic rules.
- Analysis tools to inspect distributions and tag co-occurrences (e.g., dirt_type × cleaning_method).
- Extensible design for multi-modal data (image/video URLs, future robot sensor traces).
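As an illustration of what heuristic tagging can look like, here is a minimal sketch based on keyword matching. The rule table, the label names, and the `tag_text` helper are all hypothetical and chosen for this example; the rules actually used by the pipeline may differ.

```python
import re

# Hypothetical keyword rules, for illustration only.
RULES = {
    "surface_type": {
        "pillow": ["pillow"],
        "carpet": ["carpet", "rug"],
        "sofa": ["sofa", "couch"],
        "clothes": ["shirt", "clothes", "laundry"],
    },
    "dirt_type": {
        "stain": ["stain", "spill"],
        "dust": ["dust"],
    },
    "cleaning_method": {
        "vacuum": ["vacuum"],
        "hand_wash": ["hand wash", "hand-wash"],
        "spot_clean": ["blot", "spot clean"],
    },
}


def tag_text(text):
    """Return a dict mapping each tag field to a sorted list of matched labels."""
    lowered = text.lower()
    tags = {}
    for field, labels in RULES.items():
        matched = []
        for label, keywords in labels.items():
            # Match at a word start so "stains" still counts as "stain".
            if any(re.search(r"\b" + re.escape(kw), lowered) for kw in keywords):
                matched.append(label)
        tags[field] = sorted(matched)
    return tags


example = tag_text("Blot the wine stain on the carpet, then vacuum.")
```

Prefix matching keeps the rules short but can over-match (for example, "rug" also matches "rugged"), so a rule set like this would need spot checks against real pages.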

Quickstart

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Run the crawler (from repo root):

cd crawler_project
scrapy crawl seed_spider -O ../data/raw/seed_pages.jsonl
cd ..

Run the processing pipeline:

python pipeline/process_seed_pages.py
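The processing step roughly follows an extract, filter, write pattern. The sketch below shows that pattern under stated assumptions: the raw record fields `url` and `html`, the `min_chars` default, and the function names are hypothetical and not taken from `process_seed_pages.py`. The extractor is passed in as a callable so that `trafilatura.extract`, which returns the main text or `None`, can be plugged in.

```python
import json


def process_records(records, extract, min_chars=200):
    """Yield cleaned records; `extract` maps an HTML string to text or None."""
    for rec in records:
        text = extract(rec["html"])
        if not text:
            continue  # extraction failed or the page had no main content
        text = text.strip()
        if len(text) < min_chars:
            continue  # simple length-based quality filter
        yield {"url": rec["url"], "text": text, "n_chars": len(text)}


def write_jsonl(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Usage with the real extractor would look like:
#   import trafilatura
#   cleaned = process_records(raw_records, trafilatura.extract)
#   write_jsonl(cleaned, "data/processed/corpus.jsonl")  # hypothetical path
```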

Run the analysis:

python analysis/describe_corpus.py
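One of the statistics mentioned in the overview is the dirt_type × cleaning_method co-occurrence. A minimal way to count it is sketched below, assuming each processed record stores its tags as lists under the field names `dirt_type` and `cleaning_method`; that layout and the `cooccurrence` helper are assumptions for illustration, not the contents of `describe_corpus.py`.

```python
from collections import Counter
from itertools import product


def cooccurrence(records, field_a="dirt_type", field_b="cleaning_method"):
    """Count how often each (field_a label, field_b label) pair shares a record."""
    counts = Counter()
    for rec in records:
        # Every label of field_a is paired with every label of field_b.
        for pair in product(rec.get(field_a, []), rec.get(field_b, [])):
            counts[pair] += 1
    return counts


sample = [
    {"dirt_type": ["stain"], "cleaning_method": ["spot_clean", "vacuum"]},
    {"dirt_type": ["stain", "dust"], "cleaning_method": ["vacuum"]},
    {"dirt_type": ["dust"]},  # no method tagged, contributes no pairs
]
pair_counts = cooccurrence(sample)
```

Pairs that never appear in the counter point to coverage gaps, which is the kind of signal Experiment A looks at.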

Repository structure

cleaning-web-corpus/
  crawler/              # Seed URLs and (later) search-guided seeds
  crawler_project/      # Scrapy project (spiders, settings)
  data/
    raw/                # Raw crawl outputs (JSONL)
    processed/          # Processed, tagged corpus (JSONL)
  pipeline/             # HTML → text extraction, tagging, filters
  analysis/             # Stats scripts and experiment reports
  DATASET_CARD.md       # Dataset card and intended uses
  requirements.txt
  README.md

Dataset and pipeline

See DATASET_CARD.md for a detailed description of:
- Motivation and intended uses.
- Schema and fields.
- Known limitations and future extensions.

Experiments

Experiment A – Seed targeting & coverage:
Analyze how adding targeted seeds changes tag distributions and dirt_type × cleaning_method coverage.
Experiment B – Length-based quality filtering:
Study how different minimum length thresholds affect corpus size and quality.
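A threshold study like Experiment B can be sketched as a sweep over candidate minimum lengths, reporting how much of the corpus survives each one. The helper name, the choice of word count as the length measure, and the threshold values are assumptions for this example; the actual experiment may measure length differently.

```python
def sweep_min_length(texts, thresholds):
    """For each threshold, report how many texts have at least that many words."""
    lengths = [len(t.split()) for t in texts]
    total = len(lengths)
    rows = []
    for threshold in thresholds:
        kept = sum(1 for n in lengths if n >= threshold)
        rows.append({
            "min_words": threshold,
            "kept": kept,
            "kept_fraction": kept / total if total else 0.0,
        })
    return rows


# Toy corpus with 3, 10, and 100 words.
toy_texts = ["a b c", "a " * 10, "a " * 100]
sweep_rows = sweep_min_length(toy_texts, thresholds=[5, 50])
```

Corpus size is the easy half of the study. Judging quality at each threshold still needs manual inspection of a sample of the documents that were dropped and kept.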

(Experiment reports live under analysis/experiments/.)

