Python crawler that scrapes posts from tuoitre.vn. Given three category URLs and desired counts per category, it downloads article metadata, full content, comments (with replies), vote reactions, images, and audio (when available), then stores them locally.
- Fetches posts per category until thresholds are met (default >=100 total, ensures at least one post has >=20 comments).
- Parses title, author, date, category, content HTML, reactions, comments + replies.
- Downloads images into `images/<postId>/` and audio into `audio/<postId>.m4a` when present.
- Persists each post as `data/<postId>.json`.
- Resilient HTTP client with retries and basic logging.
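
The "resilient HTTP client" item can be pictured with the sketch below. It is not the code in `core/http.py`; it only assumes that the session is built on `requests` with urllib3's `Retry`. The function names, retry count, backoff factor, status list, and User-Agent string are all illustrative assumptions.

```python
import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger("crawler.http")


def build_session(total_retries=3, backoff_factor=0.5):
    """Return a requests session that retries transient failures (assumed values)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        # Retry on rate limiting and common transient server errors.
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    # Placeholder header; the real headers are said to live in core/config.py.
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"})
    return session


def fetch(session, url, timeout=10):
    """Return the response body, or None after logging a failure."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logger.warning("Skipping %s: %s", url, exc)
        return None
```

Returning `None` instead of raising lets the caller skip a failed page and carry on, which matches the "logged and skipped" behavior this README describes.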

Project layout:
- `crawler.py`: CLI entry point; handles args/prompts, kicks off crawling.
- `core/runner.py`: coordinates category crawling, enforces min totals and comment threshold, saves outputs.
- `core/category.py`: loads site categories, paginates category timelines, schedules article parsing.
- `core/article.py`: parses an article page, downloads images/audio, gathers reactions and comments.
- `core/comments.py`: fetches comments and replies via API and normalizes fields.
- `core/reactions.py`: maps reaction codes to labels for posts/comments.
- `core/audio.py`: checks and downloads audio TTS files for posts.
- `core/storage.py`: ensures output directories, saves JSON, downloads files.
- `core/http.py`: shared requests session with retries.
- `core/config.py`: CSS selectors, category ID map, headers.
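
To make the roles of `core/comments.py` and `core/reactions.py` more concrete, here is a rough sketch of normalizing a comment tree and mapping reaction codes to labels. The raw field names (`id`, `author`, `content`, `reactions`, `replies`) and the code-to-label table are hypothetical; the real API payload and mapping may differ.

```python
# Hypothetical reaction codes; the real table is not shown in this README.
REACTION_LABELS = {1: "like", 2: "love", 3: "haha"}


def label_reactions(raw_counts):
    """Turn {code: count} into {label: count}, keeping unknown codes visible."""
    labeled = {}
    for code, count in raw_counts.items():
        label = REACTION_LABELS.get(int(code), f"unknown_{code}")
        labeled[label] = int(count)
    return labeled


def normalize_comment(raw):
    """Flatten one raw comment (assumed shape) into a stable dict, recursing into replies."""
    return {
        "id": str(raw.get("id", "")),
        "author": raw.get("author", ""),
        "content": (raw.get("content") or "").strip(),
        "reactions": label_reactions(raw.get("reactions") or {}),
        "replies": [normalize_comment(r) for r in raw.get("replies") or []],
    }


def count_comments(comments):
    """Count comments including all nested replies."""
    return sum(1 + count_comments(c["replies"]) for c in comments)
```

Keeping replies nested under their parent preserves the thread structure in the saved JSON, while `count_comments` gives a flat total when one is needed.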

Requirements:
- Python 3.9+
- See `requirements.txt` for Python dependencies.

Set up a virtual environment and install the dependencies:

    python -m venv .venv
    source .venv/bin/activate  # Windows: .venv\Scripts\activate
    pip install -r requirements.txt

Provide three category URLs and (optionally) counts per category:

    python crawler.py --categories <cat1_url> <cat2_url> <cat3_url> --per_cat 40 35 35

If omitted, the crawler will prompt interactively for three URLs and counts.
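
The argument handling described above might look roughly like the following. Only the `--categories` and `--per_cat` flags come from this README; the function names, help text, and prompt wording are assumptions, and the real `crawler.py` may behave differently when only one of the two flags is given.

```python
import argparse


def parse_args(argv=None):
    """Parse the two documented flags; both are optional."""
    parser = argparse.ArgumentParser(
        description="Crawl posts from three tuoitre.vn categories."
    )
    parser.add_argument("--categories", nargs=3, metavar="URL",
                        help="three category URLs")
    parser.add_argument("--per_cat", nargs=3, type=int, metavar="N",
                        help="number of posts to fetch per category")
    return parser.parse_args(argv)


def resolve_inputs(args, ask=input):
    """Fall back to interactive prompts for whatever was not passed on the CLI."""
    categories = args.categories or [
        ask(f"Category URL {i}: ").strip() for i in range(1, 4)
    ]
    per_cat = args.per_cat or [
        int(ask(f"Posts for category {i}: ")) for i in range(1, 4)
    ]
    # Pair each URL with its requested count.
    return list(zip(categories, per_cat))
```

Passing the prompt function as `ask` keeps the interactive path testable without a terminal.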
Outputs:
- JSON: `data/<postId>.json`
- Images: `images/<postId>/...`
- Audio: `audio/<postId>.m4a` (if available)
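
As an illustration of the output layout, the sketch below builds the three paths for a post and writes its JSON file. It is a simplified stand-in for `core/storage.py`; the `postId` key, the function names, and the `root` parameter are assumptions.

```python
import json
from pathlib import Path


def output_paths(post_id, root="."):
    """Return the documented output locations for one post."""
    root = Path(root)
    return {
        "json": root / "data" / f"{post_id}.json",
        "images": root / "images" / str(post_id),
        "audio": root / "audio" / f"{post_id}.m4a",
    }


def save_post(post, root="."):
    """Write a post dict to data/<postId>.json, creating the directory if needed."""
    path = output_paths(post["postId"], root)["json"]
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Vietnamese text readable in the saved file.
    path.write_text(json.dumps(post, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Writing with `ensure_ascii=False` and an explicit UTF-8 encoding avoids `\uXXXX` escapes in titles and comments.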

Notes:
- Defaults aim for at least 100 posts total and at least one post with 20+ comments; the runner will fetch extra batches if needed.
- HTTP failures are logged and skipped; retries/backoff are enabled.
- Update selectors or category IDs in `core/config.py` if tuoitre.vn changes structure.
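
The "fetch extra batches if needed" rule can be sketched as a simple loop. This is only an assumed shape for what `core/runner.py` does: `fetch_batch` is a hypothetical callable returning a list of post dicts, `max_batches` is an invented safety limit, and only top-level comments are counted here.

```python
def crawl_until_satisfied(fetch_batch, min_total=100, min_comments=20, max_batches=50):
    """Keep fetching batches until both thresholds hold or the source runs dry."""
    posts = []
    for batch_index in range(max_batches):
        has_busy_post = any(
            len(post.get("comments", [])) >= min_comments for post in posts
        )
        if len(posts) >= min_total and has_busy_post:
            break  # both thresholds met
        batch = fetch_batch(batch_index)
        if not batch:
            break  # nothing left to fetch
        posts.extend(batch)
    return posts
```

The `max_batches` cap is there so the loop still terminates if no post ever reaches the comment threshold.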