News Crawler

Python crawler that scrapes posts from tuoitre.vn. Given three category URLs and desired counts per category, it downloads article metadata, full content, comments (with replies), vote reactions, images, and audio (when available), then stores them locally.

Features

  • Fetches posts per category until thresholds are met (by default, at least 100 posts total, with at least one post having 20 or more comments).
  • Parses title, author, date, category, content HTML, reactions, comments + replies.
  • Downloads images into images/<postId>/ and audio into audio/<postId>.m4a when present.
  • Persists each post as data/<postId>.json.
  • Resilient HTTP client with retries and basic logging.
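
The stopping condition above (100+ posts total, at least one with 20+ comments) could be checked with a helper along these lines. This is a minimal sketch; the function name and the `comments` field are assumptions, not the repo's actual API:

```python
def thresholds_met(posts, min_total=100, min_comments=20):
    """Return True once enough posts are collected and at least
    one post carries the required number of comments."""
    if len(posts) < min_total:
        return False
    # At least one post must have min_comments or more comments.
    return any(len(p.get("comments", [])) >= min_comments for p in posts)
```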

Project Structure

  • crawler.py — CLI entry point; handles args/prompts, kicks off crawling.
  • core/runner.py — coordinates category crawling, enforces min totals and comment threshold, saves outputs.
  • core/category.py — loads site categories, paginates category timelines, schedules article parsing.
  • core/article.py — parses an article page, downloads images/audio, gathers reactions and comments.
  • core/comments.py — fetches comments and replies via API and normalizes fields.
  • core/reactions.py — maps reaction codes to labels for posts/comments.
  • core/audio.py — checks and downloads audio TTS files for posts.
  • core/storage.py — ensures output directories, saves JSON, downloads files.
  • core/http.py — shared requests session with retries.
  • core/config.py — CSS selectors, category ID map, headers.

Requirements

  • Python 3.9+
  • See requirements.txt for Python dependencies.

Setup

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Usage

Provide three category URLs and (optionally) counts per category:

python crawler.py --categories <cat1_url> <cat2_url> <cat3_url> --per_cat 40 35 35

If omitted, the crawler will prompt interactively for three URLs and counts.
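
A sketch of how crawler.py's argument handling could be wired with argparse. The flag names follow the usage line above; the defaults and the omission of the interactive prompt fallback are assumptions:

```python
import argparse


def parse_args(argv=None):
    """Parse the --categories / --per_cat flags shown in the usage line.
    The interactive prompt fallback is omitted for brevity."""
    parser = argparse.ArgumentParser(description="Crawl posts from tuoitre.vn")
    parser.add_argument("--categories", nargs=3, metavar="URL",
                        help="three category URLs")
    parser.add_argument("--per_cat", nargs=3, type=int, default=[40, 35, 35],
                        help="number of posts to fetch per category")
    return parser.parse_args(argv)
```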

Outputs:

  • JSON: data/<postId>.json
  • Images: images/<postId>/...
  • Audio: audio/<postId>.m4a (if available)
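
The JSON output step in core/storage.py might look like this sketch; `save_post` and the `id` field name are hypothetical, but the data/<postId>.json layout matches the outputs listed above:

```python
import json
from pathlib import Path


def save_post(post: dict, data_dir: str = "data") -> Path:
    """Write one post to <data_dir>/<postId>.json, creating the
    directory if needed, and return the path written."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{post['id']}.json"
    # ensure_ascii=False keeps Vietnamese text readable in the file.
    path.write_text(json.dumps(post, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path
```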

Notes

  • Defaults aim for at least 100 posts total and at least one post with 20+ comments; the runner will fetch extra batches if needed.
  • HTTP failures are logged and skipped; retries/backoff are enabled.
  • Update selectors or category IDs in core/config.py if tuoitre.vn changes structure.
