News Crawler

Python crawler that scrapes posts from tuoitre.vn. Given three category URLs and desired counts per category, it downloads article metadata, full content, comments (with replies), vote reactions, images, and audio (when available), then stores them locally.

Features

  • Fetches posts per category until thresholds are met (by default, at least 100 posts total, with at least one post having 20 or more comments).
  • Parses title, author, date, category, content HTML, reactions, comments + replies.
  • Downloads images into images/<postId>/ and audio into audio/<postId>.m4a when present.
  • Persists each post as data/<postId>.json.
  • Resilient HTTP client with retries and basic logging.
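
The stopping condition above (100+ posts total, at least one with 20+ comments) could be checked with a helper along these lines. This is a minimal sketch; the function name and the `comments` field are assumptions, not the repo's actual API:

```python
def thresholds_met(posts, min_total=100, min_comments=20):
    """Return True once enough posts are collected and at least
    one post carries the required number of comments."""
    if len(posts) < min_total:
        return False
    # At least one post must have min_comments or more comments.
    return any(len(p.get("comments", [])) >= min_comments for p in posts)
```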

Project Structure

  • crawler.py — CLI entry point; handles args/prompts, kicks off crawling.
  • core/runner.py — coordinates category crawling, enforces min totals and comment threshold, saves outputs.
  • core/category.py — loads site categories, paginates category timelines, schedules article parsing.
  • core/article.py — parses an article page, downloads images/audio, gathers reactions and comments.
  • core/comments.py — fetches comments and replies via API and normalizes fields.
  • core/reactions.py — maps reaction codes to labels for posts/comments.
  • core/audio.py — checks and downloads audio TTS files for posts.
  • core/storage.py — ensures output directories, saves JSON, downloads files.
  • core/http.py — shared requests session with retries.
  • core/config.py — CSS selectors, category ID map, headers.

Requirements

  • Python 3.9+
  • See requirements.txt for Python dependencies.

Setup

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Usage

Provide three category URLs and (optionally) counts per category:

python crawler.py --categories <cat1_url> <cat2_url> <cat3_url> --per_cat 40 35 35

If omitted, the crawler will prompt interactively for three URLs and counts.
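
A sketch of how crawler.py's argument handling could be wired with argparse. The flag names follow the usage line above; the defaults and the omission of the interactive prompt fallback are assumptions:

```python
import argparse


def parse_args(argv=None):
    """Parse the --categories / --per_cat flags shown in the usage line.
    The interactive prompt fallback is omitted for brevity."""
    parser = argparse.ArgumentParser(description="Crawl posts from tuoitre.vn")
    parser.add_argument("--categories", nargs=3, metavar="URL",
                        help="three category URLs")
    parser.add_argument("--per_cat", nargs=3, type=int, default=[40, 35, 35],
                        help="number of posts to fetch per category")
    return parser.parse_args(argv)
```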

Outputs:

  • JSON: data/<postId>.json
  • Images: images/<postId>/...
  • Audio: audio/<postId>.m4a (if available)
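
The JSON output step in core/storage.py might look like this sketch; `save_post` and the `id` field name are hypothetical, but the data/<postId>.json layout matches the outputs listed above:

```python
import json
from pathlib import Path


def save_post(post: dict, data_dir: str = "data") -> Path:
    """Write one post to <data_dir>/<postId>.json, creating the
    directory if needed, and return the path written."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{post['id']}.json"
    # ensure_ascii=False keeps Vietnamese text readable in the file.
    path.write_text(json.dumps(post, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path
```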

Notes

  • Defaults aim for at least 100 posts total and at least one post with 20+ comments; the runner will fetch extra batches if needed.
  • HTTP failures are logged and skipped; retries/backoff are enabled.
  • Update selectors or category IDs in core/config.py if tuoitre.vn changes structure.
