
feat: add GNews Collector tool for automated news collection from Google News #2

Open
rubenszinho wants to merge 5 commits into main from tool/gnews

@rubenszinho
Collaborator

GNews Collector started as a notebook prototype used for news clipping collection at ICMC/USP. The prototype was refactored into a clean component with two distinct calls: general collection and source-filtered collection. Both calls automatically split the requested period into quarterly windows (to work around GNews' limit of 100 results per query), report progress asynchronously, and write Parquet output.
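
The period splitting described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `split_period` and `add_months` are hypothetical names, and the sketch assumes windows are aligned to calendar-month boundaries after the first one.

```python
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` months after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def split_period(start: date, end: date, window_months: int = 3) -> list[tuple[date, date]]:
    """Split [start, end] into consecutive windows of at most `window_months` months."""
    if window_months < 1:
        raise ValueError("window_months must be >= 1")
    windows = []
    cursor = start
    while cursor <= end:
        next_start = add_months(cursor, window_months)
        # Each window ends the day before the next one starts, clipped to `end`.
        window_end = min(next_start - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = next_start
    return windows
```

Each window is then queried separately, so a period that would exceed the per-query result limit as a whole is covered by several smaller queries.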

Closes #1

… Strategy pattern refactor

- Add SerpAPIBackend as primary backend via Google News Light API (paginated,
  no throttling); GNewsBackend remains as free fallback. Selection is automatic
  based on the new `serpapi_key` Valve field.
- Add `window_months` parameter to `collect_general_news` (default 3) to allow
  monthly windows for long-range, high-coverage collections.
- Refactor internal architecture with the Strategy pattern: extract
  _base_backend.py (NewsBackend ABC), _gnews_backend.py, and
  _serpapi_backend.py. Tools class is now a thin orchestrator delegating
  search logic to the active backend via the `_backend` property.
- Add `openpyxl` to the [gnews] optional dependency group for XLS export.
- Add scripts/collect_keywords_sp_rj.py: ready-to-run collection script for
  7 crime/security keywords over 2010-2025 with monthly windows, checkpoint/
  resume support, and consolidated XLS export.
- Update TOOLS.md to document the new modular structure and backend options.

BREAKING CHANGE: none — public method signatures, Valves schema, and output
parquet/JSON format are unchanged.
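
For reviewers, the backend split in this commit can be pictured roughly as below. This is a simplified sketch under stated assumptions: the `search` signature and the placeholder return values are hypothetical, and `serpapi_key` is shown as a plain constructor argument rather than a Valves field.

```python
from abc import ABC, abstractmethod


class NewsBackend(ABC):
    """Common interface that every search backend implements."""

    name: str

    @abstractmethod
    def search(self, query: str, start: str, end: str) -> list[dict]:
        """Return the articles matching `query` between `start` and `end`."""


class GNewsBackend(NewsBackend):
    name = "gnews"

    def search(self, query: str, start: str, end: str) -> list[dict]:
        # Placeholder result: a real backend would query GNews here.
        return [{"backend": self.name, "query": query}]


class SerpAPIBackend(NewsBackend):
    name = "serpapi"

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def search(self, query: str, start: str, end: str) -> list[dict]:
        # Placeholder result: a real backend would call SerpAPI here.
        return [{"backend": self.name, "query": query}]


class Tools:
    """Thin orchestrator that delegates searching to the active backend."""

    def __init__(self, serpapi_key: str = "") -> None:
        self.serpapi_key = serpapi_key

    @property
    def _backend(self) -> NewsBackend:
        # Use SerpAPI when a key is configured, otherwise fall back to GNews.
        if self.serpapi_key:
            return SerpAPIBackend(self.serpapi_key)
        return GNewsBackend()

    def collect_general_news(self, query: str, start: str, end: str) -> list[dict]:
        return self._backend.search(query, start, end)
```

Because callers only touch `Tools`, adding another backend means adding one `NewsBackend` subclass and one branch in the selection logic.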

Made-with: Cursor
…configuration

- Add `python-dotenv` to manage environment variables for SerpAPI key.
- Update `collect_keywords_sp_rj.py` to support automatic backend selection between SerpAPI and GNews, with improved documentation on usage.
- Refactor backend handling in `gnews_collector` to include backend-specific sleep logic and expose backend names for better clarity in logs.
- Adjust sleep duration handling to be conditional based on the selected backend, ensuring optimal performance and avoiding throttling.
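
The backend-conditional sleep could look roughly like the sketch below. The delay values and the names `BACKEND_DELAYS` and `pause_between_queries` are assumptions made for illustration; the commit message does not state the actual durations.

```python
import time
from typing import Callable

# Hypothetical per-backend delays in seconds (not taken from the PR).
BACKEND_DELAYS = {"serpapi": 0.0, "gnews": 2.0}
DEFAULT_DELAY = 1.0


def pause_between_queries(
    backend_name: str,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for the delay configured for `backend_name` and return that delay."""
    delay = BACKEND_DELAYS.get(backend_name, DEFAULT_DELAY)
    if delay > 0:
        sleep(delay)
    return delay
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.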

This update improves the flexibility and usability of the news collection tool.

- Update `pyproject.toml` to support Python 3.10.
- Enhance `collect_keywords_sp_rj.py` with automatic backend switching and improved error handling for rate limits.
- Introduce `RateLimitError` in backend classes to manage rate-limiting scenarios more effectively.
- Refactor backend selection logic in `gnews_collector` to allow for dynamic backend prioritization and automatic retries.
- Update documentation to reflect new features and usage instructions.

These changes enhance the robustness and flexibility of the news collection tool, ensuring smoother operation under varying conditions.
… error handling

- Update error messages to clarify backend removal during rate limit and fallback scenarios.
- Refactor backend priority handling to ensure the current backend is removed from the chain when encountering rate limits or errors.

These changes enhance the clarity of the logging and improve the backend management process in the news collection tool.
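
The fallback behaviour described in the last two commits can be sketched as below. This is an illustration under assumptions, not the PR's code: `search_with_fallback` and the `(name, search)` pair representation of the chain are hypothetical, and only `RateLimitError` is a name taken from the commit messages.

```python
import logging
from typing import Callable

logger = logging.getLogger("gnews_collector")

SearchFn = Callable[[str], list[dict]]


class RateLimitError(Exception):
    """Raised by a backend when the upstream service refuses further requests."""


def search_with_fallback(chain: list[tuple[str, SearchFn]], query: str) -> list[dict]:
    """Try backends in priority order, dropping each one that fails.

    `chain` is mutated in place, so later calls skip backends already removed.
    """
    while chain:
        name, search = chain[0]
        try:
            return search(query)
        except RateLimitError:
            logger.warning("Backend %s hit a rate limit; removing it from the chain", name)
        except Exception as exc:
            logger.warning("Backend %s failed (%s); removing it from the chain", name, exc)
        chain.pop(0)
    raise RuntimeError("All backends are exhausted")
```

Removing the failing backend from the shared chain means a rate-limited service is not retried on every subsequent window.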

Successfully merging this pull request may close these issues.

Task: Refactor the GNews collector for generic use in Agents4Gov and expose it as a tool