feat: add GNews Collector tool for automated news collection from Google News #2
Open
rubenszinho wants to merge 5 commits into main
… Strategy pattern refactor

- Add SerpAPIBackend as primary backend via Google News Light API (paginated, no throttling); GNewsBackend remains as free fallback. Selection is automatic based on the new `serpapi_key` Valve field.
- Add `window_months` parameter to `collect_general_news` (default 3) to allow monthly windows for long-range, high-coverage collections.
- Refactor internal architecture with the Strategy pattern: extract _base_backend.py (NewsBackend ABC), _gnews_backend.py, and _serpapi_backend.py. Tools class is now a thin orchestrator delegating search logic to the active backend via the `_backend` property.
- Add `openpyxl` to the [gnews] optional dependency group for XLS export.
- Add scripts/collect_keywords_sp_rj.py: ready-to-run collection script for 7 crime/security keywords over 2010-2025 with monthly windows, checkpoint/resume support, and consolidated XLS export.
- Update TOOLS.md to document the new modular structure and backend options.

BREAKING CHANGE: none. Public method signatures, Valves schema, and output parquet/JSON format are unchanged.

Made-with: Cursor
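
To make the Strategy-pattern layout described in this commit concrete, here is a minimal sketch, not the PR's actual code. The class names `NewsBackend`, `GNewsBackend`, `SerpAPIBackend`, and `Tools`, the `_backend` property, and the `serpapi_key` field come from the commit message; the `search` method, its signature, and the stubbed return values are assumptions made for illustration, and no real network calls are made.

```python
from abc import ABC, abstractmethod
from typing import Optional


class NewsBackend(ABC):
    """Strategy interface: every backend exposes the same search call."""

    name: str = "base"

    @abstractmethod
    def search(self, query: str, start: str, end: str) -> list[dict]:
        """Return articles for `query` between two ISO dates (hypothetical signature)."""


class GNewsBackend(NewsBackend):
    name = "gnews"

    def search(self, query: str, start: str, end: str) -> list[dict]:
        # Stub: a real backend would query Google News here.
        return [{"backend": self.name, "query": query}]


class SerpAPIBackend(NewsBackend):
    name = "serpapi"

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def search(self, query: str, start: str, end: str) -> list[dict]:
        # Stub: a real backend would page through SerpAPI results here.
        return [{"backend": self.name, "query": query}]


class Tools:
    """Thin orchestrator: holds configuration and delegates to the active backend."""

    def __init__(self, serpapi_key: Optional[str] = None) -> None:
        self.serpapi_key = serpapi_key

    @property
    def _backend(self) -> NewsBackend:
        # Automatic selection: SerpAPI when a key is configured, GNews otherwise.
        if self.serpapi_key:
            return SerpAPIBackend(self.serpapi_key)
        return GNewsBackend()

    def collect_general_news(self, query: str, start: str, end: str) -> list[dict]:
        return self._backend.search(query, start, end)
```

Because callers only ever touch `Tools`, adding a third backend under this layout would mean adding one subclass and one branch in `_backend`, with no change to the public methods.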
…configuration

- Add `python-dotenv` to manage environment variables for SerpAPI key.
- Update `collect_keywords_sp_rj.py` to support automatic backend selection between SerpAPI and GNews, with improved documentation on usage.
- Refactor backend handling in `gnews_collector` to include backend-specific sleep logic and expose backend names for better clarity in logs.
- Adjust sleep duration handling to be conditional based on the selected backend, ensuring optimal performance and avoiding throttling.

This update improves the flexibility and usability of the news collection tool.
- Update `pyproject.toml` to support Python 3.10.
- Enhance `collect_keywords_sp_rj.py` with automatic backend switching and improved error handling for rate limits.
- Introduce `RateLimitError` in backend classes to manage rate-limiting scenarios more effectively.
- Refactor backend selection logic in `gnews_collector` to allow for dynamic backend prioritization and automatic retries.
- Update documentation to reflect new features and usage instructions.

These changes enhance the robustness and flexibility of the news collection tool, ensuring smoother operation under varying conditions.
… error handling

- Update error messages to clarify backend removal during rate limit and fallback scenarios.
- Refactor backend priority handling to ensure the current backend is removed from the chain when encountering rate limits or errors.

These changes enhance the clarity of the logging and improve the backend management process in the news collection tool.
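
The fallback behaviour described in these commits (a priority chain of backends, with the current backend removed when it hits a rate limit or another error) could look roughly like the sketch below. Only the name `RateLimitError` comes from the commit messages; `search_with_fallback`, the `(name, callable)` chain shape, and the log wording are hypothetical and chosen to keep the example self-contained.

```python
from typing import Callable


class RateLimitError(Exception):
    """Raised by a backend when the upstream service throttles requests."""


def search_with_fallback(
    chain: list[tuple[str, Callable[[str], list]]], query: str
) -> tuple[str, list]:
    """Try backends in priority order; drop the current one when it fails.

    `chain` is mutated in place, so a backend removed here stays removed
    for later calls that reuse the same list.
    """
    while chain:
        name, search = chain[0]
        try:
            return name, search(query)
        except RateLimitError:
            print(f"[{name}] rate limited; removing it from the chain")
        except Exception as exc:
            print(f"[{name}] failed ({exc}); removing it from the chain")
        chain.pop(0)  # only reached when the current backend raised
    raise RuntimeError("all backends exhausted")


def throttled_backend(query: str) -> list:
    raise RateLimitError("quota exceeded")


def working_backend(query: str) -> list:
    return [f"article about {query}"]


chain = [("serpapi", throttled_backend), ("gnews", working_backend)]
used, articles = search_with_fallback(chain, "security")
```

After this call, `chain` contains only the `gnews` entry, so subsequent queries skip the throttled backend instead of retrying it on every window.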
GNews Collector was built from a notebook prototype used for clipping collection at ICMC/USP. The prototype was refactored into a clean component with two distinct calls: general collection and source-filtered collection. Both feature automatic period splitting into quarterly windows (to bypass GNews' 100-results-per-query limit), asynchronous progress tracking, and Parquet output.
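
As an illustration of the period splitting mentioned above, the following sketch divides a date range into fixed-size month windows so that each query stays under a per-query result cap. It is an assumption about how such splitting might be done, not the PR's implementation: `split_windows` and `_add_months` are hypothetical names, windows are half-open `[start, end)`, and every window after the first is aligned to the first day of a month. The `window_months` default of 3 mirrors the parameter named in the commit messages.

```python
from datetime import date


def _add_months(d: date, months: int) -> date:
    """Return the first day of the month `months` months after `d`'s month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def split_windows(start: date, end: date, window_months: int = 3) -> list[tuple[date, date]]:
    """Split [start, end) into consecutive windows of at most `window_months` months."""
    if window_months < 1:
        raise ValueError("window_months must be at least 1")
    windows = []
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(_add_months(cursor, window_months), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


quarters = split_windows(date(2024, 1, 1), date(2025, 1, 1))
months = split_windows(date(2024, 1, 1), date(2025, 1, 1), window_months=1)
```

Here `quarters` holds four windows and `months` holds twelve; smaller windows trade more requests for a lower chance of hitting the cap in a busy period.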
Closes #1