Implement Firecrawl caching to reduce API credit usage (Issue 46) #151
base: main
Conversation
This commit implements comprehensive caching functionality for Firecrawl web scraping to address issue promptdriven#46: Cache firecrawl results so it doesn't use up the API credit.

Features implemented:
- SQLite-based persistent caching with configurable TTL
- URL normalization for consistent cache keys
- Automatic cleanup and size management
- Dual-layer caching (client-side + Firecrawl's maxAge parameter)
- CLI commands for cache management (stats, clear, info, check)
- Environment variable configuration
- Comprehensive test suite with 20+ test cases
- Complete documentation with usage examples

Files added:
- pdd/firecrawl_cache.py: Core caching functionality
- pdd/firecrawl_cache_cli.py: CLI commands for cache management
- tests/test_firecrawl_cache.py: Comprehensive test suite
- docs/firecrawl-caching.md: Complete documentation

Files modified:
- pdd/preprocess.py: Updated to use caching with dual-layer approach
- pdd/cli.py: Added firecrawl-cache command group

Configuration options:
- FIRECRAWL_CACHE_ENABLE (default: true)
- FIRECRAWL_CACHE_TTL_HOURS (default: 24)
- FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
- FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
- FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)

CLI commands:
- `pdd firecrawl-cache stats`: View cache statistics
- `pdd firecrawl-cache clear`: Clear all cached entries
- `pdd firecrawl-cache info`: Show configuration
- `pdd firecrawl-cache check --url <url>`: Check specific URL

Benefits:
- Significant reduction in API credit usage
- Faster response times for cached content
- Improved reliability with offline capability
- Transparent integration with existing `<web>` tags
- Comprehensive management through CLI tools
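The diff itself isn't reproduced in this thread, so as a rough illustration of the pieces described above (URL normalization, hashed cache keys, and TTL-based expiry in SQLite), here is a minimal sketch. All names, the table schema, and the helper functions are assumptions for illustration, not code from the PR:

```python
import hashlib
import sqlite3
import time
from urllib.parse import urlsplit, urlunsplit

TTL_SECONDS = 24 * 3600  # mirrors the FIRECRAWL_CACHE_TTL_HOURS=24 default

def normalize_url(url: str) -> str:
    """Normalize a URL so equivalent spellings map to the same cache key."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",  # drop the fragment; it never changes the server response
    ))

def cache_key(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()

def get_cached(conn: sqlite3.Connection, url: str):
    """Return cached content for a URL, or None on a miss or expired entry."""
    row = conn.execute(
        "SELECT content, created_at FROM firecrawl_cache WHERE key = ?",
        (cache_key(url),),
    ).fetchone()
    if row is None:
        return None
    content, created_at = row
    if time.time() - created_at > TTL_SECONDS:
        return None  # stale; an auto-cleanup pass can delete it later
    return content
```

Hashing the normalized URL keeps cache keys fixed-length, so arbitrarily long URLs never become oversized primary keys.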
Pull request overview
This PR implements a comprehensive caching system for Firecrawl web scraping to reduce API credit consumption. The caching layer uses SQLite for persistent storage and integrates seamlessly with existing <web> tags in prompts.
Key changes:
- SQLite-based caching with configurable TTL, size limits, and automatic cleanup
- CLI commands for cache management (stats, clear, info, check)
- Integration with preprocess.py to cache web scraping results automatically
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pdd/firecrawl_cache.py | Core caching implementation with SQLite backend and URL normalization |
| pdd/firecrawl_cache_cli.py | CLI commands for cache management and statistics |
| pdd/preprocess.py | Integration of caching into web tag processing workflow |
| pdd/cli.py | Registration of firecrawl-cache CLI command group |
| tests/test_firecrawl_cache.py | Comprehensive test suite covering cache functionality and integration |
| docs/firecrawl-caching.md | Complete documentation for the caching feature |
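For the pdd/preprocess.py integration, the dual-layer flow described in the PR (a local SQLite cache in front of Firecrawl's maxAge parameter) would look roughly like the sketch below. This is an assumption-laden outline, not the PR's actual code: the helpers come from the sketch above, and the exact shape of the Firecrawl SDK call (keyword name, return type) varies between SDK versions.

```python
import time
from firecrawl import FirecrawlApp  # assumes the official Firecrawl Python SDK

def fetch_web_content(conn, url: str, api_key: str, ttl_hours: int = 24) -> str:
    # Layer 1: the local SQLite cache -- a hit here costs no API credits.
    cached = get_cached(conn, url)
    if cached is not None:
        return cached

    # Layer 2: ask Firecrawl for its own cached copy if one is younger than
    # maxAge (milliseconds); only past that age does it re-scrape the page.
    app = FirecrawlApp(api_key=api_key)
    result = app.scrape_url(url, maxAge=ttl_hours * 3_600_000)
    content = result.get("markdown", "")  # dict-style result assumed here

    # Write through to layer 1 so the next lookup never leaves the machine.
    conn.execute(
        "INSERT OR REPLACE INTO firecrawl_cache (key, content, created_at) "
        "VALUES (?, ?, ?)",
        (cache_key(url), content, time.time()),
    )
    conn.commit()
    return content
```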
The first review comment targets this hunk:

```python
def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()
```
**Copilot AI** commented on Nov 24, 2025:
Duplicate function definition: `get_firecrawl_cache_stats` is defined twice (lines 384-387 and 389-391); since a later `def` silently rebinds the name, only the second definition takes effect. Remove the duplicate at lines 389-391.
The second comment targets this hunk from the test suite:

```python
row = cursor.fetchone()
assert row is not None
stored_metadata = eval(row[0])  # Simple eval for test
```
**Copilot AI** commented on Nov 24, 2025:
Using `eval()` is a security risk even in tests. Use `json.loads()` instead, since the metadata is stored as JSON.
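Applied to the snippet above, the fix is a one-line change plus an import, since the metadata column holds JSON text:

```python
import json

row = cursor.fetchone()
assert row is not None
stored_metadata = json.loads(row[0])  # parse, don't execute, the stored JSON
```

Unlike `eval()`, `json.loads()` cannot execute expressions embedded in the database, so a corrupted or tampered row fails with a `ValueError` instead of running arbitrary code.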
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR adds comprehensive caching functionality for web scraping using Firecrawl, addressing API quota management and improving efficiency. The key changes are summarized in the commit message and review above.

Resolves #46.