
Conversation

@mathurshrenya
Contributor

This commit implements comprehensive caching for Firecrawl web scraping to address issue #46: "Cache firecrawl results so it doesn't use up the API credit."

Features implemented:

  • SQLite-based persistent caching with configurable TTL
  • URL normalization for consistent cache keys
  • Automatic cleanup and size management
  • Dual-layer caching (client-side + Firecrawl's maxAge parameter)
  • CLI commands for cache management (stats, clear, info, check)
  • Environment variable configuration
  • Comprehensive test suite with 20+ test cases
  • Complete documentation with usage examples
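The SQLite persistence, URL normalization, and TTL expiry described above could be sketched roughly as follows. This is a minimal illustration, not the actual pdd/firecrawl_cache.py implementation; all names here are hypothetical:

```python
import json
import sqlite3
import time
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Lowercase scheme/host, drop the fragment, and strip a trailing slash
    # so equivalent URLs map to the same cache key.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

class FirecrawlCache:
    """Hypothetical sketch of a TTL-bounded SQLite cache for scrape results."""

    def __init__(self, path: str = ":memory:", ttl_hours: float = 24):
        self.ttl_seconds = ttl_hours * 3600
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
        )

    def get(self, url: str):
        # Return the cached payload, or None if absent or expired.
        row = self.db.execute(
            "SELECT payload, fetched_at FROM cache WHERE url = ?",
            (normalize_url(url),),
        ).fetchone()
        if row and time.time() - row[1] < self.ttl_seconds:
            return json.loads(row[0])
        return None

    def put(self, url: str, payload: dict) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (normalize_url(url), json.dumps(payload), time.time()),
        )
        self.db.commit()
```

In the dual-layer approach, a client-side cache like this would be consulted first, and only on a miss would the request go to Firecrawl (where the maxAge parameter lets Firecrawl serve its own cached copy).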

Files added:

  • pdd/firecrawl_cache.py: Core caching functionality
  • pdd/firecrawl_cache_cli.py: CLI commands for cache management
  • tests/test_firecrawl_cache.py: Comprehensive test suite
  • docs/firecrawl-caching.md: Complete documentation

Files modified:

  • pdd/preprocess.py: Updated to use caching with dual-layer approach
  • pdd/cli.py: Added firecrawl-cache command group

Configuration options:

  • FIRECRAWL_CACHE_ENABLE (default: true)
  • FIRECRAWL_CACHE_TTL_HOURS (default: 24)
  • FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
  • FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
  • FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)
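A loader for these variables might look like the following sketch. The helper names are hypothetical and only mirror the documented defaults, not the actual pdd code:

```python
import os

def _env_bool(name: str, default: bool) -> bool:
    # Treat "1", "true", "yes" (any case) as truthy.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def load_cache_config() -> dict:
    # Hypothetical config loader mirroring the documented defaults.
    return {
        "enabled": _env_bool("FIRECRAWL_CACHE_ENABLE", True),
        "ttl_hours": int(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24")),
        "max_size_mb": int(os.environ.get("FIRECRAWL_CACHE_MAX_SIZE_MB", "100")),
        "max_entries": int(os.environ.get("FIRECRAWL_CACHE_MAX_ENTRIES", "1000")),
        "auto_cleanup": _env_bool("FIRECRAWL_CACHE_AUTO_CLEANUP", True),
    }
```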

CLI commands:

  • pdd firecrawl-cache stats: View cache statistics
  • pdd firecrawl-cache clear: Clear all cached entries
  • pdd firecrawl-cache info: Show configuration
  • pdd firecrawl-cache check --url <url>: Check specific URL

Benefits:

  • Significant reduction in API credit usage
  • Faster response times for cached content
  • Improved reliability with offline capability
  • Transparent integration with existing <web> tags
  • Comprehensive management through CLI tools

@mathurshrenya mathurshrenya marked this pull request as draft September 24, 2025 03:02
@gltanaka
Contributor

target 10/17

@gltanaka
Contributor

target 10/20

@mathurshrenya
Contributor Author

Target 10/22

