
Conversation

@ajalote2

This PR adds comprehensive caching for Firecrawl web scraping, reducing API credit consumption and improving efficiency. Key changes:

  • Added caching layer (SQLite-based) in firecrawl_cache.py (core idea sketched below)
  • CLI enhancements for cache management (check, clear, info, stats) in firecrawl_cache_cli.py and cli.py
  • Integrated caching into preprocess.py for seamless web tag processing
  • Comprehensive unit tests in tests/test_firecrawl_cache.py (covers cache retrieval, expiration, normalization, CLI, and integration)
  • Handles cache directory configuration and cache metadata (TTL, access counts, etc.)

Resolves #46.
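To orient reviewers, here is a minimal sketch of the kind of SQLite-backed, TTL-based cache the first bullet describes. The class name, table schema, and method names are illustrative assumptions, not the actual firecrawl_cache.py implementation:

# Illustrative sketch only: SQLite-backed cache with TTL and access counts.
# Real implementation: pdd/firecrawl_cache.py.
import sqlite3
import time

class FirecrawlCache:
    def __init__(self, db_path="firecrawl_cache.db", ttl_hours=24.0):
        self.ttl_seconds = ttl_hours * 3600
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, "
            "content TEXT NOT NULL, "
            "created_at REAL NOT NULL, "
            "access_count INTEGER NOT NULL DEFAULT 0)"
        )

    def get(self, url):
        row = self.conn.execute(
            "SELECT content, created_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return None                      # miss
        content, created_at = row
        if time.time() - created_at > self.ttl_seconds:
            self.conn.execute("DELETE FROM cache WHERE url = ?", (url,))
            self.conn.commit()
            return None                      # expired: treat as a miss
        self.conn.execute(
            "UPDATE cache SET access_count = access_count + 1 WHERE url = ?",
            (url,),
        )
        self.conn.commit()
        return content

    def set(self, url, content):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, content, created_at) "
            "VALUES (?, ?, ?)",
            (url, content, time.time()),
        )
        self.conn.commit()

On a hit, the entry's access count is bumped; on an expired hit, the row is deleted and the lookup reports a miss, so the caller falls back to a live scrape.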

mathurshrenya and others added 10 commits September 23, 2025 23:01
This commit implements comprehensive caching functionality for Firecrawl web scraping
to address issue promptdriven#46: Cache firecrawl results so it doesn't use up the API credit.

Features implemented:
- SQLite-based persistent caching with configurable TTL
- URL normalization for consistent cache keys
- Automatic cleanup and size management
- Dual-layer caching (client-side + Firecrawl's maxAge parameter; sketched below)
- CLI commands for cache management (stats, clear, info, check)
- Environment variable configuration
- Comprehensive test suite with 20+ test cases
- Complete documentation with usage examples
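
To make the dual-layer bullet concrete, here is a sketch of the lookup order: the local SQLite cache is consulted first, and only on a miss does the code call Firecrawl, passing maxAge so Firecrawl may serve its own server-side cached copy. The scrape_url call, its maxAge keyword (milliseconds), and the dict-shaped result are assumptions about the firecrawl-py client, not confirmed from this diff:

# Sketch of the dual-layer lookup. scrape_url, its maxAge keyword, and the
# result shape are assumptions about firecrawl-py; the real integration is
# in pdd/preprocess.py.
def fetch_url(url, cache, firecrawl_app, ttl_hours=24.0):
    content = cache.get(url)                 # layer 1: local SQLite cache
    if content is not None:
        return content
    result = firecrawl_app.scrape_url(       # layer 2: Firecrawl, which may
        url,                                 # serve its own cached copy
        maxAge=int(ttl_hours * 3600 * 1000),
    )
    content = result.get("markdown", "")     # assumed result shape
    cache.set(url, content)                  # populate the local layer
    return content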

Files added:
- pdd/firecrawl_cache.py: Core caching functionality
- pdd/firecrawl_cache_cli.py: CLI commands for cache management
- tests/test_firecrawl_cache.py: Comprehensive test suite
- docs/firecrawl-caching.md: Complete documentation

Files modified:
- pdd/preprocess.py: Updated to use caching with dual-layer approach
- pdd/cli.py: Added firecrawl-cache command group

Configuration options (loader sketched below):
- FIRECRAWL_CACHE_ENABLE (default: true)
- FIRECRAWL_CACHE_TTL_HOURS (default: 24)
- FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
- FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
- FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)
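
A loader for these variables might look like the following; the variable names and defaults mirror the list above, while the helper itself is hypothetical:

# Hypothetical loader for the variables listed above.
import os

def _env_bool(name, default):
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def load_cache_config():
    return {
        "enabled": _env_bool("FIRECRAWL_CACHE_ENABLE", True),
        "ttl_hours": float(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24")),
        "max_size_mb": float(os.environ.get("FIRECRAWL_CACHE_MAX_SIZE_MB", "100")),
        "max_entries": int(os.environ.get("FIRECRAWL_CACHE_MAX_ENTRIES", "1000")),
        "auto_cleanup": _env_bool("FIRECRAWL_CACHE_AUTO_CLEANUP", True),
    }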

CLI commands (command group sketched below):
- pdd firecrawl-cache stats: View cache statistics
- pdd firecrawl-cache clear: Clear all cached entries
- pdd firecrawl-cache info: Show configuration
- pdd firecrawl-cache check --url <url>: Check specific URL
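
A sketch of how such a command group could be wired up, assuming pdd's CLI is built on Click; get_firecrawl_cache() is the singleton getter quoted in the review comments below, and its import path is an assumption:

# Sketch assuming a Click-based CLI; names and import path are assumptions.
import click
from pdd.firecrawl_cache import get_firecrawl_cache  # assumed import path

@click.group(name="firecrawl-cache")
def firecrawl_cache():
    """Manage the Firecrawl scrape cache."""

@firecrawl_cache.command()
def stats():
    """View cache statistics."""
    for key, value in get_firecrawl_cache().get_stats().items():
        click.echo(f"{key}: {value}")

@firecrawl_cache.command()
def clear():
    """Clear all cached entries."""
    get_firecrawl_cache().clear()  # clear() is an assumed method name
    click.echo("Cache cleared.")

@firecrawl_cache.command()
@click.option("--url", required=True, help="URL to look up in the cache.")
def check(url):
    """Check whether a specific URL is cached."""
    hit = get_firecrawl_cache().get(url) is not None
    click.echo("cached" if hit else "not cached")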

Benefits:
- Significant reduction in API credit usage
- Faster response times for cached content
- Improved reliability: cached content stays available offline
- Transparent integration with existing <web> tags
- Comprehensive management through CLI tools
@ajalote2 ajalote2 closed this Nov 24, 2025
@ajalote2 ajalote2 reopened this Nov 24, 2025
@ajalote2 ajalote2 marked this pull request as draft November 24, 2025 16:05
@gltanaka gltanaka requested a review from Copilot November 24, 2025 17:43

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive caching system for Firecrawl web scraping to reduce API credit consumption. The caching layer uses SQLite for persistent storage and integrates seamlessly with existing <web> tags in prompts.

Key changes:

  • SQLite-based caching with configurable TTL, size limits, and automatic cleanup
  • CLI commands for cache management (stats, clear, info, check)
  • Integration with preprocess.py to cache web scraping results automatically

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:
  • pdd/firecrawl_cache.py: Core caching implementation with SQLite backend and URL normalization (normalization sketched below)
  • pdd/firecrawl_cache_cli.py: CLI commands for cache management and statistics
  • pdd/preprocess.py: Integration of caching into the web tag processing workflow
  • pdd/cli.py: Registration of the firecrawl-cache CLI command group
  • tests/test_firecrawl_cache.py: Comprehensive test suite covering cache functionality and integration
  • docs/firecrawl-caching.md: Complete documentation for the caching feature
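
Since the summary credits firecrawl_cache.py with URL normalization, here is a sketch of typical normalization rules for cache keys: lowercase the scheme and host, drop fragments and default ports, and sort query parameters. The exact rules in the PR may differ:

# Sketch of cache-key normalization; the PR's exact rules may differ.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    default_port = {"http": ":80", "https": ":443"}.get(scheme)
    if default_port and netloc.endswith(default_port):
        netloc = netloc[: -len(default_port)]          # drop default port
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query params
    return urlunsplit((scheme, netloc, parts.path or "/", query, ""))

Under these rules, HTTPS://Example.com:443/docs?b=2&a=1#intro and https://example.com/docs?a=1&b=2 produce the same key, so equivalent URLs share one cache entry.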


Comment on lines +388 to +391

def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()

Copilot AI Nov 24, 2025


Duplicate function definition: get_firecrawl_cache_stats is defined twice (lines 384-387 and 389-391). Remove the duplicate at lines 389-391.

Suggested change (delete these duplicate lines):
def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()

)
row = cursor.fetchone()
assert row is not None
stored_metadata = eval(row[0]) # Simple eval for test

Copilot AI Nov 24, 2025


Using eval() is a security risk even in tests. Use json.loads() instead since the metadata is stored as JSON.
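
A sketch of the fix, assuming the metadata column holds a JSON string (cursor comes from the surrounding test):

import json

row = cursor.fetchone()
assert row is not None
stored_metadata = json.loads(row[0])  # parses the stored JSON string safely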

ajalote2 and others added 6 commits November 24, 2025 11:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>