
Conversation

@ajalote2

This PR adds comprehensive caching for Firecrawl web scraping, reducing API credit consumption and improving efficiency. Key changes:

  • Added caching layer (SQLite-based) in firecrawl_cache.py (core idea sketched below)
  • CLI enhancements for cache management (check, clear, info, stats) in firecrawl_cache_cli.py and cli.py
  • Integrated caching into preprocess.py for seamless web tag processing
  • Comprehensive unit tests in tests/test_firecrawl_cache.py (covers cache retrieval, expiration, normalization, CLI, and integration)
  • Handles cache directory configuration and cache metadata (TTL, access counts, etc.)

Resolves #46.
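To orient reviewers, here is a minimal sketch of the kind of SQLite-backed, TTL-based cache the first bullet describes. The class name, table schema, and method names are illustrative assumptions, not the actual firecrawl_cache.py implementation:

# Illustrative sketch only: SQLite-backed cache with TTL and access counts.
# Real implementation: pdd/firecrawl_cache.py.
import sqlite3
import time

class FirecrawlCache:
    def __init__(self, db_path="firecrawl_cache.db", ttl_hours=24.0):
        self.ttl_seconds = ttl_hours * 3600
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, "
            "content TEXT NOT NULL, "
            "created_at REAL NOT NULL, "
            "access_count INTEGER NOT NULL DEFAULT 0)"
        )

    def get(self, url):
        row = self.conn.execute(
            "SELECT content, created_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return None                      # miss
        content, created_at = row
        if time.time() - created_at > self.ttl_seconds:
            self.conn.execute("DELETE FROM cache WHERE url = ?", (url,))
            self.conn.commit()
            return None                      # expired: treat as a miss
        self.conn.execute(
            "UPDATE cache SET access_count = access_count + 1 WHERE url = ?",
            (url,),
        )
        self.conn.commit()
        return content

    def set(self, url, content):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (url, content, created_at) "
            "VALUES (?, ?, ?)",
            (url, content, time.time()),
        )
        self.conn.commit()

On a hit, the entry's access count is bumped; on an expired hit, the row is deleted and the lookup reports a miss, so the caller falls back to a live scrape.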

mathurshrenya and others added 10 commits September 23, 2025 23:01
This commit implements comprehensive caching functionality for Firecrawl web scraping
to address issue promptdriven#46: Cache firecrawl results so it doesn't use up the API credit.

Features implemented:
- SQLite-based persistent caching with configurable TTL
- URL normalization for consistent cache keys
- Automatic cleanup and size management
- Dual-layer caching (client-side + Firecrawl's maxAge parameter; sketched below)
- CLI commands for cache management (stats, clear, info, check)
- Environment variable configuration
- Comprehensive test suite with 20+ test cases
- Complete documentation with usage examples
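
To make the dual-layer bullet concrete, here is a sketch of the lookup order: the local SQLite cache is consulted first, and only on a miss does the code call Firecrawl, passing maxAge so Firecrawl may serve its own server-side cached copy. The scrape_url call, its maxAge keyword (milliseconds), and the dict-shaped result are assumptions about the firecrawl-py client, not confirmed from this diff:

# Sketch of the dual-layer lookup. scrape_url, its maxAge keyword, and the
# result shape are assumptions about firecrawl-py; the real integration is
# in pdd/preprocess.py.
def fetch_url(url, cache, firecrawl_app, ttl_hours=24.0):
    content = cache.get(url)                 # layer 1: local SQLite cache
    if content is not None:
        return content
    result = firecrawl_app.scrape_url(       # layer 2: Firecrawl, which may
        url,                                 # serve its own cached copy
        maxAge=int(ttl_hours * 3600 * 1000),
    )
    content = result.get("markdown", "")     # assumed result shape
    cache.set(url, content)                  # populate the local layer
    return content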

Files added:
- pdd/firecrawl_cache.py: Core caching functionality
- pdd/firecrawl_cache_cli.py: CLI commands for cache management
- tests/test_firecrawl_cache.py: Comprehensive test suite
- docs/firecrawl-caching.md: Complete documentation

Files modified:
- pdd/preprocess.py: Updated to use caching with dual-layer approach
- pdd/cli.py: Added firecrawl-cache command group

Configuration options (loader sketched below):
- FIRECRAWL_CACHE_ENABLE (default: true)
- FIRECRAWL_CACHE_TTL_HOURS (default: 24)
- FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
- FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
- FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)
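
A loader for these variables might look like the following; the variable names and defaults mirror the list above, while the helper itself is hypothetical:

# Hypothetical loader for the variables listed above.
import os

def _env_bool(name, default):
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def load_cache_config():
    return {
        "enabled": _env_bool("FIRECRAWL_CACHE_ENABLE", True),
        "ttl_hours": float(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24")),
        "max_size_mb": float(os.environ.get("FIRECRAWL_CACHE_MAX_SIZE_MB", "100")),
        "max_entries": int(os.environ.get("FIRECRAWL_CACHE_MAX_ENTRIES", "1000")),
        "auto_cleanup": _env_bool("FIRECRAWL_CACHE_AUTO_CLEANUP", True),
    }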

CLI commands (command group sketched below):
- pdd firecrawl-cache stats: View cache statistics
- pdd firecrawl-cache clear: Clear all cached entries
- pdd firecrawl-cache info: Show configuration
- pdd firecrawl-cache check --url <url>: Check specific URL
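
A sketch of how such a command group could be wired up, assuming pdd's CLI is built on Click; get_firecrawl_cache() is the singleton getter quoted in the review comments below, and its import path is an assumption:

# Sketch assuming a Click-based CLI; names and import path are assumptions.
import click
from pdd.firecrawl_cache import get_firecrawl_cache  # assumed import path

@click.group(name="firecrawl-cache")
def firecrawl_cache():
    """Manage the Firecrawl scrape cache."""

@firecrawl_cache.command()
def stats():
    """View cache statistics."""
    for key, value in get_firecrawl_cache().get_stats().items():
        click.echo(f"{key}: {value}")

@firecrawl_cache.command()
def clear():
    """Clear all cached entries."""
    get_firecrawl_cache().clear()  # clear() is an assumed method name
    click.echo("Cache cleared.")

@firecrawl_cache.command()
@click.option("--url", required=True, help="URL to look up in the cache.")
def check(url):
    """Check whether a specific URL is cached."""
    hit = get_firecrawl_cache().get(url) is not None
    click.echo("cached" if hit else "not cached")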

Benefits:
- Significant reduction in API credit usage
- Faster response times for cached content
- Improved reliability: cached content stays available offline
- Transparent integration with existing <web> tags
- Comprehensive management through CLI tools
@ajalote2 ajalote2 closed this Nov 24, 2025
@ajalote2 ajalote2 reopened this Nov 24, 2025
@ajalote2 ajalote2 marked this pull request as draft November 24, 2025 16:05
@gltanaka gltanaka requested a review from Copilot November 24, 2025 17:43

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive caching system for Firecrawl web scraping to reduce API credit consumption. The caching layer uses SQLite for persistent storage and integrates seamlessly with existing <web> tags in prompts.

Key changes:

  • SQLite-based caching with configurable TTL, size limits, and automatic cleanup
  • CLI commands for cache management (stats, clear, info, check)
  • Integration with preprocess.py to cache web scraping results automatically

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:
  • pdd/firecrawl_cache.py: Core caching implementation with SQLite backend and URL normalization (normalization sketched below)
  • pdd/firecrawl_cache_cli.py: CLI commands for cache management and statistics
  • pdd/preprocess.py: Integration of caching into the web tag processing workflow
  • pdd/cli.py: Registration of the firecrawl-cache CLI command group
  • tests/test_firecrawl_cache.py: Comprehensive test suite covering cache functionality and integration
  • docs/firecrawl-caching.md: Complete documentation for the caching feature
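
Since the summary credits firecrawl_cache.py with URL normalization, here is a sketch of typical normalization rules for cache keys: lowercase the scheme and host, drop fragments and default ports, and sort query parameters. The exact rules in the PR may differ:

# Sketch of cache-key normalization; the PR's exact rules may differ.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    default_port = {"http": ":80", "https": ":443"}.get(scheme)
    if default_port and netloc.endswith(default_port):
        netloc = netloc[: -len(default_port)]          # drop default port
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query params
    return urlunsplit((scheme, netloc, parts.path or "/", query, ""))

Under these rules, HTTPS://Example.com:443/docs?b=2&a=1#intro and https://example.com/docs?a=1&b=2 produce the same key, so equivalent URLs share one cache entry.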


Comment on lines +388 to +391

def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()

Copilot AI Nov 24, 2025


Duplicate function definition: get_firecrawl_cache_stats is defined twice (lines 384-387 and 389-391). Remove the duplicate at lines 389-391.

Suggested change (delete these duplicate lines):
def get_firecrawl_cache_stats():
    cache = get_firecrawl_cache()  # your singleton/getter
    return cache.get_stats()

)
row = cursor.fetchone()
assert row is not None
stored_metadata = eval(row[0]) # Simple eval for test

Copilot AI Nov 24, 2025


Using eval() is a security risk even in tests. Use json.loads() instead since the metadata is stored as JSON.
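
A sketch of the fix, assuming the metadata column holds a JSON string (cursor comes from the surrounding test):

import json

row = cursor.fetchone()
assert row is not None
stored_metadata = json.loads(row[0])  # parses the stored JSON string safely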

ajalote2 and others added 6 commits November 24, 2025 11:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>