
Conversation

@mathurshrenya
Contributor

This commit implements comprehensive caching for Firecrawl web scraping to address issue #46: "Cache firecrawl results so it doesn't use up the API credit."

Features implemented:

  • SQLite-based persistent caching with configurable TTL
  • URL normalization for consistent cache keys
  • Automatic cleanup and size management
  • Dual-layer caching (client-side + Firecrawl's maxAge parameter)
  • CLI commands for cache management (stats, clear, info, check)
  • Environment variable configuration
  • Comprehensive test suite with 20+ test cases
  • Complete documentation with usage examples
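The SQLite persistence, URL normalization, and TTL expiry described above could be sketched roughly as follows. This is a minimal illustration, not the actual pdd/firecrawl_cache.py implementation; all names here are hypothetical:

```python
import json
import sqlite3
import time
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Lowercase scheme/host, drop the fragment, and strip a trailing slash
    # so equivalent URLs map to the same cache key.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

class FirecrawlCache:
    """Hypothetical sketch of a TTL-bounded SQLite cache for scrape results."""

    def __init__(self, path: str = ":memory:", ttl_hours: float = 24):
        self.ttl_seconds = ttl_hours * 3600
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "url TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
        )

    def get(self, url: str):
        # Return the cached payload, or None if absent or expired.
        row = self.db.execute(
            "SELECT payload, fetched_at FROM cache WHERE url = ?",
            (normalize_url(url),),
        ).fetchone()
        if row and time.time() - row[1] < self.ttl_seconds:
            return json.loads(row[0])
        return None

    def put(self, url: str, payload: dict) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (normalize_url(url), json.dumps(payload), time.time()),
        )
        self.db.commit()
```

In the dual-layer approach, a client-side cache like this would be consulted first, and only on a miss would the request go to Firecrawl (where the maxAge parameter lets Firecrawl serve its own cached copy).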

Files added:

  • pdd/firecrawl_cache.py: Core caching functionality
  • pdd/firecrawl_cache_cli.py: CLI commands for cache management
  • tests/test_firecrawl_cache.py: Comprehensive test suite
  • docs/firecrawl-caching.md: Complete documentation

Files modified:

  • pdd/preprocess.py: Updated to use caching with dual-layer approach
  • pdd/cli.py: Added firecrawl-cache command group

Configuration options:

  • FIRECRAWL_CACHE_ENABLE (default: true)
  • FIRECRAWL_CACHE_TTL_HOURS (default: 24)
  • FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
  • FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
  • FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)
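A loader for these variables might look like the following sketch. The helper names are hypothetical and only mirror the documented defaults, not the actual pdd code:

```python
import os

def _env_bool(name: str, default: bool) -> bool:
    # Treat "1", "true", "yes" (any case) as truthy.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def load_cache_config() -> dict:
    # Hypothetical config loader mirroring the documented defaults.
    return {
        "enabled": _env_bool("FIRECRAWL_CACHE_ENABLE", True),
        "ttl_hours": int(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24")),
        "max_size_mb": int(os.environ.get("FIRECRAWL_CACHE_MAX_SIZE_MB", "100")),
        "max_entries": int(os.environ.get("FIRECRAWL_CACHE_MAX_ENTRIES", "1000")),
        "auto_cleanup": _env_bool("FIRECRAWL_CACHE_AUTO_CLEANUP", True),
    }
```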

CLI commands:

  • pdd firecrawl-cache stats: View cache statistics
  • pdd firecrawl-cache clear: Clear all cached entries
  • pdd firecrawl-cache info: Show configuration
  • pdd firecrawl-cache check --url <url>: Check specific URL

Benefits:

  • Significant reduction in API credit usage
  • Faster response times for cached content
  • Improved reliability with offline capability
  • Transparent integration with existing <web> tags
  • Comprehensive management through CLI tools

@mathurshrenya mathurshrenya marked this pull request as draft September 24, 2025 03:02
@gltanaka
Contributor

target 10/17

@gltanaka
Contributor

target 10/20

@mathurshrenya
Contributor Author

Target 10/22

