240 changes: 240 additions & 0 deletions docs/firecrawl-caching.md
@@ -0,0 +1,240 @@
# Firecrawl Caching

This document describes the Firecrawl caching functionality implemented to address issue #46: "Cache firecrawl results so it doesn't use up the API credit".

## Overview

The Firecrawl caching system reduces API credit usage by storing scraped web content locally and serving cached results for subsequent requests within the configured cache duration. This is particularly useful for:

- Development and testing environments
- Documentation sites that don't change frequently
- Bulk processing jobs
- Reducing costs for repeated scraping of the same URLs

## Features

### Core Functionality
- **Persistent Caching**: SQLite-based storage for reliable caching across sessions
- **Configurable TTL**: Set cache duration per request or globally
- **URL Normalization**: Consistent cache keys by normalizing URLs (tracking parameters removed, scheme and host lowercased); see the sketch after this list
- **Automatic Cleanup**: Expired entries are automatically removed
- **Size Management**: Configurable limits on cache size and number of entries
- **Access Tracking**: Monitor cache usage and efficiency
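
URL normalization is what makes cache keys consistent across superficially different URLs. Below is a minimal sketch of the idea; the actual implementation, and in particular the tracking-parameter list, is an assumption here and may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Tracking parameters assumed for illustration; the real list may differ.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    parts = urlparse(url)
    # Drop tracking parameters and sort the rest so key order is stable
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in TRACKING_PARAMS
    ))
    # Scheme and host are case-insensitive; path and query are not.
    # The fragment is dropped because it never reaches the server.
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.params, query, ""))
```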

### Integration
- **Seamless Integration**: Works transparently with existing `<web>` tags in prompts
- **Firecrawl API Integration**: Uses Firecrawl's built-in `maxAge` parameter for server-side caching
- **Dual-Layer Caching**: Combines client-side and server-side caching for maximum efficiency

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FIRECRAWL_CACHE_ENABLE` | `true` | Enable/disable caching |
| `FIRECRAWL_CACHE_TTL_HOURS` | `24` | Default cache TTL in hours |
| `FIRECRAWL_CACHE_MAX_SIZE_MB` | `100` | Maximum cache size in MB |
| `FIRECRAWL_CACHE_MAX_ENTRIES` | `1000` | Maximum number of cache entries |
| `FIRECRAWL_CACHE_AUTO_CLEANUP` | `true` | Enable automatic cleanup |
| `FIRECRAWL_API_KEY` | Required | Firecrawl API key for scraping |

### Example Configuration

```bash
# Enable caching with 48-hour TTL
export FIRECRAWL_CACHE_TTL_HOURS=48

# Set cache size limit to 200MB
export FIRECRAWL_CACHE_MAX_SIZE_MB=200

# Disable caching for real-time data
export FIRECRAWL_CACHE_ENABLE=false
```

## Usage

### Automatic Usage

The caching system works automatically with existing `<web>` tags in your prompts:

```prompt
# This will use cached content if available
<web>https://docs.example.com/api-reference</web>
```

### CLI Commands

#### View Cache Statistics
```bash
pdd firecrawl-cache stats
```

Shows:
- Total and active cache entries
- Cache size and efficiency
- Access patterns and statistics

#### Clear Cache
```bash
pdd firecrawl-cache clear
```

Removes all cached entries (with confirmation prompt).

#### Check Specific URL
```bash
pdd firecrawl-cache check --url https://example.com
```

Shows whether a specific URL is cached and displays a preview of the cached content.

#### View Configuration
```bash
pdd firecrawl-cache info
```

Displays current cache configuration and environment variables.

### Programmatic Usage

```python
from pdd.firecrawl_cache import get_firecrawl_cache

# Get cache instance
cache = get_firecrawl_cache()

# Check if URL is cached
content = cache.get("https://example.com")
if content is None:
    # URL not cached, would need to scrape
    pass

# Cache content manually
cache.set("https://example.com", "web content", ttl_hours=12)

# Get cache statistics
stats = cache.get_stats()
print(f"Cache efficiency: {stats['active_entries']}/{stats['total_entries']}")
```

## Cache Storage

### Location
- **Default**: `{project_root}/cache/firecrawl/firecrawl_cache.db`
- **Custom**: Set via `FirecrawlCache(cache_dir="/path/to/cache")`

### Database Schema
```sql
CREATE TABLE cache_entries (
    url_hash      TEXT PRIMARY KEY,   -- SHA256 hash of normalized URL
    url           TEXT NOT NULL,      -- Original URL
    content       TEXT NOT NULL,      -- Cached content
    timestamp     REAL NOT NULL,      -- When cached
    expires_at    REAL NOT NULL,      -- When the entry expires
    content_hash  TEXT NOT NULL,      -- MD5 hash of content
    metadata      TEXT NOT NULL,      -- JSON metadata
    access_count  INTEGER DEFAULT 0,  -- Number of accesses
    last_accessed REAL DEFAULT 0      -- Last access time
);
```
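
The `url_hash` primary key is a SHA256 digest of the normalized URL. A sketch of the derivation, reusing the `normalize_url` helper sketched in the Features section:

```python
import hashlib

def cache_key(url: str) -> str:
    # Matches the url_hash column above: SHA256 of the normalized URL
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```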

## Best Practices

### When to Use Caching
- **Static Content**: Documentation, API references, articles
- **Development**: Testing with the same URLs repeatedly
- **Bulk Processing**: Scraping multiple pages from the same site
- **Cost Optimization**: Reducing API credit usage

### When NOT to Use Caching
- **Real-time Data**: Stock prices, live scores, breaking news
- **Frequently Updated Content**: Social media feeds, dynamic dashboards
- **Time-sensitive Information**: Where freshness is critical

### Cache Duration Guidelines
- **Documentation**: 24-168 hours (1-7 days)
- **API References**: 24-72 hours (1-3 days)
- **News Articles**: 1-6 hours
- **Static Pages**: 168+ hours (7+ days)
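
For example, a documentation-heavy project could raise the default TTL to a week:

```bash
# Documentation changes rarely, so a 7-day TTL is usually safe
export FIRECRAWL_CACHE_TTL_HOURS=168
```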

## Performance Impact

### Benefits
- **Reduced API Costs**: Significant savings on repeated requests
- **Faster Response Times**: Cached content loads instantly
- **Reduced Network Usage**: Less bandwidth consumption
- **Improved Reliability**: Works offline for cached content

### Overhead
- **Storage Space**: Cache database grows over time
- **Cold Start**: The first request for each URL still requires an API call
- **Memory Usage**: Minimal impact on application memory

## Troubleshooting

### Common Issues

#### Cache Not Working
1. Check if caching is enabled: `pdd firecrawl-cache info`
2. Verify environment variables are set correctly
3. Check cache directory permissions

#### High Storage Usage
1. Reduce `FIRECRAWL_CACHE_MAX_SIZE_MB`
2. Lower `FIRECRAWL_CACHE_MAX_ENTRIES`
3. Clear cache: `pdd firecrawl-cache clear`

#### Stale Content
1. Reduce `FIRECRAWL_CACHE_TTL_HOURS`
2. Clear specific entries or entire cache
3. Use `maxAge=0` in Firecrawl API calls for fresh content
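
The third option talks to Firecrawl directly. A hedged sketch with the `firecrawl-py` client follows; the exact client signature and parameter spelling vary between SDK versions, so treat it as illustrative.

```python
# Illustrative only: force a fresh scrape by disabling Firecrawl's
# server-side cache for one request. Signature may differ by SDK version.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY")
result = app.scrape_url(
    "https://example.com",
    params={"maxAge": 0},  # 0 = never serve a server-side cached copy
)
```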

### Debug Information

```bash
# View detailed cache statistics
pdd firecrawl-cache stats

# Check cache configuration
pdd firecrawl-cache info

# Test specific URL
pdd firecrawl-cache check --url https://example.com
```

## Implementation Details

### Architecture
- **Client-side Cache**: SQLite database for persistent storage
- **Server-side Cache**: Firecrawl's built-in caching via `maxAge` parameter
- **URL Normalization**: Consistent cache keys across requests
- **Automatic Cleanup**: Background maintenance of cache health
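
Put together, the read path looks roughly like the sketch below. `scrape_with_firecrawl` is a hypothetical placeholder for the actual API call, not a real helper in this codebase.

```python
from pdd.firecrawl_cache import get_firecrawl_cache

def fetch(url: str, ttl_hours: int = 24) -> str:
    """Rough sketch of the dual-layer read path."""
    cache = get_firecrawl_cache()
    content = cache.get(url)       # 1. client-side SQLite lookup
    if content is not None:
        return content             # cache hit: no API credit spent
    # 2. hypothetical scrape; server-side caching is handled via maxAge
    content = scrape_with_firecrawl(url)
    cache.set(url, content, ttl_hours=ttl_hours)  # 3. persist for next time
    return content
```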

### Security Considerations
- Cache content is stored in plain text
- URLs are normalized but original URLs are preserved
- No authentication or encryption for cached data
- Consider cache location security for sensitive content

### Future Enhancements
- Compression for large cached content
- Cache warming strategies
- Distributed caching support
- Cache analytics and reporting
- Integration with other caching systems

## Related Issues

This implementation addresses:
- **Issue #46**: Cache firecrawl results so it doesn't use up the API credit

## Contributing

To contribute to the Firecrawl caching functionality:

1. Follow the existing code style and patterns
2. Add tests for new functionality
3. Update documentation for any changes
4. Consider backward compatibility
5. Test with various URL patterns and content types
30 changes: 30 additions & 0 deletions pdd/cli.py
@@ -1540,6 +1540,36 @@ def install_completion_cmd(ctx: click.Context) -> None:  # Return type remains None
# Do not return anything, as the callback expects None or a tuple


@cli.group("firecrawl-cache")
def firecrawl_cache_group():
"""Manage Firecrawl web scraping cache to reduce API credit usage."""
pass

@firecrawl_cache_group.command("stats")
def firecrawl_cache_stats():
"""Show Firecrawl cache statistics."""
from .firecrawl_cache_cli import stats
stats()

@firecrawl_cache_group.command("clear")
def firecrawl_cache_clear():
"""Clear all cached Firecrawl entries."""
from .firecrawl_cache_cli import clear
clear()

@firecrawl_cache_group.command("info")
def firecrawl_cache_info():
"""Show Firecrawl cache configuration and environment variables."""
from .firecrawl_cache_cli import info
info()

@firecrawl_cache_group.command("check")
@click.option('--url', help='URL to check in cache')
def firecrawl_cache_check(url):
"""Check if a specific URL is cached."""
from .firecrawl_cache_cli import check
check(url)

@cli.command("setup")
@click.pass_context
def setup_cmd(ctx: click.Context) -> None: