
Add caching for parsed syllabus data #7

Open

schmerl wants to merge 1 commit into main from feature/syllabus-cache

Conversation

schmerl (Contributor) commented Dec 4, 2025

Overview

Implements persistent file-based caching for ParsedSyllabus objects to dramatically improve performance and reduce redundant LLM API calls.

Fixes #6

Key Features

  • Content-based caching using SHA-256 hash of raw PDF bytes
  • Pickle serialization for efficient Python object storage
  • System cache directory: ~/.cache/syllabusmcp/ on macOS/Linux
  • Graceful error handling: Cache failures never break core functionality
  • Comprehensive API: get, set, invalidate, clear, list_keys, get_statistics
  • Environment variable support:
    • SYLLABUSMCP_CACHE_DIR: Override cache location
    • SYLLABUSMCP_DISABLE_CACHE: Disable caching entirely
  • Per-call control: use_cache parameter on parse_syllabus()

Performance Improvements

  • 80%+ reduction in parsing time for repeated queries
  • 100% reduction in LLM API calls for cached syllabi
  • Cache read: ~1-5ms vs LLM parsing: ~2-10 seconds

Testing

  • 16 unit tests for cache operations
  • 5 integration tests for parse_syllabus caching
  • All 21 tests passing

Files Changed

Created

  • syllabus_server/cache.py - Core cache implementation (261 lines)
  • tests/test_cache.py - Unit tests (260 lines)
  • tests/test_cache_integration.py - Integration tests (220 lines)

Modified

  • syllabus_server/server.py - Integrated cache into parse_syllabus()

Usage Example

from syllabus_server.server import parse_syllabus

# First call - parses PDF and caches result (~5 seconds)
syllabus = parse_syllabus("pdfs/17603.pdf")

# Second call - returns cached result immediately (~5ms)
syllabus = parse_syllabus("pdfs/17603.pdf")

# Force fresh parse, bypass cache
syllabus = parse_syllabus("pdfs/17603.pdf", use_cache=False)

Design Document

See comment below for detailed design documentation.

schmerl (Contributor, Author) commented Dec 4, 2025

Design Document

This PR implements the following design for syllabus caching.

Architecture

High-Level Design

┌─────────────────────────────────────────────────────────────┐
│                     SyllabusServer MCP                      │
│                                                             │
│  ┌──────────────┐         ┌───────────────┐                 │
│  │parse_syllabus│────────>│ SyllabusCache │                 │
│  │    (tool)    │         │    Manager    │                 │
│  └──────────────┘         └───────────────┘                 │
│         │                         │                         │
│         │                         ├─> get()                 │
│         │                         ├─> set()                 │
│         │                         ├─> invalidate()          │
│         │                         └─> clear()               │
│         ▼                         ▼                         │
│  ┌──────────────┐         ┌───────────────┐                 │
│  │  LLM Parser  │         │ Cache Storage │                 │
│  │   (OpenAI)   │         │  (~/.cache)   │                 │
│  └──────────────┘         └───────────────┘                 │
└─────────────────────────────────────────────────────────────┘

Component Responsibilities

1. SyllabusCache (New)

  • Core cache manager class
  • Handles all cache operations (get, set, invalidate, clear)
  • Manages cache directory and file operations
  • Provides statistics and inspection capabilities

2. parse_syllabus() (Modified)

  • Check cache before parsing
  • On cache hit: return cached ParsedSyllabus
  • On cache miss: parse via LLM, cache result, return

3. Cache Storage (New)

  • File-based storage using pickle
  • One file per cached syllabus
  • Content-addressable via SHA-256 hash

Module Structure

syllabus_server/
├── __init__.py
├── models.py              # Existing: ParsedSyllabus, etc.
├── pdf_utils.py           # Existing: extract_pdf_pages(), _load_pdf_path()
├── server.py              # Modified: integrate cache into parse_syllabus()
└── cache.py               # NEW: SyllabusCache implementation

Cache Key Generation

Algorithm: SHA-256 hash of raw PDF bytes

Rationale:

  • Content-based caching ensures same content = same key
  • Independent of filename or URL
  • SHA-256 provides strong collision resistance
  • Reuses _load_pdf_path() for URL handling

File Storage Structure

~/.cache/syllabusmcp/
└── syllabi/
    ├── a1b2c3d4e5f6...0123.pkl  # Cached syllabus (pickle file)
    ├── f7e8d9c0b1a2...4567.pkl
    └── ...

Each cache file is named with the SHA-256 hash (64 hex characters).
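With this layout, inspection operations like `list_keys` and `get_statistics` reduce to a directory scan; a sketch under that assumption (signatures are illustrative, not the PR's exact API):

```python
from pathlib import Path


def list_keys(cache_dir) -> list:
    """Return the SHA-256 key of every cached syllabus (the .pkl file stems)."""
    return sorted(p.stem for p in Path(cache_dir).glob("*.pkl"))


def get_statistics(cache_dir) -> dict:
    """Summarize the cache directory: entry count and total size on disk."""
    files = list(Path(cache_dir).glob("*.pkl"))
    return {
        "entries": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```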

Integration Flow

Before (No Cache)

def parse_syllabus(pdf_path_or_url: str) -> ParsedSyllabus:
    pages = extract_pdf_pages(pdf_path_or_url)
    # ... LLM parsing (~2-10 seconds) ...
    return parsed

After (With Cache)

_cache = SyllabusCache()

def parse_syllabus(pdf_path_or_url: str, use_cache: bool = True) -> ParsedSyllabus:
    # Check cache first
    if use_cache:
        cached = _cache.get(pdf_path_or_url)  # ~1-5ms
        if cached:
            return cached
    
    # Cache miss - parse via LLM
    pages = extract_pdf_pages(pdf_path_or_url)
    # ... LLM parsing (~2-10 seconds) ...
    parsed = ParsedSyllabus(...)
    
    # Cache the result
    if use_cache:
        _cache.set(pdf_path_or_url, parsed)
    
    return parsed

Error Handling

Principle: Graceful Degradation

  • Cache read errors → treat as cache miss
  • Cache write errors → log but continue
  • Corrupted cache files → skip and continue
  • Cache never breaks core functionality
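A hedged sketch of what this degradation policy can look like in code (the helper names `safe_get`/`safe_set` are illustrative, not the PR's internals):

```python
import logging
import pickle
from pathlib import Path

log = logging.getLogger("syllabusmcp.cache")


def safe_get(path: Path):
    """Read a cache entry; any failure (missing, corrupt, unpicklable)
    is reported as a cache miss instead of raising."""
    try:
        return pickle.loads(path.read_bytes())
    except (OSError, pickle.UnpicklingError, EOFError, AttributeError) as exc:
        log.debug("cache read failed for %s: %s", path, exc)
        return None


def safe_set(path: Path, value) -> None:
    """Write a cache entry; failures are logged and swallowed so the
    caller's result is still returned normally."""
    try:
        path.write_bytes(pickle.dumps(value))
    except OSError as exc:
        log.warning("cache write failed for %s: %s", path, exc)
```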

Configuration

Cache Location

  • Default: ~/.cache/syllabusmcp/ (macOS/Linux)
  • Override: SYLLABUSMCP_CACHE_DIR environment variable

Disable Caching

  • Global: SYLLABUSMCP_DISABLE_CACHE=1 environment variable
  • Per-call: parse_syllabus(..., use_cache=False) parameter
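The configuration lookup might resolve roughly as below (helper names are illustrative; treating any non-empty `SYLLABUSMCP_DISABLE_CACHE` value as "disabled" is an assumption of this sketch, while the PR documents `=1`):

```python
import os
from pathlib import Path


def resolve_cache_dir() -> Path:
    """SYLLABUSMCP_CACHE_DIR wins if set; otherwise fall back to the
    default ~/.cache/syllabusmcp location."""
    override = os.environ.get("SYLLABUSMCP_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "syllabusmcp"


def caching_disabled() -> bool:
    """Global kill switch: a non-empty SYLLABUSMCP_DISABLE_CACHE turns
    caching off for every call."""
    return bool(os.environ.get("SYLLABUSMCP_DISABLE_CACHE"))
```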

Performance Characteristics

Operation                 Time
─────────────────────     ──────────────────────────────
Cache key computation     ~10-50ms (PDF size dependent)
Cache read                ~1-5ms (pickle deserialization)
Cache write               ~5-10ms (pickle serialization)
LLM parsing (baseline)    ~2-10 seconds

Result: per the timings above, cache hits are roughly 400-10,000x faster than a fresh LLM parse.

Testing Strategy

Unit Tests (16 tests)

  • Cache miss/hit behavior
  • Key consistency and content hashing
  • Invalidation and clearing
  • Statistics and management
  • Corrupted file handling
  • Environment variable configuration

Integration Tests (5 tests)

  • parse_syllabus() caching behavior
  • Cache disable functionality
  • Persistence across multiple calls
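Two of the behaviors listed above, the set/get roundtrip and corrupted-file handling, can be sketched as one self-contained pytest-style test (names and details are illustrative, not the PR's actual test code):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def test_cache_roundtrip_and_corruption():
    """Roundtrip: stored value comes back intact.
    Corruption: an unreadable entry degrades to a miss, never an error."""
    with tempfile.TemporaryDirectory() as d:
        key = hashlib.sha256(b"%PDF-1.7 fake bytes").hexdigest()
        path = Path(d) / f"{key}.pkl"

        # Roundtrip: what we store is what we get back.
        path.write_bytes(pickle.dumps({"course": "Example 101"}))
        assert pickle.loads(path.read_bytes()) == {"course": "Example 101"}

        # Corruption: an invalid pickle is treated as a cache miss.
        path.write_bytes(b"\x00 corrupt")
        try:
            hit = pickle.loads(path.read_bytes())
        except (pickle.UnpicklingError, EOFError):
            hit = None
        assert hit is None
```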

Future Enhancements (Not in v1)

  • Cache versioning for schema changes
  • Cache size limits and LRU eviction
  • Time-based cache expiry
  • SQLite metadata index for fast queries
  • File locking for concurrent access
  • MCP tool integration for external control
