
feat: query result caching with plan-based hashing #129

@zfarrell

Description

Problem

Query results are already persisted, but a stored result is only served when it is explicitly requested. Identical (or logically equivalent) queries always re-execute, even when an existing result could be reused, and result storage performs no deduplication.

Goals

  1. Serve cached results for repeat/equivalent queries without re-execution
  2. Deduplicate stored result parquet files so logically identical queries share one result

Design

Plan-based hashing instead of SQL text hashing. After DataFusion produces the optimized logical plan, hash its canonical representation (LogicalPlan::display_indent()). This normalizes away SQL formatting, comments, aliases, and trivially reordered predicates. The existing sql_hash / SqlSnapshot mechanism stays for saved-query versioning; this is a new, separate plan_hash on query results.
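A minimal sketch of the hashing step. The real implementation would feed `plan.display_indent().to_string()` from the DataFusion optimized plan into SHA-256; to keep this dependency-free, std's `DefaultHasher` stands in, and the `plan_hash` function name and plan strings are illustrative only:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a canonical plan representation (illustrative stand-in:
/// the design calls for SHA-256 over `LogicalPlan::display_indent()`).
fn plan_hash(canonical_plan: &str) -> String {
    let mut h = DefaultHasher::new();
    canonical_plan.hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    // Two differently formatted SQL texts that optimize to the same
    // logical plan yield the same canonical string, hence the same hash.
    let plan_a = "Projection: t.id\n  TableScan: t";
    let plan_b = "Projection: t.id\n  TableScan: t"; // e.g. from `SELECT id FROM t -- comment`
    assert_eq!(plan_hash(plan_a), plan_hash(plan_b));
}
```

This is why trivia like whitespace, comments, or aliasing never cause a cache miss: they are gone by the time the optimizer emits the plan.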

Invalidation via sync events, not query-time freshness checks. When refresh_table_data or refresh_schema completes for a table, invalidate (mark expired) all cached results whose logical plan references that table. This piggybacks on the existing atomic-swap refresh path — zero cost at query time. Store table_refs alongside each cached result to enable this lookup.

Configurable TTL as a fallback. For connections with frequently-changing backing data (or sources where refresh events aren't triggered), support an optional per-connection result_cache_ttl (e.g. 15m, 1h). When set, results expire after that duration regardless of sync events. When unset, results live until explicitly invalidated by a refresh. Default: unset (invalidation-driven).
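The expiry rule above can be sketched as a single predicate. The `is_expired` function and its parameter names are hypothetical, not the actual implementation; it only illustrates the precedence: explicit invalidation always wins, TTL applies only when configured, and an unset TTL means invalidation-driven only:

```rust
use std::time::{Duration, SystemTime};

/// Illustrative expiry check: an entry is expired if a refresh marked it so,
/// or if the connection configures `result_cache_ttl` and the entry is older.
fn is_expired(
    explicitly_expired: bool,
    cached_at: SystemTime,
    result_cache_ttl: Option<Duration>, // None = invalidation-driven only
    now: SystemTime,
) -> bool {
    if explicitly_expired {
        return true;
    }
    match result_cache_ttl {
        Some(ttl) => now.duration_since(cached_at).map_or(false, |age| age > ttl),
        None => false, // lives until a refresh invalidates it
    }
}

fn main() {
    let t0 = SystemTime::UNIX_EPOCH;
    let fifteen_min = Duration::from_secs(15 * 60);
    // Older than the 15m TTL -> expired, even without a refresh event.
    assert!(is_expired(false, t0, Some(fifteen_min), t0 + Duration::from_secs(16 * 60)));
    // No TTL configured -> only explicit invalidation expires it.
    assert!(!is_expired(false, t0, None, t0 + Duration::from_secs(86_400)));
    assert!(is_expired(true, t0, None, t0));
}
```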

Distributed via Redis. Cache index (plan_hash → result location, table_refs, expiry state) lives in Redis, consistent with the existing CachingManager pattern. Result parquet files are already written to shared storage (S3/filesystem). This means any engine instance can serve a cached result produced by another instance. Invalidation on refresh becomes a Redis key update, visible cluster-wide immediately.

Key fields (stored in Redis, keyed by plan_hash)

plan_hash: String          // SHA-256 of canonical logical plan
result_path: String        // parquet location in shared storage
table_refs: Vec<TableRef>  // tables referenced, for invalidation
cached_at: DateTime<Utc>
expired: bool              // set true on refresh or TTL expiry

Query-time flow

  1. Parse SQL, resolve catalogs, produce optimized logical plan
  2. Compute plan_hash
  3. Look up non-expired entry in Redis with matching plan_hash
    • Hit: return result from shared storage (skip execution + persistence)
    • Miss: execute, persist to shared storage, write cache entry to Redis
  4. For concurrent duplicate queries: use a Redis lock on plan_hash so only one instance executes; others wait and then read the cached result
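The hit/miss flow above, sketched with an in-memory map standing in for the Redis index and a process-local mutex standing in for the Redis lock. All names (`ResultCache`, `get_or_execute`, the result path) are illustrative; in particular, holding the map lock across execution only approximates step 4's cross-instance Redis lock:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone)]
struct CacheEntry {
    result_path: String, // parquet location in shared storage
    expired: bool,
}

struct ResultCache {
    // plan_hash -> entry; the real design keeps this index in Redis.
    index: Mutex<HashMap<String, CacheEntry>>,
}

impl ResultCache {
    fn get_or_execute(&self, plan_hash: &str, execute: impl FnOnce() -> String) -> String {
        // Step 3: look up a non-expired entry with a matching plan_hash.
        {
            let index = self.index.lock().unwrap();
            if let Some(e) = index.get(plan_hash) {
                if !e.expired {
                    return e.result_path.clone(); // hit: skip execution + persistence
                }
            }
        }
        // Miss: execute once, persist, record. Holding the lock around
        // execution mimics the Redis lock so concurrent duplicates wait
        // and then read the cached result (sketch-only approximation).
        let mut index = self.index.lock().unwrap();
        if let Some(e) = index.get(plan_hash) {
            if !e.expired {
                return e.result_path.clone(); // another caller filled it in
            }
        }
        let result_path = execute(); // run the query, write parquet
        index.insert(
            plan_hash.to_string(),
            CacheEntry { result_path: result_path.clone(), expired: false },
        );
        result_path
    }
}

fn main() {
    let cache = ResultCache { index: Mutex::new(HashMap::new()) };
    let mut executions = 0;
    let p1 = cache.get_or_execute("abc", || { executions += 1; "results/abc.parquet".into() });
    let p2 = cache.get_or_execute("abc", || { executions += 1; "results/abc2.parquet".into() });
    assert_eq!(executions, 1); // second call is served from the cache
    assert_eq!(p1, p2);        // and both callers see the same parquet
}
```

The double-check after acquiring the lock is what makes concurrent duplicates cheap: whichever caller wins executes once, and every waiter re-reads the index instead of re-running the query.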

Invalidation flow

  • On refresh_table_data(conn, schema, table): mark all Redis entries referencing that table as expired = true (reverse index: table → [plan_hash])
  • Background TTL sweep (runs on stale_result_cleanup_interval): expire entries older than their connection's result_cache_ttl, if configured
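The refresh-driven half of the flow, sketched with `HashMap`s standing in for the Redis reverse index and entry flags (the `invalidate_table` name is illustrative, not an actual API):

```rust
use std::collections::HashMap;

/// On refresh completion, flip `expired = true` for every cached result
/// whose plan references the refreshed table, via the reverse index.
fn invalidate_table(
    table: &str,
    table_to_hashes: &HashMap<String, Vec<String>>, // table -> [plan_hash]
    expired: &mut HashMap<String, bool>,            // plan_hash -> expired flag
) {
    if let Some(hashes) = table_to_hashes.get(table) {
        for h in hashes {
            expired.insert(h.clone(), true);
        }
    }
}

fn main() {
    let mut reverse = HashMap::new();
    reverse.insert("orders".to_string(), vec!["h1".to_string(), "h2".to_string()]);
    let mut expired: HashMap<String, bool> = [("h1", false), ("h2", false), ("h3", false)]
        .into_iter()
        .map(|(k, v)| (k.to_string(), v))
        .collect();
    // refresh_table_data completed for `orders`:
    invalidate_table("orders", &reverse, &mut expired);
    assert!(expired["h1"] && expired["h2"]);
    assert!(!expired["h3"]); // untouched: its plan doesn't reference orders
}
```

Because the reverse index is consulted only when a refresh completes, queries that hit the cache never pay for a freshness check.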

Deduplication

Results are keyed by plan_hash. A second execution of a logically equivalent query — from any instance — reuses the same result parquet rather than writing a duplicate.

Out of scope

  • Partial result caching / materialized views
  • Cache size limits / eviction (follow-up)
