Problem
Query results are persisted, but they are only reused when explicitly requested. Identical (or logically equivalent) queries always re-execute even though an existing result could be served, and result storage performs no deduplication.
Goals
- Serve cached results for repeat/equivalent queries without re-execution
- Deduplicate stored result parquet files so logically identical queries share one result
Design
Plan-based hashing instead of SQL text hashing. After DataFusion produces the optimized logical plan, hash its canonical representation (LogicalPlan::display_indent()). This normalizes away SQL formatting, comments, aliases, and trivially reordered predicates. The existing sql_hash / SqlSnapshot mechanism stays for saved-query versioning; this is a new, separate plan_hash on query results.
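The hashing step can be sketched as follows. The input string stands for the output of `LogicalPlan::display_indent()`, and std's `DefaultHasher` is used here only as a dependency-free stand-in for the SHA-256 digest the design calls for (`plan_hash` is an illustrative function name):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the canonical textual form of an optimized logical plan.
/// `canonical_plan` is assumed to be the output of
/// LogicalPlan::display_indent(), which already normalizes away
/// SQL formatting, comments, and aliases.
/// DefaultHasher stands in for SHA-256 to keep the sketch
/// dependency-free; real code would use a sha2 digest.
pub fn plan_hash(canonical_plan: &str) -> String {
    let mut h = DefaultHasher::new();
    canonical_plan.hash(&mut h);
    format!("{:016x}", h.finish())
}
```

Two queries that optimize to the same logical plan produce the same hash, regardless of how the SQL text was written.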
Invalidation via sync events, not query-time freshness checks. When refresh_table_data or refresh_schema completes for a table, invalidate (mark expired) all cached results whose logical plan references that table. This piggybacks on the existing atomic-swap refresh path — zero cost at query time. Store table_refs alongside each cached result to enable this lookup.
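A minimal in-memory sketch of the refresh-driven invalidation, with `HashMap`s standing in for the Redis index and reverse index (`CacheIndex`, `insert`, and `invalidate_table` are illustrative names, not existing APIs):

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the Redis cache index plus the
/// reverse index (table -> plan hashes) used for invalidation.
pub struct CacheIndex {
    /// plan_hash -> expired flag (real entries carry more fields)
    entries: HashMap<String, bool>,
    /// table -> plan hashes whose logical plans reference it
    by_table: HashMap<String, HashSet<String>>,
}

impl CacheIndex {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), by_table: HashMap::new() }
    }

    /// Register a cached result along with its table_refs.
    pub fn insert(&mut self, plan_hash: &str, table_refs: &[&str]) {
        self.entries.insert(plan_hash.to_string(), false);
        for t in table_refs {
            self.by_table
                .entry(t.to_string())
                .or_default()
                .insert(plan_hash.to_string());
        }
    }

    /// Called when refresh_table_data / refresh_schema completes:
    /// mark every dependent cached result as expired.
    pub fn invalidate_table(&mut self, table: &str) {
        if let Some(hashes) = self.by_table.get(table) {
            for h in hashes {
                if let Some(expired) = self.entries.get_mut(h) {
                    *expired = true;
                }
            }
        }
    }

    /// True when the entry exists and has not been expired.
    pub fn is_live(&self, plan_hash: &str) -> bool {
        self.entries.get(plan_hash) == Some(&false)
    }
}
```

In the real design the reverse index would be a Redis structure, so a single update on refresh is visible to every engine instance.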
Configurable TTL as a fallback. For connections with frequently-changing backing data (or sources where refresh events aren't triggered), support an optional per-connection result_cache_ttl (e.g. 15m, 1h). When set, results expire after that duration regardless of sync events. When unset, results live until explicitly invalidated by a refresh. Default: unset (invalidation-driven).
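The TTL semantics reduce to a small predicate: an entry is stale only when the connection configures a `result_cache_ttl` and the entry has outlived it. A sketch (`is_ttl_expired` is a hypothetical helper; std `Instant`/`Duration` stand in for the stored timestamps):

```rust
use std::time::{Duration, Instant};

/// TTL fallback: expired only when a per-connection
/// result_cache_ttl is set and has elapsed. With no TTL
/// configured, entries live until a refresh invalidates them.
pub fn is_ttl_expired(cached_at: Instant, ttl: Option<Duration>, now: Instant) -> bool {
    match ttl {
        Some(ttl) => now.duration_since(cached_at) > ttl,
        None => false,
    }
}
```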
Distributed via Redis. Cache index (plan_hash → result location, table_refs, expiry state) lives in Redis, consistent with the existing CachingManager pattern. Result parquet files are already written to shared storage (S3/filesystem). This means any engine instance can serve a cached result produced by another instance. Invalidation on refresh becomes a Redis key update, visible cluster-wide immediately.
Key fields (stored in Redis, keyed by plan_hash):

```rust
plan_hash: String,         // SHA-256 of canonical logical plan
result_path: String,       // parquet location in shared storage
table_refs: Vec<TableRef>, // tables referenced, for invalidation
cached_at: DateTime<Utc>,
expired: bool,             // set true on refresh or TTL expiry
```
Query-time flow
- Parse SQL, resolve catalogs, produce the optimized logical plan
- Compute plan_hash
- Look up a non-expired entry in Redis with matching plan_hash
- Hit: return the result from shared storage (skip execution and persistence)
- Miss: execute, persist to shared storage, write a cache entry to Redis
- For concurrent duplicate queries: take a Redis lock on plan_hash so only one instance executes; others wait and then read the cached result
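The hit/miss decision above can be sketched like this, with a `HashMap` standing in for the Redis index and a closure standing in for execute-and-persist (all names are illustrative; the Redis lock for concurrent duplicates is omitted for brevity):

```rust
use std::collections::HashMap;

/// Per-plan_hash cache entry, mirroring the key fields above.
#[derive(Clone)]
pub struct CacheEntry {
    pub result_path: String,
    pub expired: bool,
}

/// Return a cached result path on hit; on miss (or expired entry),
/// run the query, persist it, and record the new entry.
pub fn get_or_execute<F>(
    index: &mut HashMap<String, CacheEntry>,
    plan_hash: &str,
    execute_and_persist: F,
) -> String
where
    F: FnOnce() -> String, // returns the parquet path it wrote
{
    // Hit: a non-expired entry lets us skip execution entirely.
    if let Some(e) = index.get(plan_hash) {
        if !e.expired {
            return e.result_path.clone();
        }
    }
    // Miss: execute, persist to shared storage, write the entry.
    let path = execute_and_persist();
    index.insert(
        plan_hash.to_string(),
        CacheEntry { result_path: path.clone(), expired: false },
    );
    path
}
```

A repeat query with the same plan_hash returns the stored path without invoking the executor again.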
Invalidation flow
- On refresh_table_data(conn, schema, table): mark all Redis entries referencing that table as expired = true (reverse index: table → [plan_hash])
- Background TTL sweep (runs on stale_result_cleanup_interval): expire entries older than their connection's result_cache_ttl, if configured
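One pass of the background sweep might look like the following sketch (`SweepEntry` and `sweep` are illustrative names; real entries live in Redis, and `Instant` stands in for the stored cached_at timestamps):

```rust
use std::time::{Duration, Instant};

/// Minimal view of a cache entry as the sweep sees it.
pub struct SweepEntry {
    pub cached_at: Instant,
    pub ttl: Option<Duration>, // the connection's result_cache_ttl
    pub expired: bool,
}

/// Mark entries expired once they outlive their connection's TTL.
/// Entries whose connection has no TTL configured are left alone.
/// Returns how many entries this pass expired.
pub fn sweep(entries: &mut [SweepEntry], now: Instant) -> usize {
    let mut newly_expired = 0;
    for e in entries.iter_mut() {
        if e.expired {
            continue;
        }
        if let Some(ttl) = e.ttl {
            if now.duration_since(e.cached_at) > ttl {
                e.expired = true;
                newly_expired += 1;
            }
        }
    }
    newly_expired
}
```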
Deduplication
Results are keyed by plan_hash. A second execution of a logically-equivalent query — from any instance — reuses the same result parquet rather than writing a duplicate.
Out of scope
- Partial result caching / materialized views
- Cache size limits / eviction (follow-up)