-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Recursive wildcard steps (**) currently traverse all descendants. On deep/wide trees this dominates runtime. We need a fast, safe way to skip subtrees that cannot possibly contain certain field names (e.g., when a query step is **.e or **.(e|k)).
Proposed direction
Add a lightweight may-have capability to ModelNode and use it to prune during ** traversal whenever the next step constrains field names.
// default: true (no pruning) so legacy models stay correct
virtual bool mayHave(StringId field) const noexcept;In the evaluator, before recursing into a child under **, check mayHave(fieldId) (any-of for multiple candidate names). If unconstrained (pure **), do nothing.
Design options (pick 1 now, keep the API stable)
-
Per-object shared “may-have list” (deterministic)
- Each
Objectholds a small handle (e.g., 16-bit id) to an interned set of all field names reachable in its subtree. - Built bottom-up; identical sets are shared.
- Pros: simple, no false positives, great for future completions.
- Cons: +2 bytes per object (handle) + storage for unique sets; maintenance on updates.
- Each
-
Per-object Bloom filter (compact, probabilistic)
- Replace the set with a fixed-size Bloom filter OR’ed from children + own names.
- Pros: constant memory per unique set, very fast checks.
- Cons: false positives ⇒ less pruning; tune bits/entry to keep FPR low.
-
Global structural summary (“index object” / DataGuide-like)
- Build a de-duplicated graph of object “shapes”; map each runtime object to a shape id and prune via the summary.
- Pros: powerful for completions and introspection.
- Cons: higher complexity to maintain mappings/overlays; bigger upfront work.
Recommended first step
Implement Option 1 behind a feature flag (SIMFIL_MAYHAVE_PRUNE):
- Add
mayHavetoModelNode;Objectanswers via its shared set. - Bottom-up builder that assigns/ interns may-have sets; cap at 65k unique sets (fall back to “unknown” if exceeded).
- Guard
**recursion withmayHavewhen the next step names fields.
Benchmarks & diagnostics
- Measure nodes visited, eval time, and memory vs baseline on synthetic trees and real models.
- REPL: show visited/pruned counters; optional dump of a node’s may-have set for debugging.
Compatibility
- If metadata is absent (old models / cap exceeded), behavior is identical to today (always
true). - No false negatives with Option 1; Option 2 trades memory for a tunable false-positive rate.
Future work
- Pluggable backend (swap sets ↔ Bloom filter) under the same
mayHaveAPI. - Optional global summary for cross-model completions and schema export.
References (clickable)
- DataGuides (structural summaries for semistructured DBs). Goldman & Widom, VLDB 1997. [PDF] [alt]
- XPath Accelerator (indexing descendant axes). Grust, 2002. [PDF] [DOI]
- Bloom filters (foundations). Bloom, CACM 1970. [DOI] [PDF mirror]
- Bloom filters in Parquet (predicate pushdown). Apache Parquet docs. [Docs]
- Hierarchical Bloom filters. Koloniari & Pitoura, WDAS 2003 (XML trees). [PDF]
- Bloofi (hierarchical index over many Bloom filters). Crainiceanu et al., CIKM 2013. [DOI] [slides]