Skip to content

Improve Wildcard Query Performance #129

@johannes-wolf

Description

@johannes-wolf

Recursive wildcard steps (**) currently traverse all descendants. On deep/wide trees this dominates runtime. We need a fast, safe way to skip subtrees that cannot possibly contain certain field names (e.g., when a query step is **.e or **.(e|k)).

Proposed direction

Add a lightweight may-have capability to ModelNode and use it to prune during ** traversal whenever the next step constrains field names.

// default: true (no pruning) so legacy models stay correct
virtual bool mayHave(StringId field) const noexcept;

In the evaluator, before recursing into a child under **, check mayHave(fieldId) (any-of for multiple candidate names). If unconstrained (pure **), do nothing.

Design options (pick 1 now, keep the API stable)

  1. Per-object shared “may-have list” (deterministic)

    • Each Object holds a small handle (e.g., 16-bit id) to an interned set of all field names reachable in its subtree.
    • Built bottom-up; identical sets are shared.
    • Pros: simple, no false positives, great for future completions.
    • Cons: +2 bytes per object (handle) + storage for unique sets; maintenance on updates.
  2. Per-object Bloom filter (compact, probabilistic)

    • Replace the set with a fixed-size Bloom filter OR’ed from children + own names.
    • Pros: constant memory per unique set, very fast checks.
    • Cons: false positives ⇒ less pruning; tune bits/entry to keep FPR low.
  3. Global structural summary (“index object” / DataGuide-like)

    • Build a de-duplicated graph of object “shapes”; map each runtime object to a shape id and prune via the summary.
    • Pros: powerful for completions and introspection.
    • Cons: higher complexity to maintain mappings/overlays; bigger upfront work.

Recommended first step

Implement Option 1 behind a feature flag (SIMFIL_MAYHAVE_PRUNE):

  • Add mayHave to ModelNode; Object answers via its shared set.
  • Bottom-up builder that assigns/ interns may-have sets; cap at 65k unique sets (fall back to “unknown” if exceeded).
  • Guard ** recursion with mayHave when the next step names fields.

Benchmarks & diagnostics

  • Measure nodes visited, eval time, and memory vs baseline on synthetic trees and real models.
  • REPL: show visited/pruned counters; optional dump of a node’s may-have set for debugging.

Compatibility

  • If metadata is absent (old models / cap exceeded), behavior is identical to today (always true).
  • No false negatives with Option 1; Option 2 trades memory for a tunable false-positive rate.

Future work

  • Pluggable backend (swap sets ↔ Bloom filter) under the same mayHave API.
  • Optional global summary for cross-model completions and schema export.

References (clickable)

  • DataGuides (structural summaries for semistructured DBs). Goldman & Widom, VLDB 1997. [PDF] [alt]
  • XPath Accelerator (indexing descendant axes). Grust, 2002. [PDF] [DOI]
  • Bloom filters (foundations). Bloom, CACM 1970. [DOI] [PDF mirror]
  • Bloom filters in Parquet (predicate pushdown). Apache Parquet docs. [Docs]
  • Hierarchical Bloom filters. Koloniari & Pitoura, WDAS 2003 (XML trees). [PDF]
  • Bloofi (hierarchical index over many Bloom filters). Crainiceanu et al., CIKM 2013. [DOI] [slides]

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions