Improve Wildcard Query Performance #129

Recursive wildcard steps (`**`) currently traverse *all* descendants. On deep or wide trees this traversal dominates runtime. We need a fast, safe way to **skip subtrees that cannot contain the field names the next step asks for** (e.g., when a query step is `**.e` or `**.(e|k)`).

### Proposed direction
Add a lightweight **may-have** capability to `ModelNode` and use it to prune during `**` traversal whenever the next step constrains field names.

```cpp
// default: true (no pruning) so legacy models stay correct
virtual bool mayHave(StringId field) const noexcept;
```

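In the evaluator, before recursing into a child under `**`, check `mayHave(fieldId)` on that child (any-of when the next step lists several candidate names) and skip the child if the answer is `false`. If the next step does not constrain field names (pure `**`), skip the check and traverse as today.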
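A minimal sketch of that guard, assuming a simplified stand-in for `ModelNode`. `Node`, `Stats`, `mayHaveAny`, and `descend` are hypothetical names used only for illustration, and `StringId` is assumed to be a small integer id. It handles the `**.<names>` case and counts visited and pruned nodes so the effect is observable.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <utility>
#include <vector>

using StringId = std::uint16_t;  // assumption: field names are interned to small ids

// Hypothetical stand-in for ModelNode, reduced to what the sketch needs.
struct Node {
    std::unordered_set<StringId> reachable;  // field names occurring anywhere in this subtree
    bool hasMetadata = true;                 // false models a legacy node without may-have data
    std::vector<std::pair<StringId, std::unique_ptr<Node>>> fields;

    // Conservative default: without metadata, answer true so nothing is pruned.
    bool mayHave(StringId field) const noexcept {
        return !hasMetadata || reachable.count(field) != 0;
    }
};

struct Stats { std::size_t visited = 0, pruned = 0; };

// Any-of check for steps such as `**.(e|k)`. An empty candidate list means the
// next step does not constrain field names (pure `**`), so never prune.
inline bool mayHaveAny(const Node& n, const std::vector<StringId>& wanted) {
    if (wanted.empty()) return true;
    for (StringId id : wanted)
        if (n.mayHave(id)) return true;
    return false;
}

// Collects every value stored under a field named in `wanted`, at any depth below `n`.
void descend(const Node& n, const std::vector<StringId>& wanted,
             std::vector<const Node*>& out, Stats& stats) {
    ++stats.visited;
    for (const auto& f : n.fields) {
        for (StringId id : wanted)
            if (id == f.first) { out.push_back(f.second.get()); break; }
        if (mayHaveAny(*f.second, wanted))
            descend(*f.second, wanted, out, stats);  // subtree might contain a match
        else
            ++stats.pruned;                          // subtree cannot contain a match
    }
}
```

The parent tests the edge name itself, so a child is only entered when something *below* it could match. This is what lets leaf values be skipped without being visited.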

### Design options (pick 1 now, keep the API stable)
1. **Per-object shared “may-have list” (deterministic)**
   - Each `Object` holds a small handle (e.g., 16-bit id) to an interned set of **all field names reachable in its subtree**.
   - Built bottom-up; identical sets are shared.
   - Pros: simple, no false positives, great for future **completions**.  
   - Cons: +2 bytes per object (handle) + storage for unique sets; maintenance on updates.

2. **Per-object Bloom filter (compact, probabilistic)**
   - Replace the set with a fixed-size Bloom filter OR’ed from children + own names.
   - Pros: constant memory per unique set, very fast checks.  
   - Cons: false positives ⇒ less pruning; tune bits/entry to keep FPR low.

3. **Global structural summary (“index object” / DataGuide-like)**
   - Build a de-duplicated graph of object “shapes”; map each runtime object to a shape id and prune via the summary.
   - Pros: powerful for **completions** and introspection.  
   - Cons: higher complexity to maintain mappings/overlays; bigger upfront work.
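To make Option 2 concrete, here is a rough sketch of a fixed-size filter that is OR'ed upward from children. `Bloom64`, its 64-bit width, the two probes per name, and the hash constants are all illustrative assumptions, not a proposed format.

```cpp
#include <cstdint>

using StringId = std::uint16_t;  // assumption: field names are interned to small ids

// Hypothetical 64-bit Bloom filter with k = 2 probes per field name.
struct Bloom64 {
    std::uint64_t bits = 0;

    static std::uint64_t mask(StringId id) noexcept {
        // Two cheap multiplicative hashes; the top 6 bits select a bit position (0..63).
        const std::uint64_t h1 = (id * 0x9E3779B97F4A7C15ull) >> 58;
        const std::uint64_t h2 = ((id + 1) * 0xC2B2AE3D27D4EB4Full) >> 58;
        return (std::uint64_t{1} << h1) | (std::uint64_t{1} << h2);
    }

    // Record one of the object's own field names.
    void add(StringId id) noexcept { bits |= mask(id); }

    // Fold a child's filter into the parent: the union of sets is a bitwise OR.
    void merge(const Bloom64& child) noexcept { bits |= child.bits; }

    // false: the name is definitely absent. true: it may be present.
    bool mayHave(StringId id) const noexcept {
        const std::uint64_t m = mask(id);
        return (bits & m) == m;
    }
};
```

Because `merge` only ever sets bits, a name added anywhere in a subtree is always reported by every ancestor, so the filter can yield false positives but never false negatives.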

### Recommended first step
Implement **Option 1** behind a feature flag (`SIMFIL_MAYHAVE_PRUNE`):
- Add `mayHave` to `ModelNode`; `Object` answers via its shared set.
- Add a bottom-up builder that assigns and interns may-have sets; cap at roughly 65k unique sets (the range of a 16-bit handle) and fall back to “unknown” if the cap is exceeded.
- Guard `**` recursion with `mayHave` when the next step names fields.
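The builder step could look roughly like the sketch below. `Object` is a simplified stand-in for the real type; `SetHandle`, `kUnknown`, `MayHaveInterner`, and `build` are hypothetical names. Reserving the largest 16-bit value as the “unknown” sentinel is an assumption, and it leaves 65,535 usable handles.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using StringId = std::uint16_t;   // assumption: field names are interned to small ids
using SetHandle = std::uint16_t;  // the "+2 bytes per object"
constexpr SetHandle kUnknown = std::numeric_limits<SetHandle>::max();  // sentinel: no metadata

// Simplified stand-in for the model's Object type.
struct Object {
    std::vector<std::pair<StringId, std::unique_ptr<Object>>> fields;
    SetHandle mayHaveSet = kUnknown;
};

class MayHaveInterner {
public:
    using NameSet = std::set<StringId>;

    // Returns the handle of an identical existing set, a fresh handle,
    // or kUnknown once all usable handles are taken.
    SetHandle intern(const NameSet& names) {
        auto it = index_.find(names);
        if (it != index_.end()) return it->second;
        if (sets_.size() >= kUnknown) return kUnknown;  // cap reached: fall back
        const auto handle = static_cast<SetHandle>(sets_.size());
        sets_.push_back(names);
        index_.emplace(names, handle);
        return handle;
    }

    // Unknown handles answer true, which matches today's unpruned behavior.
    bool mayHave(SetHandle h, StringId field) const {
        return h == kUnknown || sets_[h].count(field) != 0;
    }

    std::size_t uniqueSets() const { return sets_.size(); }

private:
    std::vector<NameSet> sets_;
    std::map<NameSet, SetHandle> index_;
};

// Post-order walk: build each child first, then union the children's sets
// with the object's own field names. Returns the set for the subtree at `obj`.
MayHaveInterner::NameSet build(Object& obj, MayHaveInterner& interner) {
    MayHaveInterner::NameSet names;
    for (auto& f : obj.fields) {
        names.insert(f.first);
        const auto childNames = build(*f.second, interner);
        names.insert(childNames.begin(), childNames.end());
    }
    obj.mayHaveSet = interner.intern(names);
    return names;
}
```

Subtrees with the same reachable names share one handle, so the number of stored sets tracks the number of distinct shapes rather than the number of objects. The recursive `build` copies sets on return, which keeps the sketch short; a real builder would likely avoid those copies.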

### Benchmarks & diagnostics
- Measure nodes visited, eval time, and memory vs baseline on synthetic trees and real models.
- REPL: show visited/pruned counters; optional dump of a node’s may-have set for debugging.

### Compatibility
- If metadata is absent (old models / cap exceeded), behavior is identical to today (always `true`).
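- Neither option produces false negatives, so pruning never drops a match. Option 1 is exact (no false positives either); Option 2 trades memory for a tunable false-positive rate, which only reduces how much is pruned.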
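For sizing Option 2, the standard Bloom filter approximation gives the false-positive rate $p$ for a filter of $m$ bits holding $n$ distinct names with $k$ hash probes per name. The concrete numbers below are illustrative assumptions, not measurements.

```math
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2
```

With $m = 64$, $k = 2$, and $n = 8$ reachable names, $p \approx (1 - e^{-0.25})^2 \approx 0.049$. With $n = 32$ names in the same filter, $p \approx (1 - e^{-1})^2 \approx 0.40$. Since sets are OR'ed upward, filters near the root hold the most names and prune the least, so the bit budget should be chosen against the name counts expected high in the tree.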

### Future work
- Pluggable backend (swap sets ↔ Bloom filter) under the same `mayHave` API.
- Optional global summary for cross-model completions and schema export.

---

### References (clickable)
- **DataGuides (structural summaries for semistructured DBs).** Goldman & Widom, VLDB 1997. [[PDF]](https://www.vldb.org/conf/1997/P436.PDF)  [[alt]](https://infolab.stanford.edu/lore/pubs/dataguide_vldb97.pdf)  
- **XPath Accelerator (indexing descendant axes).** Grust, 2002. [[PDF]](https://db.cs.uni-tuebingen.de/publications/2002/accelerating-xpath-location-steps/xpath-accel.pdf)  [[DOI]](https://dl.acm.org/doi/10.1145/564691.564705)  
- **Bloom filters (foundations).** Bloom, CACM 1970. [[DOI]](https://dl.acm.org/doi/10.1145/362686.362692)  [[PDF mirror]](https://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf)  
- **Bloom filters in Parquet (predicate pushdown).** Apache Parquet docs. [[Docs]](https://parquet.apache.org/docs/file-format/bloomfilter/)  
- **Hierarchical Bloom filters.** Koloniari & Pitoura, WDAS 2003 (XML trees). [[PDF]](https://www.cs.uoi.gr/~pitoura/distribution/wdas03.pdf)  
- **Bloofi (hierarchical index over many Bloom filters).** Crainiceanu et al., CIKM 2013. [[DOI]](https://dl.acm.org/doi/10.1145/2501928.2501931)  [[slides]](https://eric.univ-lyon2.fr/cloud-i/eric.univ-lyon2.fr/cloud-i/wp-content/uploads/2013/08/4-Bloofi_CloudI_2013.pdf)

