
refactor(bonsai): English-verb DSL grammar + lite variant + persistent KV cache#170

Open
KailasMahavarkar wants to merge 3 commits into main from feat/bonsai-caveman-output

Conversation

KailasMahavarkar (Contributor) commented Apr 20, 2026

Summary

  • 100% coverage of the grammar.lark NL-addressable surface (94 rules) via English-keyword @-verbs (@upsert, @belief, @Remember, @snapshot, @checkpoint, @CRON_ADD, @EVOLVE_RULE, ...). Short-code abbreviations are removed from dispatch.
  • Two prompt variants ship with the package: full (all 94 verbs, ~1700 tokens) and lite (16 ingest+retrieval verbs, ~800 tokens). Lite mode cuts cold load from 19s -> 8s and avoids verb-picking confusion on conversational turns.
  • Auto n_ctx picks the smallest power-of-two that fits the loaded prompt + user-msg budget + output + headroom. Callers can still pin explicitly.
  • Persistent KV cache saves/loads to disk via kv_cache_path. Cold start goes 8s -> 0.4s (19x faster) across process restarts. Meta-guarded so skill / model / n_ctx changes invalidate the cache automatically.
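The meta guard on the persistent KV cache can be sketched as a small metadata check alongside the cache file; the file layout, field names, and helper names here are illustrative assumptions, not the actual implementation:

```python
import hashlib
import json
from pathlib import Path

def _cache_meta(skill_text: str, model_path: str, n_ctx: int) -> dict:
    # Hypothetical metadata: everything that must match for the cache to be reusable.
    return {
        "skill_sha": hashlib.sha256(skill_text.encode()).hexdigest(),
        "model": model_path,
        "n_ctx": n_ctx,
    }

def cache_is_valid(meta_path: Path, skill_text: str, model_path: str, n_ctx: int) -> bool:
    """True only if the saved metadata matches the current skill/model/n_ctx."""
    if not meta_path.exists():
        return False
    return json.loads(meta_path.read_text()) == _cache_meta(skill_text, model_path, n_ctx)

def write_cache_meta(meta_path: Path, skill_text: str, model_path: str, n_ctx: int) -> None:
    meta_path.write_text(json.dumps(_cache_meta(skill_text, model_path, n_ctx)))
```

Any change to the skill text, model path, or n_ctx makes the comparison fail, so the stale cache is simply rebuilt instead of being loaded into a mismatched context.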

What changed

  • Parser rewritten as a factory dispatch table (_h_slug / _h_topic / _h_walk / _h_pair / _h_query / _h_plain / _h_raw) plus targeted specials for complex verbs.
  • compact dual-mode removed; prompt-driven is the only mode. CompactTurn -> ParsedTurn, _parse_compact_output -> _parse_verb_output, _COMPACT_HANDLERS -> _VERB_HANDLERS, _DEFAULT_COMPACT_* symbols deleted.
  • Prompt files moved from tools/skills/graphstore-bonsai-dsl-compact/ into src/graphstore/ so they ship with the wheel. pyproject.toml package-data now includes *.txt.
  • Grammar bugs fixed:
    • @SNAPSHOT with no name auto-fills a UTC timestamp (grammar requires SNAPSHOT STRING).
    • @COMPACT now emits SYS OPTIMIZE COMPACT (bare SYS COMPACT isn't a real rule).

Performance (AMD 9700X, DDR5-5200, 4B TQ1_0)

  • Peak decode 27-30 tok/s (memory-bandwidth bound at ~810 MB weight read per token)
  • Per-call wall 0.3-2s; overall 15-20 tok/s
  • Cold load 8s (lite) / 19s (full)
  • Cold load with persistent KV cache: 0.4s (19x faster)
  • The @-prefix parser gate renders English drift, <think> leaks, and code fences inert at the parser level (no DSL corruption possible)
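The gate itself reduces to a one-pass line filter; a minimal sketch (the function name is hypothetical):

```python
def gated_lines(model_output: str) -> list[str]:
    """Dispatch only lines that start with '@'; prose, <think> blocks,
    and fence markers simply never reach the verb parser."""
    return [ln.strip() for ln in model_output.splitlines()
            if ln.strip().startswith("@")]
```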

Test plan

  • 89 unit tests pass (pytest tests/test_bonsai_ingestor.py)
  • 107 synthesized DSL templates parse clean vs grammar.lark
  • Live bench on 10 in-scope prompts: 10/10 correct on lite scope
  • Persistent KV cache demo measured 19x cold-start speedup
  • LoCoMo F1 rerun vs baseline (follow-up PR)
  • End-to-end smoke with live GraphStore round-trip (follow-up PR)

Generated with Claude Code.

Before: U/F/D compact verbs for the 3 ingest paths, with an `!` escape
hatch to pass any other DSL through verbatim. Model had to remember two
tokenizations and the escape never compressed.

After: one positional verb table covering the whole common DSL surface -
ingest (U/F/D), edges (E), retrieval (RM/SM/LX/AQ), walks (RL/TR/AN/SG),
sys/vault (SS/SC/SH/ST/SX/VS). Python expands each to the full DSL line.
Every path hits the same ~3-5x output-token reduction now, not just
ingest.
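The positional-verb idea can be sketched as a template table; the verbs and DSL templates below are simplified assumptions for illustration, not the real table:

```python
# Hypothetical slice of the verb table: compact verb -> DSL template whose
# slots are filled positionally from the rest of the line.
VERB_TABLE = {
    "RM": "QUERY REMEMBER {0}",
    "SM": "QUERY SIMILAR {0} LIMIT {1}",
    "SS": 'SYS SNAPSHOT "{0}"',
}

def expand(line: str) -> str:
    verb, *args = line.split()
    return VERB_TABLE[verb].format(*args)
```

The model only ever emits the short positional form; Python owns the full DSL rendering, which is where the ~3-5x output-token reduction comes from.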

Changes:
- Replace `_parse_compact_output` with a dispatch table of verb handlers
  built from small factories (_h_upsert, _h_fact, _h_drop, _h_edge,
  _h_query, _h_walk, _h_plain).
- Swap `CompactTurn.raw_dsl` for `CompactTurn.statements`: pre-rendered
  DSL lines for every non-ingest verb. `_synthesize_dsl` appends them
  verbatim after the message node + mention wiring + fact updates.
- SKILL.md rewritten to v5: 16 verbs documented, examples per common
  path, ~900 tokens.
- Unit tests: drop `!`-escape block, add per-verb coverage for edges,
  all 4 retrieval verbs, all 4 walks, all 6 sys/vault ops, and the
  aliased long forms (REMEMBER/SIMILAR/RECALL/TRAVERSE).

Test results: 78/78 bonsai unit tests pass; full suite 1880 pass,
101 skip (unchanged).
The NL->DSL ingestor now covers 100% of the grammar.lark NL-addressable
surface (94 rules) via English-keyword @-verbs. Short-code abbreviations
(@U/@F/@RM/etc.) are gone - every verb is a readable DSL keyword
(@upsert, @belief, @Remember, @snapshot, @checkpoint, @CRON_ADD,
@EVOLVE_RULE, ...). Full dispatch has ~100 entries including grammar
aliases (ASSERT->BELIEF, FORGET_NODE->FORGET, etc.).

Two prompt variants now ship with the package:
  - bonsai_dsl_prompt.txt: full 94-verb surface, ~1700 tokens, n_ctx=4096
  - bonsai_dsl_prompt_lite.txt: 16-verb ingest+retrieval subset,
    ~800 tokens, n_ctx=2048. Fewer competing verbs means the model picks
    correctly on conversational turns. Load time 19s -> 8s.

n_ctx auto-picks smallest power-of-two that fits the loaded prompt +
typical user-msg budget + max_output + headroom. Callers can still
pin n_ctx explicitly.
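The auto-pick reduces to doubling from a floor until the token budget fits; the default budgets and bounds below are illustrative assumptions, not the shipped values:

```python
def auto_n_ctx(prompt_tokens: int, user_budget: int = 512,
               max_output: int = 256, headroom: int = 128,
               floor: int = 512, ceiling: int = 8192) -> int:
    """Smallest power of two >= prompt + user-msg budget + output + headroom."""
    need = prompt_tokens + user_budget + max_output + headroom
    n = floor
    while n < need and n < ceiling:
        n *= 2
    return n
```

With these assumed budgets, an ~800-token lite prompt lands on 2048 and an ~1700-token full prompt on 4096, consistent with the two variants above.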

Parser rewritten as a factory dispatch table:
  - _h_slug / _h_topic / _h_walk / _h_pair / _h_query / _h_plain / _h_raw
  - Special handlers for update_node, merge, increment, propagate,
    describe, unregister, contradictions, cron_add, optimize, clear,
    wal, nodes, vault_triplet, snapshot (auto-timestamp fallback).
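The factory pattern can be sketched like this: each factory returns a closure for one argument shape, and the dispatch table maps verbs to closures. The templates and verbs here are illustrative, not the actual table:

```python
def _h_slug(template: str):
    # Factory for verbs that take a single slug argument.
    def handler(args: list[str]) -> str:
        return template.format(slug=args[0])
    return handler

def _h_pair(template: str):
    # Factory for verbs that take two positional arguments.
    def handler(args: list[str]) -> str:
        return template.format(a=args[0], b=args[1])
    return handler

_VERB_HANDLERS = {
    "FORGET": _h_slug("NODE FORGET {slug}"),
    "LINK": _h_pair("EDGE ADD {a} {b}"),
}

def parse_verb_line(line: str) -> str:
    verb, *args = line.lstrip("@").split()
    return _VERB_HANDLERS[verb.upper()](args)
```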

Compact/raw-DSL mode removed. Single prompt-driven mode. `compact`
kwarg and `_DEFAULT_COMPACT_*` symbols deleted. `CompactTurn` renamed
`ParsedTurn`, `_parse_compact_output` -> `_parse_verb_output`,
`_COMPACT_HANDLERS` -> `_VERB_HANDLERS`.

Grammar bugs fixed in this pass:
  - @snapshot without name auto-fills a UTC timestamp (SNAPSHOT STRING
    is required by grammar; bare @ss was emitting invalid DSL).
  - @compact rewritten to SYS OPTIMIZE COMPACT (SYS COMPACT isn't a
    real grammar rule; SYS OPTIMIZE COMPACT is).

Prompt file moved out of tools/skills/ into src/graphstore/ so it
ships with the wheel. pyproject package-data now includes *.txt.

Performance envelope measured on AMD 9700X / DDR5-5200:
  - Cold load: 8s (lite) / 19s (full)
  - Cold load with persistent kv_cache_path: 0.4s (19x faster)
  - Peak decode: 27-30 tok/s (memory-bandwidth bound at ~810 MB
    weight read per token for 4B TQ1_0)
  - Per-call wall: 0.3-2s; overall 15-20 tok/s

Tests: 89 unit tests pass; 107 synthesized DSL templates parse clean
against grammar.lark (verify_v6_templates.py check).
KailasMahavarkar changed the title from "refactor(bonsai): compact v5 unified 16-verb grammar" to "refactor(bonsai): English-verb DSL grammar + lite variant + persistent KV cache" on Apr 20, 2026.
@retract

LongMemEval smoke revealed two real drift patterns in the lite prompt:

1. Spurious @retract on unrelated turns: with KNOWN FACTS present, the
   model was emitting @retract + @belief even when the new turn was
   about a different topic entirely. The correction-flow example in
   the prompt was the attractor: model pattern-matched any new fact
   to a correction.

2. @recall misfire on personal-fact questions: "Which city did I move
   to last year?" emitted @recall location (wrong verb, bare anchor).
   Model thought the belief topic "location" was a valid walk anchor.

Prompt changes:

- VERB PICK RULE now distinguishes personal-fact questions ("Where
  did I ...?", "Which city did I ...?"), which route to @answer, from
  named-entity connection questions, which route to @recall.

- Added explicit rule: walk/path verbs (@recall, @traverse, @Ancestors,
  @descendants, @subgraph, @path, @SHORTEST_PATH, @common) REQUIRE a
  prefixed anchor id (ent:X / fact:X / msg:X). Bare topic names like
  "location" are not valid anchors.

- Added explicit rule: @retract only fires on correction trigger words
  ("actually", "not anymore", "changed to", "now prefer", "instead").
  Unrelated new turns must NOT emit @retract even if related beliefs
  are in KNOWN FACTS.

- New NEGATIVE example showing KNOWN FACTS [fact:location]="Seattle"
  plus unrelated turn "I bought a new guitar" -> only @belief
  purchase guitar. No retract.

- Two new @answer examples for personal-fact questions: "Which city
  did I move to last year?" and "What is my favorite color?".
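The anchor rule could also be enforced defensively on the Python side; a minimal sketch (the regex and helper name are assumptions, not shipped code):

```python
import re

# Walk/path verbs need a prefixed anchor id, never a bare topic name.
_ANCHOR_RE = re.compile(r"^(ent|fact|msg):\S+$")

def is_valid_anchor(token: str) -> bool:
    return bool(_ANCHOR_RE.match(token))
```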

Verified by re-running tools/scripts style LongMemEval smoke on the
fixture: both drift cases now produce correct ops (@BELIEF-only for
unrelated turns, @answer for personal-fact questions).

Tests: 89/89 pass, no unit-test deltas (pure prompt change).