
Improve CLI interaction flow#8

Open
meanaverage wants to merge 185 commits into gobbleyourdong:main from meanaverage:cli-updates

Conversation


meanaverage commented Apr 2, 2026

This PR improves the CLI interaction flow and makes the terminal UI feel much closer to a real shell while staying agent-first.

What’s included:

  1. stronger slash-command autocomplete
  2. clearer /project and /serve help text that matches actual command usage
  3. ↑/↓ navigation for suggestions and prompt history recall
  4. Tab and Enter completion behavior that is less confusing in practice
  5. cursor placement fixes after accepting completions
  6. file name and path completion for /attach and /unattach
  7. /unattach support for removing attached files
  8. /project del support
  9. Esc and /stop support for stopping active runs without killing the app
  10. 't' to toggle a more useful trace-tail view during active runs
  11. persistent CLI prompt history across app restarts
  12. Ctrl-C now clears any text in the prompt first; a second Ctrl-C is a true abort

Files changed:

cli/app.jsx
tsunami/server.py
Validation:

CLI rebundled successfully with esbuild after changes
python3 -m py_compile tsunami/server.py passed

gobbleyourdong and others added 30 commits March 31, 2026 18:18
Default: 9B wave + 2B eddies (auto-scale to leftover mem) + SD-Turbo
Lite: 2B only, no image gen (for 4GB systems)
SD-Turbo (2GB) now included in full mode memory budget.
README simplified to two modes. 13 scaling tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Smoke test validates full one-click experience:
  models respond → wave reasons → eddies parallel → image gen → agent e2e
7/8 pass (image gen needs diffusers in system python — now fixed in installer).
Setup.sh now installs diffusers+torch+accelerate alongside core deps.
One command, everything works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 9B wave built this page with zero human intervention:
- Pure HTML + Tailwind CSS CDN (no build step)
- Hero with wave background + typewriter CLI animation
- Stats bar, architecture diagram, features grid
- Install section with curl command
- 258 lines, 12.8KB, oceanic theme

4 LLM calls, 34K tokens, $0.00 (local model).
This page was built by the tool it describes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ints

- tool_choice: required (Ark: MUST respond with function calling)
- Format: NEVER use bullets for deliverables, paragraphs mandatory
- Search: description now says use 3 query variants, visit sources
- Code: stronger "save to file first" rule, no inline complex code
- Skills/waveforms removed — dead code, AGI doesn't need plugins
- Disclosure protection already existed (verified)
- 607 tests passing, live 9B verified with tool_choice: required

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Solves the monolith problem — instead of the wave writing one huge file,
it specs components, dispatches eddies to write each one, then assembles.
Eddies return code via done tool (no filesystem write needed).
Registered in bootstrap tools. 607 tests passing.

Also: Snake game built autonomously by the 9B wave (209 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tool_choice: required overflows the 9B server when combined with
20 tool schemas (2515 tokens) + system prompt (4000 tokens).
The prompt rule "MUST respond with exactly one tool call" enforces
the same behavior without crashing the server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swell: more elegant, more violent. When agents spawn, the swell rises.
tide.py → swell.py, tide_analyze → swell_analyze, tide_build → swell_build.

README rewritten from scratch — no jargon, no walls of tables.
One command install, what it does, how it works, what you need.
Written for people who want to use it, not read about it.

607 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
20 tools (2515 tokens) + system prompt overflowed 9B context on
complex prompts causing 500 Internal Server Errors. Moved
swell_analyze, swell_build, shell_view, plan_advance, file_append
to loadable toolbox. 15 tools (1829 tokens) fits comfortably.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Break: where the wave meets the shore. Results converge at the break.
Full oceanic naming: wave → swell → eddies → break → output.

9B now runs with 32K context (was 16K) — fixes 500 errors on
large file generation. Alphabet tracer building.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL FIX: Verify loop now in prompt as hard rule:
"NEVER call message_result on code you haven't verified.
Write → verify → fix → deliver."

Snake game: 3 iterations (broken) → 13 iterations (working) after
adding verify. This was the #1 reason all 4 apps were broken.

Also: agent loop nudges "save to files" every 5 iterations if
no file writes detected. Prevents context overflow on long tasks.
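The periodic "save to files" nudge could look something like the sketch below. This is illustrative only; the function name, message shape, and counter handling are assumptions, not the actual tsunami agent-loop API.

```python
# Hypothetical sketch of the "save to files" nudge described above.
NUDGE_INTERVAL = 5  # iterations between checks

def maybe_nudge(iteration: int, file_writes_since_nudge: int, messages: list) -> int:
    """Every NUDGE_INTERVAL iterations, remind the model to persist work
    to files if no file_write calls were observed in the window.
    Returns the (possibly reset) per-window write counter."""
    if iteration > 0 and iteration % NUDGE_INTERVAL == 0:
        if file_writes_since_nudge == 0:
            messages.append({
                "role": "user",
                "content": "Reminder: save your work to files instead of "
                           "keeping long outputs in context.",
            })
        return 0  # reset the per-window counter at each checkpoint
    return file_writes_since_nudge
```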

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Watcher caught 6 issues in alphabet tracer build that would have
shipped broken. Config updated: 9B wave, 8192 max tokens, watcher
on by default with 2B eddy at interval 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wave builds → drag tests → wave fixes → drag retests → ship.
Static analysis (bracket matching, canvas checks, Three.js checks)
+ headless Playwright (console errors, page errors, canvas dims).

Found pinball bug immediately: "Cannot access scoreDisplay before
initialization." Snake, alphabet tracer, node editor all pass.

Named "drag" — the undertow that pulls back what's not ready.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k, undertow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three layers enforce fact-checking:
1. Prompt: mandatory 4-step triangulation (hypothesis → search → cross-ref → deduce)
2. Watcher: 2B checks if message_result has unverified claims
3. Agent loop: triangulation gate blocks delivery of factual claims
   that were never search-verified. Injects warning, forces verification.

From Manus's methodology: parametric memory is unreliable for specifics.
External sources win over training data. Always.
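The triangulation gate in step 3 might be sketched as below. The heuristic for "factual-looking claim" (years, percentages, "according to") and the function name are assumptions made for illustration; the real gate may use different signals.

```python
# Illustrative sketch of a delivery gate that blocks unverified claims.
import re

# Assumed heuristic: years, percentages, or attribution phrases
FACTUAL_PATTERN = re.compile(r"\b\d{4}\b|\b\d+(?:\.\d+)?%|according to", re.I)

def gate_delivery(response: str, search_performed: bool):
    """Block message_result when the text contains factual-looking claims
    that were never search-verified; otherwise allow delivery."""
    if FACTUAL_PATTERN.search(response) and not search_performed:
        return (False, "Factual claims detected but no search performed. "
                       "Verify via external sources before delivering.")
    return (True, None)
```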

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Current: measures tension (0.0 grounded → 1.0 hallucinating)
  - Heuristic: red flags, quality anchors, text patterns
  - Model probe: 2B eddy evaluates factual reliability
  - Correction: iterative re-asking until tension drops

Circulation: routes based on tension reading
  - Low → deliver directly
  - Capability gap → force search
  - Truth gap → explain contradiction
  - Critical → refuse ("I don't know")
  - Post-tool validation: reject results that increase tension

Pressure: tracks tension over time
  - Escalates: calm → moderate → heavy → crushing
  - Forces search after 2 consecutive high readings
  - Forces refusal after 4 consecutive high readings

Validated against THEGAP.md: correctly flags unverified claims,
delivers well-sourced content, forces search on hallucinated stats.
27 new tests, 634 total, all green.
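The pressure escalation above could be realized roughly as follows. The class shape, the 0.6 "high tension" cutoff, and the action strings are all assumptions; only the 2-reading and 4-reading thresholds come from the commit message.

```python
# Minimal sketch of pressure escalation over consecutive tension readings.
HIGH_TENSION = 0.6  # assumed cutoff for a "high" reading

class Pressure:
    LEVELS = ["calm", "moderate", "heavy", "crushing"]

    def __init__(self):
        self.consecutive_high = 0

    def record(self, tension: float) -> str:
        """Record a tension reading and return the forced action, if any."""
        if tension >= HIGH_TENSION:
            self.consecutive_high += 1
        else:
            self.consecutive_high = 0
        if self.consecutive_high >= 4:
            return "refuse"        # forced refusal after 4 high readings
        if self.consecutive_high >= 2:
            return "force_search"  # forced search after 2 high readings
        return "deliver"

    @property
    def level(self) -> str:
        return self.LEVELS[min(self.consecutive_high, 3)]
```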

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces hacky keyword triangulation gate with real tension measurement.
Before delivery: measure_heuristic scores the response (0.0-1.0).
Circulation routes based on score:
  - deliver (grounded)
  - force search (elevated tension, no prior search)
  - refuse (critical tension — say "I don't know")
Pressure tracks tension across session, escalates over time.

The wave can't hallucinate past the current anymore.
634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Current/circulation/pressure now checks BOTH:
1. Tool choice: "is this the right tool?" (before execution)
2. Response: "is this grounded?" (before delivery)

Pressure tracks consecutive high-tension decisions:
  2+ → force search to ground the agent
  4+ → force message_ask for user guidance

Watcher (2B text reviewer) removed — replaced by the tension
system which is more rigorous and doesn't need an extra LLM call.

634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
duckduckgo_search renamed to ddgs — was returning 0 results.
Added arXiv API search for research queries (https, follow redirects).
Both tested and working.
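For the arXiv side, a query against the real export.arxiv.org Atom endpoint might be built as below; the wrapper function itself is illustrative, not the actual tsunami search tool.

```python
# Sketch of building an arXiv API search URL (Atom response format).
import urllib.parse

ARXIV_API = "https://export.arxiv.org/api/query"

def arxiv_query_url(query: str, max_results: int = 5) -> str:
    """Build a search URL for the arXiv Atom API. When fetching, follow
    redirects and send a User-Agent header to avoid 429s."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"
```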

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tide→swell throughout (tagline, footer, architecture)
- 607→634 tests
- Architecture cards: wave/eddies/swell/break/undertow (was flow/tide/whirlpool)
- Install URL: github raw (was hallucinated tsunami.ark.sh)
- Scaling table: real auto-scaling tiers (was hallucinated linear speedup)
- Eddies: up to 32 (was 4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The undertow tests code. This tests logic.

After the wave writes a synthesis with reasoning claims, an eddy
receives it as a hostile peer reviewer: "assume this is wrong,
find every logical error, scope error, unstated assumption."

If the reviewer finds flaws → objections injected back to wave,
delivery blocked until addressed.
If the reviewer finds no flaws → deliver.

Would have caught the tsunami_gap.md errors:
- "gap is narrower" (wrong — rescaling moves T³ to R³)
- "KNSS proven on R³" (wrong — only axisymmetric cases)
- Sobolev chain applied in wrong context (T³ vs R³ after rescaling)

Three quality gates now:
1. Current/tension — catches hallucinated facts
2. Undertow — catches broken code
3. Adversarial — catches broken reasoning

634 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Undertow completely rewritten as a lever-puller. The wave provides
what to test, the undertow just does it and reports facts. No diagnosis.

- Lever types: screenshot, press, click, read_text, console, wait
- Auto-generates levers from HTML (every ID, key binding, button)
- Eddy vision: screenshot description vs user intent comparison
- DOM + pixel diff for interaction levers (catches subtle changes)
- Visibility check before clicking (no more 30s timeouts)
- code_tension metric (lever fail ratio) feeds into pressure
- Research-before-building mandatory in prompt
- 500 retryable with backoff in model layer
- Delivery gate capped at 2 blocks (was infinite loop)
- Info loop detector tightened to 3/6 (was 5/10)
- arXiv User-Agent header (fixes 429s)
- Current measures prose tension only (undertow measures code)

Results: pinball went from black screen (0.62 tension) to
rendered 3D table (0.21 tension) in fewer iterations (13 vs 25).
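The tightened 3/6 info-loop detector mentioned above could be sketched like this; the class and method names are assumptions.

```python
# Hedged sketch: flag 3 identical calls within a sliding window of 6.
from collections import deque

class LoopDetector:
    def __init__(self, threshold: int = 3, window: int = 6):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # oldest calls fall off the window

    def record(self, tool: str, args_key: str) -> bool:
        """Return True when the same (tool, args) call appears `threshold`
        times in the window, signalling the agent is looping."""
        call = (tool, args_key)
        self.recent.append(call)
        return self.recent.count(call) >= self.threshold
```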

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root now has only run.py (main entry) and serve_diffusion.py (SD-Turbo).
All test harnesses, stress tests, and verification scripts in tests/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added sections for current/circulation/pressure (the tension system),
the undertow lever-puller architecture, and research-before-building.
Includes before/after results: black screen → rendered 3D pinball.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New search_type="code" hits GitHub API for repos and source code.
No auth needed (public API). Returns repos sorted by stars with
direct links to browse code. Wave reads real implementations
instead of hallucinating API calls.

Tested: "three.js pinball physics" → found pinball-xr (cannon-es),
Three.js forum discussions with working CCD physics examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gobbleyourdong and others added 28 commits April 2, 2026 09:36
Left: session/project list (click to switch, + New Session)
Center: chat/prompt area
Right: workspace with self-contained file tree + code + preview + terminal

Code view has its own sidebar showing project files.
Click a file to view. Tabs switch between Code/Preview/Terminal.
Sessions maintain separate message histories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o-create

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added overflow:hidden + min-height:0 to chat container and messages.
Prevents long message logs from pushing the input off screen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…plicates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
./workspace/deliverables/x, workspace/deliverables/x, deliverables/x
all resolve correctly now. Was breaking on ./workspace/ prefix causing
'Need to use relative path' errors in CLI builds.
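The normalization described above could be sketched as follows; the function name and the exact resolver in tsunami/server.py are assumptions.

```python
# Illustrative prefix normalization for deliverable paths.
WORKSPACE_PREFIXES = ("./workspace/", "workspace/")

def resolve_deliverable(path: str) -> str:
    """Map ./workspace/deliverables/x, workspace/deliverables/x, and
    deliverables/x to the same workspace-relative path."""
    for prefix in WORKSPACE_PREFIXES:
        if path.startswith(prefix):
            path = path[len(prefix):]
            break  # strip at most one prefix
    return path
```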

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agent loop:
- Remove max_iterations hard cap — while True, runs until task_complete
- Research gate — nudge search before first write on visual projects
- generate_image + vision_ground in bootstrap (19 tools)
- Vision grounding: VL model extracts element positions → writes layout.css
  with ratio-based CSS (aspect-ratio, percentages, resolution-independent)
- Auto-ground pipeline: generate_image → Qwen-VL → layout.css → agent uses it
- Mid-loop auto-wire: App.tsx wired when 2+ components exist, not just on exit
- Reference save: image search results auto-saved to reference.md
- Swell compile gate: vite build must pass at delivery for React projects
- Skip static HTML undertow for React projects (dev server handles it)
- Dedup loop detection: 3+ identical cached calls → nudge different approach
- Generate nudge every 12 iterations for visual projects
- DDG image search: curl-based fallback bypasses TLS fingerprint detection
- generate.py path fix: /workspace/... resolved relative to actual workspace

Prompt:
- RESEARCH FIRST mandatory, GENERATE ASSETS step, EXTRACT POSITIONS step
- COMPARE to reference, no iteration limit, iterate relentlessly

Infrastructure:
- Dockerfile + docker-compose.yml + docker-entrypoint.sh for containerized runs
- serve_daemon.py — persistent dev server like ComfyUI (:9876, auto-detects projects)
- cli.py: auto-serve latest deliverable after each task, fix manus→tsunami import

Windows:
- setup.ps1: VRAM detection (not RAM) for mode selection, matching setup.bat behavior
- Pre-built llama-server download (pinned b8628 from ggml-org) — no cmake needed
- CUDA DLL bundling alongside llama-server
- exit→return so shell doesn't close on error
- wave/eddy naming (removed queen/bee references)

Scaffolds:
- Fix PixiJS SpriteAnimator: BaseTexture removed in v8, use Texture.from()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Double-clicking the .ps1 opens a new window that closes when the
script finishes. Added ReadKey pause at end and before early returns
so users can read the output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
generate_image now runs SD-Turbo directly in the agent process.
Auto-downloads the 2GB model on first use via HuggingFace diffusers.
No separate server needed. 1 inference step, <1s on GPU.

- generate.py: _try_sd_turbo_local as primary backend, placeholder fallback
- serve_diffusion.py: rewritten for SD-Turbo (was Qwen-Image-2512/13GB)
- setup.sh: installs diffusers+torch+transformers+accelerate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 2B isn't "gimped" — it's the same model acting as wave on lite boxes.
All eddy endpoint references now read from config.eddy_endpoint (propagated
via TSUNAMI_EDDY_ENDPOINT env var). On lite mode, eddy_endpoint points at
the wave's port (:8090) — one server, one model, both roles.

- config.py: added eddy_endpoint field
- agent.py: propagates config.eddy_endpoint to env var at init
- Replaced all hardcoded :8092 / TSUNAMI_BEE_ENDPOINT across 10 files
- launcher.py: lite mode starts ONE server, sets TSUNAMI_EDDY_ENDPOINT=:8090
- docker-entrypoint.sh: same lite mode fix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When file_write sees useState/useEffect/useRef in a .tsx file without
a React import, auto-prepend the import. The 2B (lite mode wave)
consistently forgets this — builds pass but runtime crashes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Battle-tested on actual Windows hardware. Key fixes over previous version:
- tsu.ps1: model priority chain (27B → MoE → 8B/2B), vision mmproj
  auto-detect, Windows JSON escaping for --chat-template-kwargs,
  FastAPI backend on :3000, Node CLI with Python REPL fallback
- setup.ps1: encoding and escaping fixes from real Windows debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SD-Turbo goes to CPU on tight VRAM (~1min instead of <1s). Worth it
for the 9B wave quality over 2B. 8GB cards get the full stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tsu.ps1: hide llama-server internals from user (no model names,
  no localhost URLs, no "waiting 120s"). Just "Loading model...",
  "Starting up...", "Ready".
- requirements.txt: added fastapi, uvicorn, websockets, ddgs, pillow
  as required (were optional/missing, tsu.ps1 backend needs them)
- setup.sh: DEPS includes all core packages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- vision_ground removed from bootstrap (17 tools). The 2B calls it
  repeatedly when no VL model exists. Auto-ground in agent.py still
  works — it imports the tool directly when generate_image fires.
- message_info + message_result strip non-ASCII before printing.
  Windows cp1252 console crashes on emoji from the 2B.
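The strip itself is likely a one-liner along these lines (function name assumed):

```python
# Minimal sketch of the console-safety strip described above.
def ascii_safe(text: str) -> str:
    """Drop non-ASCII characters (emoji etc.) so output survives a
    Windows cp1252 console."""
    return text.encode("ascii", "ignore").decode("ascii")
```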

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both tsu (bash) and tsu.ps1 (Windows) silently check for updates
on every launch. Fetch, compare, pull if behind. No user action
needed. Offline gracefully ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
meanaverage (Author) commented:

Hello Mr. Gobbleyourdong,

Please review this PR before other commits that could create merge conflicts; the surface area here is minimal.
